Sunday, September 18, 2011

Associations Between Two Variables

Up till now, we've been analyzing a single set of quantitative data values, discussing center, shape, spread, and their measures. Sometimes two different sets of quantitative data values have something to do with each other, or so we suspect. This post (and a few that follow) deals with how to analyze two sets of data that appear to be related to each other.

We all know now that smoking causes lung cancer. This fact took years and lots of statistical work to establish. The first step was to see if there was a connection between smoking and lung cancer. This was done by charting how many people smoked and how many people got lung cancer. This is where we'll start, only we'll do it with more current data for purposes of illustration. Below is a table showing the incidence rates of smoking and lung cancer per 100,000 people, as tracked by the Centers for Disease Control and Prevention, from 1999 to 2007.
Because smoking explains lung cancer incidence and not the other way around, we call the smoking rate variable the explanatory variable. The lung cancer rate variable is called the response variable.

Looking over this table, we can see that smoking rates -- and lung cancer rates -- have both decreased during these 9 years. To illustrate how we deal with two variables at a time, let's continue with this data. The first thing to do is to plot the data on an xy-grid, one point per year. For 1999, for example, we plot the point (23.3, 93.5). Continuing with all the points, we get a scatterplot:
Note that the explanatory variable is on the x-axis, while the response variable is on the y-axis.
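As a rough sketch of how you might build such a scatterplot yourself, here is one way to do it in Python with matplotlib. Note that only the 1999 pair (23.3, 93.5) comes from this post; the remaining values are made-up placeholders for illustration, not the actual CDC figures.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display window needed
import matplotlib.pyplot as plt

# One (smoking rate, lung cancer rate) pair per year, 1999-2007, per 100,000
# people. Only the 1999 pair (23.3, 93.5) is quoted in the post; the rest
# are illustrative placeholders, NOT the actual CDC figures.
smoking_rate = [23.3, 23.2, 22.7, 22.4, 21.6, 20.9, 20.9, 20.8, 19.8]
cancer_rate = [93.5, 93.0, 92.1, 91.4, 90.2, 89.5, 88.7, 87.9, 86.8]

# Explanatory variable on the x-axis, response variable on the y-axis.
plt.scatter(smoking_rate, cancer_rate)
plt.xlabel("Smoking rate per 100,000 (explanatory variable)")
plt.ylabel("Lung cancer rate per 100,000 (response variable)")
plt.title("Smoking vs. lung cancer incidence, 1999-2007")
plt.savefig("smoking_scatterplot.png")
```

Swapping in the real table values is just a matter of replacing the two lists; the axis assignment (explanatory on x, response on y) is the part that matters.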

When you view a scatterplot, you are looking for a straight-line, or linear, trend. What I do is sketch the narrowest possible oval around the dots. The narrower the oval, the stronger the linear relationship between the variables. As you can see below, the relationship between smoking and lung cancer is quite strong because the oval is skinny:
If our oval had looked more like a circle, we would conclude that there is no relationship. A fatter oval that has a discernible upward or downward direction would indicate a weak association.

This upward-reaching oval also tells us that the association is positive; that is, that as one variable increases, so does the other one. A downward-reaching oval would indicate a negative association, which means that as one variable increases, the other decreases.

When describing an association between two quantitative variables, we address form, strength, and direction. The form is linear, strength is strong, and direction is positive. So we would say, "The association between smoking and lung cancer between 1999 and 2007 is strong, positive, and linear."
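If you'd like a single number that captures both strength and direction, the Pearson correlation coefficient r does exactly that: it runs from -1 to +1, with values near +1 meaning a strong positive linear association. A minimal sketch, again with placeholder data rather than the actual CDC values:

```python
import numpy as np

# Placeholder paired data (not the actual CDC values): the two variables
# decrease together over the years, which is a POSITIVE association.
smoking_rate = [23.3, 22.6, 21.6, 20.9, 19.8]
cancer_rate = [93.5, 92.2, 90.1, 88.6, 86.8]

# r near +1: strong positive linear trend (a skinny, upward-reaching oval);
# r near -1: strong negative trend; r near 0: no linear relationship.
r = np.corrcoef(smoking_rate, cancer_rate)[0, 1]
print(f"r = {r:.3f}")
```

For data that hugs a straight line the way this does, r comes out very close to +1, matching the "strong, positive, and linear" description above.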

Just so you know, there are other forms of association between two quantitative variables: quadratic (U-shaped), exponential, logarithmic, etc. But we limit ourselves for now to linear associations.

Another *really* important thing to remember is that just because there is a linear association between two quantitative variables, it doesn't necessarily mean that one variable causes the other. We know that in the case of smoking and lung cancer, there is a cause-effect relationship, but this was established by several controlled statistical experiments. Seeing the linear association was only the catalyst for further study...it wasn't the culmination!

In the next post, we'll continue with this same example to develop some of the finer points of analyzing associations between two quantitative variables. For now, here is a good term to know: Another way to say the end of that previous sentence is "...analyzing associations for bivariate data." Bivariate simply means "two variables."

Wednesday, August 10, 2011

Has the Stock Market Hit Bottom?

Oh, if only there were a nice statistical way to answer that question! In the August 10, 2011 issue of USA Today, the article at http://www.usatoday.com/money/markets/2011-08-09-has-market-hit-bottom_n.htm offered a lot of insights from experts, but the authors were wise enough to know that there's just no telling whether the stock market will rise or fall, or whether better days are yet to come.

For anyone who's not aware, the Dow Jones Industrial Average (DJIA) plummeted 535 points on Monday, following the bad news that Standard & Poor's had downgraded the US credit rating from AAA to AA+, along with probably a multitude of other events that stripped investors of their confidence. The following day, August 9, the DJIA gained back 435 points, bringing it closer to pre-plummet levels. At this writing (Wednesday, August 10, 2011 at 1:35 p.m.), the Dow is down about 357 points.

The insights offered by experts as to why the market is plummeting range from emotional sell-offs in light of the US credit rating downgrade, to the state of the overall world economy, to the lack of warm fuzzies in Congress, but no one is making the mistake of trying to predict the future. Nothing, not even statistics, can do that reliably.

Theoretically, even if somehow we were able to identify *all* of the factors (“variables” we call them) that make for fluctuations in the stock market, and then we found some realistic way to quantify each of the factors, we could come up with a [probably quite complicated] equation involving all of those variables that we could use to predict future behavior, but it would be just that – a prediction. You simply cannot use statistics to foretell the future. The best one can do is make a fact-based, educated guess.

Similarly, you cannot use the past to predict the future. Most of us fall into this trap pretty easily. For example, a lot of people moved their investments to Treasury-related securities, because Treasury issues have shown a steady upward trend without the volatility of riskier investments, like stocks. In fact, in the past 10 years, inflation-protected securities (TIPS) have outperformed the Dow, the Nasdaq, and the S&P 500 significantly…at about half the risk (as rated by Vanguard). So are you rushing to your favorite investment site to move your money? If you are, then you're using past performance to predict the future! Sometimes it works, but just as often it doesn't.

Has the stock market hit bottom? Read the article, and you be the judge. Just trying to set your expectations about what statistics can and cannot do!

Saturday, July 30, 2011

Connection or Causation?

I was reading an article the other day about a possible connection between mental illness and nose jobs, and thought, "Aha! A perfect possible example of crazy statistics!" I was disappointed on that count, but pleasantly surprised to find that the article showed valid statistical practices. So in this post I'll focus on two things:
  1. What valid statistical practices were used?
  2. How to avoid common pitfalls when reading an article that involves statistics.
First, you might want to quickly read the article:
http://well.blogs.nytimes.com/2011/07/27/some-nose-job-patients-may-have-mental-illness/
Go ahead; I'll wait.

There are enough bad statistical articles out there, which I'm sure I'll happen upon and grab for another post, but as I said, this article is not one of them. First, the author was careful to distinguish between people who have valid medical reasons to seek plastic surgery on their noses and those who appear to be negatively obsessed with a perfectly normal-looking (from a plastic surgeon's standpoint) nose.

This was a "controlled" study, as statisticians say. There was a rather large sample (266 patients) seeking nose jobs in Belgium, who filled out a diagnostic questionnaire intended to uncover the condition, called Body Dysmorphic Disorder (BDD). The author included all of the vital information -- number of people, location of the study, duration of the study, and a link to the full journal article so that the study could be reproduced if desired. Rather than worrying that a duplicate study will refute the findings of the original, most statisticians welcome "do-overs," because a repeat study can either strengthen their findings or point out something they might have overlooked. After all, from an ethical standpoint, most researchers are after the truth, so they try to make the specifics as transparent as possible.

Separating those patients who had a "valid" complaint about their nose (for instance, a breathing problem) from those who showed signs of BDD was an excellent example of controlling the study. If they hadn't done this and had gone on conducting the study anyway, they couldn't have separated the "valid" patients from the BDD patients later. By controlling the study, they were able to determine that only 2% of the "valid" patients showed evidence of BDD while 43% of the "invalid" patients did. If everyone had been lumped together, the results might have been diluted. Researchers need to think ahead of time about possible variables that could muddy the waters later. The "valid" versus "invalid" distinction was one of these. We call such variables lurking variables, because they can work under the surface, making it look like one thing (seeking a nose job) is causing the other (BDD)... or vice versa.

Another interesting and admirable thing the article did was anticipate questions the reader might have had; that is, it explained why nose jobs in particular, rather than plastic surgery in general, are notable with respect to this connection. The researchers seem to have thought things through very well.

Now, on to you as a reader of such articles. My only caution in this case is that you don't read any type of cause and effect situation into the article. Having BDD doesn't necessarily cause people to seek a nose job, although there seems to be a very strong connection between the two. Further study and repetition of the study would be needed to prove cause and effect. To its credit, the author of this article was careful not to lead you in the causation direction.

Simply put: Just because two variables appear to be connected doesn't prove that one variable causes another. For those readers who recall the "old days" when people smoked cigarettes and didn't know about the dangers of lung cancer, remember how it took years to get even the mildest cautionary note placed on cigarette packs? The first such caution cited a connection between cigarette smoking and lung cancer. After years of further controlled study, researchers were finally able to put stronger cautions on cigarette packs: that cigarette smoking causes lung cancer.

Moral of the Story: This was an example of an article in which the author was very careful not to imply cause and effect. Other articles aren't so clear. As you read articles that involve numbers and statistics, be aware of this and don't jump to causation conclusions!

Self-Test: Name a lurking variable in this statement: "The more firefighters sent to a fire, the more damage is done to the structure."

Answer: Size of the fire is the lurking variable. It influences both variables: the amount of damage done, and the number of firefighters sent to the scene.

(Credit: Stats: Modeling the World, by Bock, Velleman, and De Veaux. Thanks!)