Sunday, September 18, 2011

Associations Between Two Variables

Up till now, we've been analyzing one set of quantitative data values, discussing center, shape, spread, and their measures. Sometimes two different sets of quantitative data values have something to do with each other, or so we suspect. This post (and a few of those that follow), deal with how to analyze two sets of data that appear to be related to each other.

We all know now that smoking causes lung cancer. This fact took years and lots of statistical work to establish. The first step was to see if there was a connection between smoking and lung cancer. This was done by charting how many people smoked and how many people got lung cancer. This is where we'll start, only we'll do it with more current data for purposes of illustration. Below is a table showing the incidence rates of smoking and lung cancer per 100,000 people, as tracked by the Centers for Disease Control and Prevention, from 1999 - 2007.
Because smoking explains lung cancer incidence and not the other way around, we call the smoking rate variable the explanatory variable. The lung cancer rate variable is called the response variable.

Looking over this table, we can see that smoking rates -- and lung cancer rates -- have both decreased during these 9 years. To illustrate how we deal with two variables at a time, let's continue with this data. The first thing to do is to plot the data on an xy-grid, one point per year. For 1999, for example, we plot the point (23.3, 93.5). Continuing with all the points, we get a scatterplot:
Note that the explanatory variable is on the x-axis, while the response variable is on the y-axis.

When you view a scatterplot, you are looking for a straight-line, or linear, trend. What I do is sketch the narrowest possible oval around the dots. The narrower the oval, the stronger the linear relationship between the variables. As you can see below, the relationship between smoking and lung cancer is quite strong because the oval is skinny:
If our oval had looked more like a circle, we would conclude that there is no relationship. A fatter oval that has a discernable upward or downward direction would indicate a weak association.

This upward-reaching oval also tells us that the association is positive; that is, that as one variable increases, so does the other one. A downward-reaching oval would indicate a negative association, which means that as one variable increases, the other decreases.

When describing an association between two quantitative variables, we address form, strength, and direction. The form is linear, strength is strong, and direction is positive. So we would say, "The association between smoking and lung cancer between 1999 and 2007 is strong, positive, and linear."

Just so you know, there are other forms of association between two quantitative variables: quadratic (U-shaped), exponential, logarithmic, etc. But we limit ourselves for now to linear associations.

Another *really* important thing to remember is that just because there is a linear association between two quantitative variables, it doesn't necessarily mean that one variable causes the other. We know that in the case of smoking and lung cancer, there is a cause-effect relationship, but this was established by several controlled statistical experiments. Seeing the linear association was only the catalyst for further study...it wasn't the culmination!

In the next post, we'll continue with this same example to develop some of the finer points of analyzing associations between two quantitative variables. For now, here is a good term to know: Another way to say the end of that previous sentence is "...analyzing associations for bivariate data." Bivariate simply means "two variables."