How's it going so far?

Saturday, October 15, 2011

Correlation and Linear Regression

When two quantitative variables have a linear relationship, we classify the association as strong, moderate, or weak. However, we can also quantify the strength and direction of a linear association between two quantitative variables. To do this, we find the correlation.

We'll continue with our example from the previous post: smoking rates versus lung cancer rates from 1999 through 2007. Below are the data:
We made a scatterplot and established that the pattern looked linear and very strongly positive:
There is, of course, a formula for finding the correlation between two variables. However, thanks to graphing calculators and other tools like MS-Excel, we can find the correlation much more easily. First, a bit more about correlation...

 -- Correlation is a number between -1 and +1. It is called "r."
 -- The sign of the correlation indicates whether the association is positive (upward from left
     to right) or negative (downward from left to right)
 -- A correlation of -1 or +1 indicates a perfect linear association; that is, the scatterplot points
    all fall exactly on a straight line.
 -- A correlation of 0 means that there is no linear association between the variables. The variables
    may still be related in some other way, such as along a curve.
 -- Correlation has no units associated with it -- it's simply a number.
 -- The order of the variables doesn't matter; their correlation is the same either way.
 -- If you add a constant to each of the data values in one or both sets, or multiply each value by
     a positive constant, the value of the correlation is unchanged. (Multiplying one set by a
     negative constant reverses the sign of r.)
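
To make these properties concrete, here is a minimal Python sketch. It is not the MS-Excel method used in this post, and the x and y values are made up purely for illustration (they are NOT the smoking/lung cancer data). The helper name `correlation` is my own.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative values, not the data from this post.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(x, y)

# Swapping the order of the variables gives the same r.
r_swapped = correlation(y, x)

# Multiplying by a positive constant and adding a constant leaves r unchanged.
r_rescaled = correlation([10 * v + 3 for v in x], y)

# Multiplying one variable by a negative constant flips the sign of r.
r_negated = correlation([-v for v in x], y)

# Points on a parabola: clearly related, yet the linear correlation is 0.
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r, r_swapped, r_rescaled, r_negated, r_curved)
```

The last case is worth a second look: r only measures how well a straight line describes the pattern, so a curved relationship can produce a correlation of 0.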

Below is the MS-Excel way to find the correlation. See the highlighted line.
A correlation of 0.94 (rounded) is very strong; it is close to the maximum possible value of +1.

Linear Regression

When an association is linear, we can find the equation of a line that best fits the numbers in the data sets. While there are many lines that look like they fit the data points well, there is only one "best" fitting line. This line is called the least squares regression line, or LSRL.
Like any line, the LSRL has the form y = a + bx, where "a" is the y-intercept and "b" is the slope. This line won't run through every point in the scatterplot, but it comes as close as possible to doing so. Specifically, if we measure the vertical distance between each of our points and the point on the regression line that has the same x-coordinate, square each of those distances, and add them up, the total is smaller than it would be for any other line. That's what we mean by "best fit," and it is where the name "least squares" comes from.
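
For readers who want to see the "least squares" idea in action, here is a small Python sketch. It uses the standard textbook formulas for the slope and intercept (an assumption on my part; the post itself relies on MS-Excel), and the data values are made up for illustration, not taken from the smoking/lung cancer table. The function names are my own.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # y-intercept
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Add up the squared vertical distances from each point to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative values, not the data from this post.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(x, y)
best = sum_squared_residuals(a, b, x, y)

# Any other line, such as one with a slightly different slope or intercept,
# gives a larger sum of squared distances.
nudged_slope = sum_squared_residuals(a, b + 0.1, x, y)
nudged_intercept = sum_squared_residuals(a + 0.1, b, x, y)

print(best < nudged_slope, best < nudged_intercept)
```

Nudging the slope or the intercept away from the least squares values always makes the sum of squares larger, which is exactly what makes the LSRL the single "best" line.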

Rather than get hung up on the "least squares" idea right now, let's come back to it later and find the LSRL equation for our smoking/lung cancer data. First, assuming your scatterplot is in MS-Excel, right-click on any point in your graph and select "Add Trendline" from the drop-down menu.





You will see the following sub-menu. Be sure you select "Linear" and "Display equation":


When you close the above window, your trendline and equation will appear:

The LSRL equation, in y = a + bx format and with x and y replaced by their meanings, is:

LungCancerRate-hat = 22.042 + 3.0566*(SmokingRate)


The inverted "V" symbol over the response variable, LungCancerRate, is called a "hat" (typed as "-hat" in plain text) and means that the value is a prediction. That is, the regression line equation is a way of predicting a lung cancer rate from a given smoking rate. In fact, this is exactly why we're interested in regression equations -- to help us predict new values based on this best-fitting line. But predicting comes with cautions. Because the data that produced this equation span 1999 through 2007, the equation is only trustworthy for smoking rates within the range observed during those years. It is not prudent to predict for smoking rates outside that range, or for years before or after it. We cannot predict the future based on the past!

(Self-Test): Suppose that in your state, the smoking rate during one of these years was 23.75 per 100,000 people. What would you predict the lung cancer rate to be?
(Answer): You would plug 23.75, your x-value, into the LSRL equation and solve for the y-value...

y-hat = 22.042 + 3.0566*(23.75)
         = 22.042 + 72.594
         = 94.636
The lung cancer rate is predicted to be 94.636 per 100,000 people.
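
The same arithmetic can be checked with a few lines of Python. This is only a sketch: the intercept and slope are the ones from the equation above, while the constant and function names are my own.

```python
# Coefficients from the LSRL equation in this post.
INTERCEPT = 22.042
SLOPE = 3.0566

def predict_lung_cancer_rate(smoking_rate):
    """Predicted lung cancer rate (per 100,000) for a given smoking rate."""
    return INTERCEPT + SLOPE * smoking_rate

prediction = predict_lung_cancer_rate(23.75)
print(round(prediction, 3))  # 94.636
```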

Interpreting the Y-intercept and Slope

It is often useful to interpret the slope and y-intercept in the context of the situation. The y-intercept is the value the equation produces when x is zero. In our context, x stands for the smoking rate. So, we can say that when the smoking rate is zero -- that is, when nobody in the population smokes -- the predicted lung cancer rate is 22.042 per 100,000. Keep in mind that if a smoking rate of zero lies outside the range of the data, this value is an extrapolation and deserves the same caution as any prediction beyond the data.


Now for the slope. Remember that the slope is a ratio: the change in the y-variable divided by the change in the x-variable. In our example, that is the change in the lung cancer rate as the smoking rate changes. The slope in our LSRL equation is 3.0566, which we can think of as 3.0566 / 1. The change in the lung cancer rate is the numerator; the change in the smoking rate is the denominator. Interpreting this in context, we can say that for each increase of 1 per 100,000 in the smoking rate, the predicted lung cancer rate increased by 3.0566 per 100,000, on average, during those years.
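
Both interpretations can be read straight off the equation, as this short Python sketch suggests. The coefficients come from the LSRL equation above; the function name and the sample smoking rate of 20 are hypothetical choices of mine.

```python
# Coefficients from the LSRL equation in this post.
INTERCEPT = 22.042
SLOPE = 3.0566

def predict(smoking_rate):
    """Predicted lung cancer rate (per 100,000) from the LSRL equation."""
    return INTERCEPT + SLOPE * smoking_rate

# The y-intercept is the prediction when the smoking rate is zero.
at_zero = predict(0)

# The slope is the change in the prediction when the smoking rate rises by 1.
# The starting rate of 20 is arbitrary; any starting value gives the same change.
change_per_unit = predict(21) - predict(20)

print(at_zero, round(change_per_unit, 4))
```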
