How's it going so far?

Saturday, October 22, 2011

Judging the Fit of an LSRL

We have said in a previous post that the LSRL represents the best possible fitting line for the data association. But an LSRL, while best possible, might still not be very good. How do we tell?

Continuing with our Smoking Rate versus Lung Cancer Rate example that we have been discussing, we found that the correlation was 0.94 (excellent) and the LSRL equation was:
To judge the quality of our LSRL, we need to check two things: the residuals and R-squared. Let's take them one at a time.

Residuals

A residual represents, for each x-value, how different the actual y-value is from the predicted value, y-hat. In other words, a residual is the distance between an actual y-value for a given x and the predicted y-value from the regression equation. In the graph below, I have identified two residuals, in orange. There is one residual for each x-value in the dataset; I just chose these two to keep it simple.  Math-wise, a residual equals the actual y-value minus y-hat, the predicted y-value from the LSRL equation.


We find the residuals by plugging each x-value (smoking rates in our case) into the regression equation. That gives us a y-hat for each x. Then we subtract the y-hat value from the original y-value to get the residual. Here is a table of the residual calculations.


Now that we have our residuals, we graph them against the original x-values in a scatterplot. If a regression is good quality, this scatterplot will have no pattern and look completely random. Here is our residual plot:


While there aren't a lot of points in this plot, we can say that there is no discernable pattern, so we can proceed to our other criterion: R-squared.

R-Squared

When we square r, our correlation, we get a statistic called the Coefficient of Determination, or R-Squared. Ours is 0.94*0.94 = .8836. We read R-squared as a percent: about 88.4%.

In reality, other variables contribute to the lung cancer rate than just the smoking rate: heredity, exposure to pollution levels, and the like. R-squared tells us the percent that our explanatory variable, smoking rate, contributes to the association. We say that smoking rate accounts for 88.4% of the variation in lung cancer rate in the linear relationship. That means that other variables, known or unknown, account for the remaining 12%. Not bad at all.

Because our residuals plot looks random and our R-squared value is high, we can say our regression is of high quality and feel confident using the LSRL equation to predict lung cancer rates from smoking rates, within the boundaries of our data. That means all is well if we're dealing with smoking rates between 19.7 and 23.3 people per 100,000.

What happens if we get a pattern in our residuals plot or a low R-Squared value? First, let me point out that you can get one without the other, or you can get both. In either case, your data isn't as linear as it appears and it might not be appropriate to use linear regression unless you can restate your data in a way that makes it more linear. That will be the subject of a  very-near future post.

Saturday, October 15, 2011

Correlation and Linear Regression

When two quantitative variables have a linear relationship, we classify the association as strong, moderate, or weak. However, we can also quantify the strength and direction of a linear association between two quantitative variables. To do this, we find the correlation.

We'll continue with our example from the previous post: smoking rates versus lung cancer rates from 1999 through 2007. Below are the data:
We made a scatterplot and established that the path looked linear and very strongly positive:
There is, of course, a formula for finding the correlation between two variables. However, thanks to graphing calculators and other tools like MS-Excel, we can find the correlation much more easily. First, a bit more about correlation...

 -- Correlation is a number between -1 and +1. It is called "r."
 -- The sign of the correlation indicates whether the association is positive (upward from left
     to right) or negative (downward from left to right)
 -- A correlation of -1 or +1 indicates a perfect association; that is, the scatterplot points could
    be joined to form a straight line.
 -- A correlation of 0 means that there is absolutely no association between the variables.
 -- Correlation has no units associated with it -- it's simply a number.
 -- The order of the variables doesn't matter; their correlation is the same either way.
 -- If you perform an arithmetic operation on each of the data values in one or both sets, the
     value of the correlation is unchanged.

Below shows the MS-Excel way to find correlation. See the highlighted line.
0.94 (rounded) is nearly perfect.

Linear Regression

When an association is linear, we can find the equation of a line that best fits the numbers in the data sets. While there are many lines that look like they fit the data points well, there is only one "best" fitting line. This line is called the least squares regression line, or LSRL.
Like any line, the LSRL is in the form of y = a + bx, where "a" is the y-intercept and "b" is the slope. This line won't run through each point in the scatterplot, but be as close as possible to doing so. In fact, if we were to measure the distances between each of our points and the point on the regression line that has the same x-coordinates, the sum of these squared differences would be less than for any other line. That's what we mean by "best fit" -- what is meant by "least squares."

Rather than get hung up on the "least squares" idea right now, let's remember to come back to it later. Right now, let's find the LSRL equation for our smoking/lung cancer data. First, assuming your scatterplot is in MS-Excel, right-click on any point in your graph and select "Add Trendline" from the drop-down menu. You will see the following sub-menu.





You will see the following sub-menu. Be sure you select "Linear" and "Display equation":


When you close the above window, your trendline and equation will appear:

The LSRL equation, in y = a + bx format and with x and y replaced by their meanings, is:


The inverted "V" symbol over the response variable, LungCancerRate, is called a "hat" and means that the value is a prediction. That is, the regression line equation is a way of predicting a lung cancer rate from a given smoking rate. In fact, this is exactly why we're interested in regression equations -- to help us predict new values based on this best fitting line equation. But predicting comes with cautions. Because the data that produced this equation spans from 1999 to 2007, it is not prudent, nor is it reliable, to predict before or beyond these years -- only in between them. We cannot predict the future based on the past!

(Self-Test): Suppose that in your state, the smoking rate during one of these years was 23.75 per 100,000 people. What would you predict the lung cancer rate to be?
(Answer): You would plug 23.75, your x-value, into the LSRL equation and solve for the y-value...

y-hat = 22.042 + 3.0566*(23.75)
         = 22.042 + 72.594
         = 94.636
The lung cancer rate is predicted to be 94.636 per 100,000 people.

Interpreting the Y-intercept and Slope

It is often useful to interpret the slope and y-intercept in the context of the situation. For instance, the y-intercept is the value produced when x is zero. In our context, x stands for the smoking rate. So, we can say that when the smoking rate is zero -- that is, when a person doesn't smoke -- the lung cancer incidence was 22.042 per 100,000 during these years.


Now for the slope. Remember that the slope is a ratio capturing the change in the y-variable versus the change in x. In our example, that would be the change in lung cancer rates as smoking rates change. The slope in our LSRL equation is 3.0566, which we can think of as 3.0566 / 1. The change in lung cancer rates is the numerator; the change in smoking rate is the denominator. Interpreting this in the context of our situation, we can say that for each 1 in 100,000 that smokes, the lung cancer rate increased by 3.0566 per 100,000 during those years.