## Saturday, October 22, 2011

### Judging the Fit of an LSRL

We said in a previous post that the LSRL is the best-fitting line for a data association. But even the best possible line might still not fit very well. How do we tell?

Continuing with the Smoking Rate versus Lung Cancer Rate example we have been discussing, we found that the correlation was 0.94 (excellent) and computed the LSRL equation.

To judge the quality of our LSRL, we need to check two things: the residuals and R-squared. Let's take them one at a time.

#### Residuals

A residual represents, for each x-value, how different the actual y-value is from the predicted value, y-hat. In other words, a residual is the signed vertical distance between an actual y-value for a given x and the y-value predicted by the regression equation. In the graph below, I have identified two residuals, in orange. There is one residual for each x-value in the dataset; I just chose these two to keep it simple. Math-wise, a residual equals the actual y-value minus y-hat, the predicted y-value from the LSRL equation.

We find the residuals by plugging each x-value (smoking rates in our case) into the regression equation. That gives us a y-hat for each x. Then we subtract the y-hat value from the original y-value to get the residual. Here is a table of the residual calculations.
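As a quick sketch of that calculation, here is what the residual table might look like in Python. The slope, intercept, and data values below are hypothetical placeholders, not the actual figures from our example:

```python
# Hypothetical LSRL coefficients and data -- placeholders, not the
# actual values from the smoking/lung-cancer example.
slope, intercept = 2.5, -10.0
x_values = [19.7, 21.0, 22.4, 23.3]   # smoking rates (hypothetical)
y_values = [39.5, 42.0, 47.0, 48.0]   # lung cancer rates (hypothetical)

residuals = []
for x, y in zip(x_values, y_values):
    y_hat = slope * x + intercept     # plug x into the regression equation
    residuals.append(y - y_hat)       # residual = actual y minus y-hat

for x, r in zip(x_values, residuals):
    print(f"x = {x}: residual = {r:.2f}")
```

Each pass through the loop does exactly what the paragraph above describes: plug in x, get y-hat, subtract.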

Now that we have our residuals, we graph them against the original x-values in a scatterplot. If a regression is good quality, this scatterplot will have no pattern and look completely random. Here is our residual plot:

While there aren't a lot of points in this plot, we can say that there is no discernible pattern, so we can proceed to our other criterion: R-squared.

#### R-Squared

When we square r, our correlation, we get a statistic called the Coefficient of Determination, or R-squared. Ours is 0.94 × 0.94 = 0.8836. We read R-squared as a percent: about 88.4%.

In reality, variables other than smoking rate contribute to the lung cancer rate: heredity, exposure to pollution, and the like. R-squared tells us the percent of the variation in the response variable that our explanatory variable, smoking rate, accounts for. We say that smoking rate accounts for 88.4% of the variation in lung cancer rate in the linear relationship. That means that other variables, known or unknown, account for the remaining 11.6%. Not bad at all.

Because our residuals plot looks random and our R-squared value is high, we can say our regression is of high quality and feel confident using the LSRL equation to predict lung cancer rates from smoking rates, within the boundaries of our data. That means all is well if we're dealing with smoking rates between 19.7 and 23.3 people per 100,000.
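To make that "within the boundaries of our data" caveat concrete, a prediction helper can refuse to extrapolate. The coefficients here are hypothetical placeholders; only the x-range of 19.7 to 23.3 comes from the post:

```python
# Placeholder LSRL coefficients -- not the actual fitted values.
SLOPE, INTERCEPT = 2.5, -10.0
# Range of observed smoking rates, from the post.
X_MIN, X_MAX = 19.7, 23.3

def predict_cancer_rate(smoking_rate):
    """Return the LSRL prediction, refusing to extrapolate."""
    if not (X_MIN <= smoking_rate <= X_MAX):
        raise ValueError("smoking rate outside the observed range; "
                         "the LSRL may not apply there")
    return SLOPE * smoking_rate + INTERCEPT

print(predict_cancer_rate(21.0))   # a smoking rate inside the data range
```

Asking for a prediction outside the observed range raises an error instead of silently extrapolating, which is exactly the discipline the paragraph above recommends.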

What happens if we get a pattern in our residuals plot or a low R-squared value? First, let me point out that you can get one without the other, or you can get both. In any case, your data isn't as linear as it appears, and it might not be appropriate to use linear regression unless you can re-express your data in a way that makes it more linear. That will be the subject of a very-near-future post.