Statistics Without Tears

Judging the Fit of an LSRL

2011-10-22T10:53:00.000-07:00

We have said in a previous post that the LSRL represents the best possible fitting line for the data association. But an LSRL, while best possible, might still not be very good. How do we tell?

Continuing with the Smoking Rate versus Lung Cancer Rate example we have been discussing, we found that the correlation was 0.94 (excellent) and the LSRL equation was:

LungCancerRate-hat = 22.042 + 3.0566 * SmokingRate

To judge the quality of our LSRL, we need to check two things: the residuals and R-squared. Let's take them one at a time.

Residuals

A residual represents, for each x-value, how different the actual y-value is from the predicted value, y-hat. In other words, a residual is the distance between an actual y-value for a given x and the predicted y-value from the regression equation. In the graph below, I have identified two residuals, in orange. There is one residual for each x-value in the dataset; I just chose these two to keep it simple. Math-wise, a residual equals the actual y-value minus y-hat, the predicted y-value from the LSRL equation.

We find the residuals by plugging each x-value (smoking rates in our case) into the regression equation. That gives us a y-hat for each x. Then we subtract the y-hat value from the original y-value to get the residual. Here is a table of the residual calculations.
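For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of one residual calculation. It assumes the LSRL coefficients quoted elsewhere in this series (22.042 and 3.0566) and uses the 1999 data point (23.3, 93.5) mentioned in the scatterplot post; the function names are made up for illustration.

```python
# LSRL coefficients quoted in this series: y-hat = 22.042 + 3.0566 * x
INTERCEPT = 22.042
SLOPE = 3.0566


def predict(smoking_rate):
    """Predicted lung cancer rate (y-hat) for a given smoking rate."""
    return INTERCEPT + SLOPE * smoking_rate


def residual(smoking_rate, lung_cancer_rate):
    """Residual = actual y-value minus predicted y-value (y-hat)."""
    return lung_cancer_rate - predict(smoking_rate)


# The 1999 point from the scatterplot post: smoking rate 23.3, lung cancer rate 93.5
y_hat_1999 = predict(23.3)
res_1999 = residual(23.3, 93.5)
print(f"y-hat = {y_hat_1999:.3f}, residual = {res_1999:.3f}")
```

A positive residual like this one means the actual point sits slightly above the regression line; a negative residual would mean it sits below.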

Now that we have our residuals, we graph them against the original x-values in a scatterplot. If a regression is good quality, this scatterplot will have no pattern and look completely random. Here is our residual plot:

While there aren't a lot of points in this plot, we can say that there is no discernable pattern, so we can proceed to our other criterion: R-squared.

R-Squared

When we square r, our correlation, we get a statistic called the Coefficient of Determination, or R-Squared. Ours is 0.94*0.94 = .8836. We read R-squared as a percent: about 88.4%.

In reality, variables other than the smoking rate contribute to the lung cancer rate: heredity, exposure to pollution, and the like. R-squared tells us what percent of the variation in the response variable is accounted for by its linear relationship with our explanatory variable, smoking rate. We say that smoking rate accounts for 88.4% of the variation in lung cancer rate in the linear relationship. That means that other variables, known or unknown, account for the remaining 11.6%. Not bad at all.
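To see why squaring r gives a "percent of variation accounted for," here is a rough Python sketch. The data values are hypothetical (they are not the CDC numbers), and the variable names are my own. It computes r, fits the least squares line, and then checks that r*r matches 1 minus (leftover variation around the line) / (total variation in y).

```python
from math import sqrt

# Hypothetical (x, y) pairs for illustration only; NOT the CDC data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)   # variation in x
syy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / sqrt(sxx * syy)   # correlation
b = sxy / sxx               # LSRL slope
a = y_bar - b * x_bar       # LSRL intercept

# Variation left over after using the line: the sum of squared residuals
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_resid / syy

print(f"r = {r:.4f}, r*r = {r * r:.4f}, 1 - SSresid/SStotal = {r_squared:.4f}")
```

The last two printed numbers should agree, which is the sense in which R-squared measures how much of the variation in y the line accounts for.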

Because our residuals plot looks random and our R-squared value is high, we can say our regression is of high quality and feel confident using the LSRL equation to predict lung cancer rates from smoking rates, within the boundaries of our data. That means all is well if we're dealing with smoking rates between 19.7 and 23.3 people per 100,000.

What happens if we get a pattern in our residuals plot or a low R-Squared value? First, let me point out that you can get one without the other, or you can get both. In either case, your data isn't as linear as it appears and it might not be appropriate to use linear regression unless you can restate your data in a way that makes it more linear. That will be the subject of a very-near future post.

Correlation and Linear Regression

2011-10-15T11:32:00.000-07:00

When two quantitative variables have a linear relationship, we classify the association as strong, moderate, or weak. However, we can also quantify the strength and direction of a linear association between two quantitative variables. To do this, we find the correlation.

We'll continue with our example from the previous post: smoking rates versus lung cancer rates from 1999 through 2007. Below are the data:

We made a scatterplot and established that the path looked linear and very strongly positive:

There is, of course, a formula for finding the correlation between two variables. However, thanks to graphing calculators and other tools like MS-Excel, we can find the correlation much more easily. First, a bit more about correlation...

-- Correlation is a number between -1 and +1. It is called "r."
-- The sign of the correlation indicates whether the association is positive (upward from left
     to right) or negative (downward from left to right).
-- A correlation of -1 or +1 indicates a perfect linear association; that is, the scatterplot points
    could be joined to form a straight line.
-- A correlation of 0 means that there is no linear association between the variables. (They could
    still be related in some non-linear way.)
-- Correlation has no units associated with it -- it's simply a number.
-- The order of the variables doesn't matter; their correlation is the same either way.
-- If you add the same number to each of the data values in one or both sets, or multiply each
     value by the same positive number, the value of the correlation is unchanged. (Multiplying
     one set by a negative number flips the sign of the correlation.)
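A few of these properties are easy to check numerically. The sketch below is only an illustration: the data values are hypothetical, and the correlation function is a plain hand-written version of the usual formula rather than anything from MS-Excel.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r of two equal-length lists of numbers."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.5, 5.0, 8.0, 9.5]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                          # order of variables swapped
r_rescaled = correlation([10 * x + 3 for x in xs], ys)   # x multiplied by 10, then shifted by 3
r_flipped = correlation([-x for x in xs], ys)            # x multiplied by -1

print(f"r = {r:.4f}")
print(f"swapped order: {r_swapped:.4f}")
print(f"rescaled x:    {r_rescaled:.4f}")
print(f"negated x:     {r_flipped:.4f}")
```

The first three values should match, while the last should be the same size with the opposite sign.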

The MS-Excel way to find the correlation is shown below. See the highlighted line.

0.94 (rounded) is nearly perfect.

Linear Regression

When an association is linear, we can find the equation of a line that best fits the numbers in the data sets. While there are many lines that look like they fit the data points well, there is only one "best" fitting line. This line is called the least squares regression line, or LSRL.
Like any line, the LSRL is in the form of y = a + bx, where "a" is the y-intercept and "b" is the slope. This line won't run through each point in the scatterplot, but it will be as close as possible to doing so. In fact, if we were to measure the vertical distance between each of our points and the point on the regression line that has the same x-coordinate, then square those distances and add them up, the sum would be smaller than for any other line. That's what we mean by "best fit" -- and what is meant by "least squares."
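Here is a small Python sketch of the "least squares" idea, using hypothetical data and the standard textbook formulas for the slope and intercept (the function names are my own). It fits the line, adds up the squared vertical distances, and then checks that nudging the intercept or slope a little only makes that sum bigger.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def sum_squared_residuals(a, b, xs, ys):
    """Add up the squared vertical distances between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.5, 5.0, 8.0, 9.5]

a, b = fit_lsrl(xs, ys)
best = sum_squared_residuals(a, b, xs, ys)

# Try a few slightly different lines: shift the intercept or tilt the slope
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]
others = [sum_squared_residuals(a + da, b + db, xs, ys) for da, db in nudges]

print(f"LSRL: y = {a:.3f} + {b:.3f}x, sum of squared residuals = {best:.3f}")
print("Nudged lines:", [round(s, 3) for s in others])
```

Every nudged line should report a larger sum than the LSRL does.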

Rather than get hung up on the "least squares" idea right now, let's remember to come back to it later. Right now, let's find the LSRL equation for our smoking/lung cancer data. First, assuming your scatterplot is in MS-Excel, right-click on any point in your graph and select "Add Trendline" from the drop-down menu.

You will see the following sub-menu. Be sure you select "Linear" and "Display equation":

When you close the above window, your trendline and equation will appear:

The LSRL equation, in y = a + bx format and with x and y replaced by their meanings, is:

LungCancerRate-hat = 22.042 + 3.0566 * SmokingRate

The inverted "V" symbol over the response variable, LungCancerRate, is called a "hat" and means that the value is a prediction. That is, the regression line equation is a way of predicting a lung cancer rate from a given smoking rate. In fact, this is exactly why we're interested in regression equations -- to help us predict new values based on this best fitting line equation. But predicting comes with cautions. Because the data that produced this equation spans from 1999 to 2007, it is not prudent, nor is it reliable, to predict before or beyond these years -- only in between them. We cannot predict the future based on the past!

(Self-Test): Suppose that in your state, the smoking rate during one of these years was 23.75 per 100,000 people. What would you predict the lung cancer rate to be?

(Answer): You would plug 23.75, your x-value, into the LSRL equation and solve for the y-value...

y-hat = 22.042 + 3.0566*(23.75)

= 22.042 + 72.594

= 94.636

The lung cancer rate is predicted to be 94.636 per 100,000 people.

Interpreting the Y-intercept and Slope

It is often useful to interpret the slope and y-intercept in the context of the situation. For instance, the y-intercept is the value produced when x is zero. In our context, x stands for the smoking rate. So, we can say that when the smoking rate is zero -- that is, when nobody in the population smokes -- the predicted lung cancer incidence is 22.042 per 100,000, based on the data from these years. (A smoking rate of zero is far outside the range of our data, so treat this as the line's starting value rather than a reliable prediction.)

Now for the slope. Remember that the slope is a ratio capturing the change in the y-variable versus the change in x. In our example, that would be the change in lung cancer rates as smoking rates change. The slope in our LSRL equation is 3.0566, which we can think of as 3.0566 / 1. The change in lung cancer rate is the numerator; the change in smoking rate is the denominator. Interpreting this in the context of our situation, we can say that for each additional 1 per 100,000 in the smoking rate, the predicted lung cancer rate goes up by 3.0566 per 100,000, based on the data from those years.
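The prediction and the two interpretations can be checked with a few lines of Python. This is only a sketch; the constant and function names are made up for illustration, and the coefficients are the ones from the LSRL equation above.

```python
# Coefficients from the LSRL equation in this post
INTERCEPT = 22.042   # "a": predicted lung cancer rate when the smoking rate is 0
SLOPE = 3.0566       # "b": change in predicted lung cancer rate per 1-unit change in smoking rate


def predict_lung_cancer_rate(smoking_rate):
    """y-hat = a + b*x"""
    return INTERCEPT + SLOPE * smoking_rate


self_test = predict_lung_cancer_rate(23.75)
at_zero = predict_lung_cancer_rate(0)
one_unit_change = predict_lung_cancer_rate(21) - predict_lung_cancer_rate(20)

print(f"Self-test prediction: {self_test:.3f}")
print(f"Prediction at x = 0 (the y-intercept): {at_zero:.3f}")
print(f"Change for a 1-unit increase in x (the slope): {one_unit_change:.4f}")
```

The first line reproduces the self-test answer, and the other two simply read the intercept and slope back out of the equation.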

Associations Between Two Variables

2011-09-18T08:17:00.000-07:00

Up till now, we've been analyzing one set of quantitative data values, discussing center, shape, spread, and their measures. Sometimes two different sets of quantitative data values have something to do with each other, or so we suspect. This post (and a few of those that follow) deals with how to analyze two sets of data that appear to be related to each other.

We all know now that smoking causes lung cancer. This fact took years and lots of statistical work to establish. The first step was to see if there was a connection between smoking and lung cancer. This was done by charting how many people smoked and how many people got lung cancer. This is where we'll start, only we'll do it with more current data for purposes of illustration. Below is a table showing the incidence rates of smoking and lung cancer per 100,000 people, as tracked by the Centers for Disease Control and Prevention, from 1999 - 2007.

Because smoking explains lung cancer incidence and not the other way around, we call the smoking rate variable the explanatory variable. The lung cancer rate variable is called the response variable.

Looking over this table, we can see that smoking rates -- and lung cancer rates -- have both decreased during these 9 years. To illustrate how we deal with two variables at a time, let's continue with this data. The first thing to do is to plot the data on an xy-grid, one point per year. For 1999, for example, we plot the point (23.3, 93.5). Continuing with all the points, we get a scatterplot:

Note that the explanatory variable is on the x-axis, while the response variable is on the y-axis.

When you view a scatterplot, you are looking for a straight-line, or linear, trend. What I do is sketch the narrowest possible oval around the dots. The narrower the oval, the stronger the linear relationship between the variables. As you can see below, the relationship between smoking and lung cancer is quite strong because the oval is skinny:

If our oval had looked more like a circle, we would conclude that there is no relationship. A fatter oval that has a discernable upward or downward direction would indicate a weak association.

This upward-reaching oval also tells us that the association is positive; that is, that as one variable increases, so does the other one. A downward-reaching oval would indicate a negative association, which means that as one variable increases, the other decreases.

When describing an association between two quantitative variables, we address form, strength, and direction. The form is linear, strength is strong, and direction is positive. So we would say, "The association between smoking and lung cancer between 1999 and 2007 is strong, positive, and linear."

Just so you know, there are other forms of association between two quantitative variables: quadratic (U-shaped), exponential, logarithmic, etc. But we limit ourselves for now to linear associations.

Another *really* important thing to remember is that just because there is a linear association between two quantitative variables, it doesn't necessarily mean that one variable causes the other. We know that in the case of smoking and lung cancer, there is a cause-effect relationship, but this was established by many further, carefully designed studies. Seeing the linear association was only the catalyst for further study...it wasn't the culmination!

In the next post, we'll continue with this same example to develop some of the finer points of analyzing associations between two quantitative variables. For now, here is a good term to know: Another way to say the end of that previous sentence is "...analyzing associations for bivariate data." Bivariate simply means "two variables."

The Standard Normal Model: Standardizing Scores

2011-09-05T07:08:00.000-07:00

For the purposes of this post, we will refer to the data values in a data set as "scores."

In the last post, we used an example N(14, 2) to illustrate the 68-95-99.7 Rule, which stands for the various percents of scores lying within 1, 2, and 3 standard deviations from the mean. We can generalize the diagram we used to represent N(0, 1), where 0 is the mean and 1 is the standard deviation. This makes the model easier to apply, because the units we're most accustomed to seeing -- -1, 0, 1, 2, and so on -- appear as standard deviation units. Take a look:

We call N(0,1) the Standard Normal model. We now can use the number line to locate points that are any number of standard deviations from the mean...even fractional numbers.

In any Normal model, we're going to want to see how many standard deviations a particular score in the data set is from its mean. We can do this for any score, and it has to do with converting a "raw" score (a score from our data) to a "standardized" score (a score from the Standard Normal model). How do we do this? There's a formula, and it's really an easy one:

z = (X - µ) / σ

...where
X stands for the score you're trying to convert,
µ stands for the mean, and
σ stands for the standard deviation.

Suppose we're back in the Normal model N(14, 2) and we want to see how many standard deviations a score of 15 is from the mean. We would subtract our mean from 15, then divide by the standard deviation, 2. That is: (15 - 14) / 2 = 0.5. This means that our score of 15 is 0.5 standard deviations above the mean. 0.5 is the score on the Standard Normal model that represents our score from N(14, 2). We call 0.5 our standardized score, also known as a z-score. Z-scores tell us how many standard deviations a given "raw" score is from the mean.

(Self-Test): In the Normal model N(50, 4), standardize a score of 55.
(Answer): Find the z-score using the formula:

z = (55 - 50) / 4 = 5 / 4 = 1.25. The score is 1.25 standard deviations above the mean.

(Self-Test): In the Normal model N(50, 4), find the z-score for 42.
(Answer): z-scores can be negative, too. In this problem, 42 is less than the mean. So it will lie to the left of 50, and its z-score will be negative. z = (42 - 50) / 4 = -8 / 4 = -2. The score is 2 standard deviations below the mean.
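If you'd like to check these by machine, the standardizing step is a one-line function. Here is a minimal Python sketch (the function name is my own) that reruns the three examples from this post.

```python
def z_score(raw_score, mean, sd):
    """How many standard deviations raw_score lies from the mean (negative means below)."""
    return (raw_score - mean) / sd


# The three examples worked out in this post
z_15 = z_score(15, 14, 2)   # N(14, 2), score of 15
z_55 = z_score(55, 50, 4)   # N(50, 4), score of 55
z_42 = z_score(42, 50, 4)   # N(50, 4), score of 42

print(z_15, z_55, z_42)   # prints: 0.5 1.25 -2.0
```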

Why would we want to standardize our scores? There are actually two reasons.

1. It can help us see how unusual a score might be.

How? Well, the percents we have been talking about can also be thought of as probabilities. For example, the probability that a score is greater than the mean is 50%, the same as the probability that a score is less than the mean. In the last post, we marked off regions and computed percents. In statistics, we consider any z-score of 3 or more, or -3 or less, as unusual, because (as we saw in the last post) only 0.15% of scores are in each of those regions. In other words, the probability of seeing a score in one of these regions is at most 0.15%, less than even a quarter-percent. That's unusual.

2. It can allow us to compare apples to oranges.

How? Well, suppose you have just gotten back 2 tests you took: one in algebra and one in earth science. Suppose further that both score distributions follow a [different] Normal model. The algebra test's scores follow N(80, 5) and the earth science test's scores follow N(85, 8). Now imagine that you got a 90 on the algebra test and a 93 on the earth science test. You can easily see that percentage-wise, your score on the earth science test is higher than your algebra score. But relative to the distributions, on which test did you perform better?

To find out, figure out the z-score for each test score. For your algebra test, your z-score is (90 - 80) / 5 = 10 / 5 = 2 (2 standard deviations above the mean). For your earth science test, your z-score is (93 - 85) / 8 = 8 / 8 = 1 (1 standard deviation above the mean). Relatively speaking, your performance was better on the algebra test; that is, your score was more exceptional. Think of the probabilities. The probability of a score that's 2 standard deviations or more above the mean is 2.5%, whereas a score that's 1 standard deviation or more above the mean is 16%. Get the idea?

(Self-Test): Suppose Tom's algebra test score was 86 and his earth science test score was 86. In which test did Tom perform better, given that the test scores follow the Normal models we used above?
(Answer): Tom's z-score on the algebra test was z = (86 - 80) / 5 = 6/5 = 1.20. His z-score on the earth science test was z = (86 - 85) / 8 = 1/8 = 0.125. Because Tom's z-score on his algebra test (1.20) is higher than his earth science test z-score (0.125), his algebra performance was better than his earth science performance.
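The 2.5% and 16% figures quoted above come from the rounded 68-95-99.7 percentages. As a rough cross-check, the sketch below uses the NormalDist class from Python's standard statistics module (available in Python 3.8 and later) to compute the same upper-tail probabilities more precisely; the variable names are my own.

```python
from statistics import NormalDist


def z_score(raw_score, mean, sd):
    """How many standard deviations raw_score lies from the mean."""
    return (raw_score - mean) / sd


standard_normal = NormalDist(0, 1)   # the Standard Normal model N(0, 1)

tests = {
    "algebra": z_score(90, 80, 5),         # scores follow N(80, 5)
    "earth science": z_score(93, 85, 8),   # scores follow N(85, 8)
}

upper_tail = {}
for name, z in tests.items():
    # Probability of a score at least this many standard deviations above the mean
    upper_tail[name] = 1 - standard_normal.cdf(z)
    print(f"{name}: z = {z:.2f}, chance of scoring this high or higher = {upper_tail[name]:.1%}")
```

The more precise values are roughly 2.3% and 15.9%, close to the rule-of-thumb figures, and the conclusion is the same: the algebra score is the more exceptional one.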

One more thing...let's use the z-score formula to go backwards: to convert a standardized score back to a raw score. Suppose Nancy earned a score on the algebra test that was 1.6 standard deviations below the mean. (In other words, her z-score was -1.6.) What actual percentage score would that represent for her, assuming N(80, 5)? In this case, we would start with the z-score formula and fill in what we know. Then, using algebra (!) we would solve for the "raw" score...

-1.6 = (X - 80) / 5

-8 = X - 80

X = 72

Nancy's algebra test score was 72.
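Solving the z-score formula for the raw score gives X = mean + z * (standard deviation). Here is a tiny Python sketch of that rearrangement, with a made-up function name, applied to Nancy's score.

```python
def raw_score(z, mean, sd):
    """Undo the z-score formula: X = mean + z * sd."""
    return mean + z * sd


nancy = raw_score(-1.6, 80, 5)   # 1.6 standard deviations BELOW the mean of N(80, 5)
print(f"Nancy's raw score: {nancy:.1f}")
```

Note the negative sign on the z-score: leaving it off would give 88, a score above the mean instead of below it.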
In the next post, we'll expand on the probability side of the Standard Normal model.

More About the Normal Distribution: the 68-95-99.7 Rule

2011-08-31T15:07:00.000-07:00

In the last post, we covered the Normal distribution as it relates to the standard deviation. We said that the Normal distribution is really a family of unimodal, symmetric distributions that differ only by their means and standard deviations. Now is a good time to introduce a new term: Parameter. When dealing with perfect-world models like the Normal model, their major measures -- in this case, their mean and standard deviation -- are called parameters. The mean is denoted by the Greek letter mu (pronounced "mew"), µ, and the standard deviation is denoted by the Greek letter sigma, σ. We can refer to a particular Normal model by identifying µ and σ and using the letter N, for "Normal": N(µ, σ). For example, if I want to describe a Normal model whose mean is 14 and whose standard deviation is 2, I use the notation N(14, 2).

Every Normal model has some pretty interesting properties, which we will now cover. Take the above model N(14,2). We'll draw it on a number line centered at 14, with units of 2 marked off in either direction. Each unit of 2 is one standard deviation in length. It would look like this...

Now let's mark off the area that's between 12 and 16; that is, the area that's within one standard deviation of the mean. In a Normal model, this region will contain 68% of the data values:

If you've ever heard of "grading on a curve," it's based on Normal models. Scores within one standard deviation of the mean would generally be considered in the "C" grade range.

Now, if you consider the region that lies within two standard deviations from the mean; that is, between 10 and 18 in this model, this area would encompass 95% of all the data values in the data set:

From a grading curve standpoint, 95% of the values would be Bs, Cs, or Ds. Finally, if you mark off the area within 3 standard deviations from the mean, this region will contain about 99.7% of the values in the data set. These extremities would be the As and Fs in our grading curve interpretation.

In statistics, this percentage breakdown is called the "Empirical Rule," or the "68-95-99.7 Rule."

What about the regions below 8 and beyond 20? These areas account for the remaining 0.3% of the data values. That makes 0.15% on each side.

(Self-Test): What percent of data values lie between 12 and 14 in N(14,2) above?
(Answer): If 68% represents the full area between 12 and 16 and given that the Normal model is perfectly symmetric, there must be half of 68%, or 34% between 12 and 14. Likewise, there would be 34% between 14 and 16.

(Self-Test): What percent of data values lie between 16 and 18 in N(14,2) above?
(Answer): Subtract: 95% minus 68% = 27%. This represents the percent of data values between 10 and 12 and between 16 and 18, combined. Divide by 2 and you get 13.5% in each of these regions.

If you do similar operations, the various areas break down like the following:

(Self-Test): In any Normal model, what percent of data values are more than 2 standard deviations above the mean?
(Answer): Using the above model, the question is asking for the percent of data values that are more than 18. You would add the 2.35% and the 0.15% to get the answer: 2.50%.
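The region-by-region breakdown is just repeated subtracting and halving, which the following Python sketch spells out. It starts from the three rounded percentages in the rule; the variable names are my own.

```python
# Percent of values within 1, 2, and 3 standard deviations of the mean (the rule itself)
within = {1: 68.0, 2: 95.0, 3: 99.7}

# By symmetry, each band is split evenly between the two sides of the mean
band_0_to_1 = within[1] / 2                 # e.g., between 14 and 16 in N(14, 2)
band_1_to_2 = (within[2] - within[1]) / 2   # e.g., between 16 and 18
band_2_to_3 = (within[3] - within[2]) / 2   # e.g., between 18 and 20
beyond_3 = (100 - within[3]) / 2            # e.g., beyond 20

more_than_2_above = band_2_to_3 + beyond_3  # the last self-test

print(f"0 to 1 SD: {band_0_to_1:.2f}%")
print(f"1 to 2 SD: {band_1_to_2:.2f}%")
print(f"2 to 3 SD: {band_2_to_3:.2f}%")
print(f"beyond 3 SD: {beyond_3:.2f}%")
print(f"more than 2 SD above the mean: {more_than_2_above:.2f}%")
```

Doubling the four bands and adding them up should give back 100%, which is a handy check that nothing was left out.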

This is all well and good if you're looking at areas that involve a whole number of standard deviations, but what about all the in-between numbers? For example, what if you wanted to know what percent of data values are within 1.5 standard deviations of the mean? This will be the subject of a future post.

More About Standard Deviation: The Normal Distribution

2011-08-27T07:37:00.000-07:00

I haven't said a lot about Standard Deviation up to now: not much more than the fact that it's a measure of spread for a symmetric dataset. However, we can also think of the standard deviation as a unit of measure of relative distance from the mean in a unimodal, symmetric distribution.

Suppose you had a perfectly symmetric, unimodal distribution. It would look like the well-known bell curve. Of course, in the real world, nothing is perfect. But in statistics, we talk about ideal distributions, known as "models." Real-life datasets can only approximate the ideal model...but we can apply many of the traits of models to them.

So let's talk about a perfectly symmetric, bell-shaped distribution for a bit. We call this model a Normal distribution, or Normal model. Because we're dealing with perfection, the mean and median are at the same point. In fact, there are an infinite number of Normal distributions with a particular mean. They only differ in width. Below are some examples of Normal models.
Notice that their widths differ. Another word for "width" is "spread"...which brings us back to Standard Deviation! Take a look at the curves above. In the center section, the shape looks like an upside down bowl, whereas the outer "legs" look like part of a right-side-up bowl. Now imagine the point at which the right-side-up parts meet the upside-down part. Look below for the two blue dots in the diagram. (P.S. They are called "points of inflection," in case you were wondering.)

As shown above, if a line is drawn down the center, the distance from that line to a blue point is the length of one standard deviation. Can you see which lengths of standard deviations in the earlier examples are larger? Smaller?

So, there are two measures that define how a particular Normal model will look: the mean and the standard deviation.

I'd be remiss if I didn't tell you that there is a formula for numerically finding the standard deviation. Luckily, there's a lot of technology out there that automatically computes this for you. (I showed you how to do this in MS-Excel in an earlier post.)

Suppose you have a list of "n" data values, and when you look at a histogram of these values, you see that the distribution is unimodal and roughly symmetric. Call the values x1, x2, x3, etc. We compute the mean (average, remember?) and note it as x with a bar over it (x-bar). Then the formula is:

s = sqrt[ ∑(x - x-bar)^2 / (n - 1) ]

What does the ∑ mean? Let's take the formula apart. First, you are finding the difference between each data value and the mean of the whole dataset. You're squaring it to make sure you're dealing only with positive values. The ∑ means you should add up all those squared answers, one for each value in your dataset. Once you have the sum, you divide by (n-1), which gives you (roughly) an average of all the squared differences from the mean. This measure (before taking the square root) is called the variance. When you take the square root, you have the value of the standard deviation. So, you see, the standard deviation is the square root of the average squared difference from the mean.
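Here is the same recipe carried out step by step in Python, as a sketch only. The data values are hypothetical, chosen so the arithmetic is easy to follow, and the result is compared against the stdev function in Python's standard statistics module, which also divides by (n-1).

```python
from math import sqrt
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data for illustration only

n = len(values)
mean = sum(values) / n                               # x-bar
squared_diffs = [(x - mean) ** 2 for x in values]    # (x - x-bar)^2 for each value
variance = sum(squared_diffs) / (n - 1)              # add them up, divide by (n-1)
sd = sqrt(variance)                                  # square root gives the standard deviation

print(f"mean = {mean}, variance = {variance:.4f}, standard deviation = {sd:.4f}")
print(f"statistics.stdev agrees: {statistics.stdev(values):.4f}")
```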

Well, that's a lot to digest, so I'll continue with the properties of the Normal model in my next post.

Analyzing Quantitative Distributions

2011-08-20T08:05:00.000-07:00

This post deals only with quantitative data.

When you have a quantitative dataset, it is always a good idea to look at a graphical display of it: usually a histogram or boxplot, although there are others. What you are looking to describe is the shape of the data (the subject of a recent post), the approximate center of the dataset, the spread of the values, and any unusual features (such as extremely low or high values -- outliers). We'll take them one at a time.

Shape
The shape of the dataset helps us determine how to report on the other features. We went into detail about shape in a very recent post ("The Shape of a Quantitative Distribution"). If the display is basically symmetric, you will use the mean to describe the center and a measure called the Standard deviation to describe the spread. If the display is non-symmetric, you will use the median to describe the center and the interquartile range to describe the spread. Here's a handy chart to summarize this.

Center
In a symmetric distribution the mean and median are at approximately the same place; however, statisticians use the mean. In a non-symmetric distribution, the median is used because by its very definition, it is not calculated using any extreme points or points that skew the calculation.

Spread
Notice that the range (which is the difference between the largest and smallest data value in the set) is not generally used to describe the spread. In a symmetric distribution we use the standard deviation, which has a complicated formula but a simple description: the Standard Deviation is the square root of the average squared difference between each value in the dataset and the mean of the dataset. You can find the standard deviation using a graphing calculator or a tool like MS-Excel. The usual symbol for the standard deviation of a dataset is s.

Standard deviations are relative. By that I mean that you can't tell just by looking at its value whether it's large or small -- it depends on the values in the dataset. Standard deviations are good for comparing spreads if you have two distributions. And, we'll soon find out that the standard deviation has an extremely important use in statistics. Here's an example for finding the mean and standard deviation using MS-Excel:

In a non-symmetric distribution we use the Interquartile Range (IQR) because this tells the spread of the central 50% of the data values. Like the median, the IQR isn't influenced by outliers or skewness. Here's an example of finding the median and IQR using MS-Excel:
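To see what "isn't influenced by outliers" means in practice, here is a rough Python sketch. It uses the 21 test scores from the "Picturing Quantitative Data" post and then swaps the top score for a made-up extreme value of 198 (a hypothetical data-entry slip). Quartiles are found as the medians of the lower and upper halves, the method used in this series; the function name is my own.

```python
import statistics

# The 21 scores from the "Picturing Quantitative Data" post
scores = [98, 95, 92, 91, 91, 90, 89, 86, 86, 86, 85,
          82, 82, 81, 81, 75, 75, 75, 74, 73, 72]


def summarize(values):
    """Mean and SD, plus median and IQR (quartiles = medians of the two halves)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # Skip the middle value itself when the count is odd
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),
        "median": statistics.median(data),
        "iqr": statistics.median(upper) - statistics.median(lower),
    }


original = summarize(scores)
# Hypothetical outlier: pretend the 98 was mistyped as 198
with_outlier = summarize([198 if s == 98 else s for s in scores])

for label, stats in (("original", original), ("with outlier", with_outlier)):
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

The mean and standard deviation should both jump when the outlier is introduced, while the median and IQR stay exactly where they were.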

Unusual Features
As mentioned above, unusual features include extreme points (if any), also known as outliers. In an earlier post, we covered how to determine the boundaries for outliers. If a data value lies outside the boundaries, we call it an outlier. If a value isn't quite an outlier but close, it's worth mentioning in a description of a distribution. Just call it an "extreme point."

The Shape of a Quantitative Distribution

2011-08-16T13:06:00.000-07:00

When you graph quantitative data, you can often see some kind of shape emerge. Here are some typical block structures that illustrate some possible shapes a display might take. We call the set of values "distributions."

The first thing you need to determine is if there is any symmetry to the graph. If you were to visualize a vertical line going down the center, does each side look like a mirror image of the other? No real-life distribution will be perfectly symmetrical, but if it's close, it's worth mentioning.

(Self-Test): Which three of the above graphs look symmetric?
(Answer): B, D, and E.

The next thing you might notice is that some graphs have peaks whereas others look pretty level. We call the peaks "modes." If there's one peak, we say the graph has a unimodal distribution. If there are two peaks, the graph has a bimodal distribution.

(Self-Test): Which three of the above graphs look unimodal? Which is bimodal?
(Answer): A, B, and C are unimodal; D is bimodal.

FYI, a graph that's mostly level-looking, like graph E, is called uniform.

Now take a look at distributions A and C. Do you see that each has a "tail" at one end? When a distribution is off-center (compared to graph B), we say it is skewed. The direction of the tail is the direction of the skewness. For example, graph A is skewed left because the tail is on the left side of the graph. Graph C is skewed right.

When we describe a distribution we try to describe its shape, center, and spread, plus anything unusual about it, such as outliers. This will be the subject of my next post.

(Self-Test): Which of the distributions pictured above might have outliers?
(Answer): Skewed graphs, like A and C, depending on the length of their tails. The longer the tail, the more likely there are outliers.

Picturing Quantitative Data

2011-08-14T10:30:00.000-07:00

Just in case you didn't read the post "Picturing Categorical Data," here's a tiny bit of rehash: The first thing to note is that different types of data are pictured in different ways. You might recall from an earlier posting that there are two types of data:

Quantitative data are numbers for which it makes sense to do arithmetic. For example, with test scores, dollars, miles, etc., it is meaningful to, say, find their average. Quantitative data usually have labels.

Categorical data are sub-levels of variables that you do not combine with arithmetic. You usually count them. Examples: eye colors (blue, brown, green, hazel), education levels (high school, undergrad, graduate, etc.). Some categorical variables are numbers, but not ones that make sense to add; examples are zip codes, numbers on athletic jerseys, phone numbers.

This posting deals with quantitative data only. We dealt with categorical data in an earlier post.

There are many ways to picture quantitative data, but let me introduce two that are most used: the boxplot and the histogram. In both cases, let's use the following set of scores, which are arranged in descending order.

98	86	81
95	86	75
92	86	75
91	85	75
91	82	74
90	82	73
89	81	72

Boxplots

(Self-Test): In the earlier post, entitled "Percentiles and Quartiles," we discussed the 5 number summary, consisting of which 5 measures?
(Answer): the minimum, the 1st quartile (Q1), the median, the 3rd quartile (Q3), and the maximum.

A boxplot uses these 5 numbers, arranged on a number line according to their values. As we discussed, each of the four regions marked off by these 5 numbers contains 1/4 of the values in the data set, even if they don't have the same width.

There are 21 numbers shown above. The minimum is 72 and the maximum is 98. Since the numbers are arranged numerically, we can count to the 11th number to find the median: 85.

We now have 10 numbers on each side of the median. The 1st and 3rd quartiles are the medians of the lower 10 and upper 10 numbers, respectively. Q1 is 75, and Q3 is midway between 90 and 91. Remember: just look between the 5th and 6th number from the bottom and the top in each case.

So, our 5-number summary is: 72 (min), 75 (Q1), 85 (median), 90.5 (Q3), and 98 (max). To make a boxplot, draw a number line that stretches from, say, 70 to 100 and place the 5 numbers in their proper places. Then draw a box around the middle 3 numbers to show the span of the middle 50% of the number set:
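The counting above can be mirrored in a few lines of Python. This sketch follows the same method as this post (quartiles are the medians of the values below and above the median); the function name is my own, and other software may use slightly different quartile rules.

```python
import statistics

# The 21 scores from the table above
scores = [98, 95, 92, 91, 91, 90, 89, 86, 86, 86, 85,
          82, 82, 81, 81, 75, 75, 75, 74, 73, 72]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # Leave the median itself out of both halves when the count is odd
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


minimum, q1, median, q3, maximum = five_number_summary(scores)
print(f"min = {minimum}, Q1 = {q1}, median = {median}, Q3 = {q3}, max = {maximum}")
```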

A boxplot is helpful not only to see the spread of the data set, but also to see the symmetry of the data. For example, the stretch between the minimum and the median is about the same size as the stretch between the median and the maximum, which tells us that the mean lines up pretty well with the median. (Recall that it doesn't always end up this way!) The outer regions (which we call the "whiskers") are uneven, however, telling us that the data looks more heavily weighted toward the low side of the number line. If the boxplot looked like the following, the data would be roughly symmetric:

Symmetry in a data set is important to recognize, because we can assume other important traits. More about this in a future post.

Now, if a data set has outliers, they are shown with a symbol; for instance, a star:

In cases like this, a new maximum that lies within the boundary for outliers is determined. As I said in a previous post, you should look for possible reasons for outliers.

Histograms

A histogram tracks the number of observations (like the values in the above table) that lie within consecutive, same-width intervals on a number line. While most people do not make histograms by hand, we'll go through how it's done so you can understand what a histogram looks like. In statistics, histograms are used heavily, usually to see the shape of the data distribution.

Start by examining the data values. They range from 72 to 98, and the range (as defined in statistics) is 98 - 72 or 26. We want to divide 26 into a number of equal-sized intervals, then find out how many of the numbers in the data set lie in each interval. We will represent that number as the height of that interval's bar.

How many intervals do we need? Well, to me, that's a "squishy" question. You want enough to be able to see a shape, but not so many that you hardly have any values in each interval. A rule of thumb that I use is between 6 and 8. Because our range is 26, I'm going to use 7 for intervals, because 7*4 is 28, which is closer to 26 than a multiple of 6 or 8. So we have the following number line. Each interval will be 4 units wide (again, because 7*4 is 28), so I've numbered the line accordingly.

(Self-Test): What's with the // on the number line above?
(Answer): Number lines should always be to scale, but in this case a number line going from 0 to 100 probably wouldn't fit on the page. So, I use the // to indicate a break in the numbering.

Now I need to make a small decision: Shall I count the labelled numbers as the start or the end of the interval? It doesn't matter, just as long as you're consistent. I'm going to count the numbers as the start of each interval. So, my first interval will go from 72 to 75; my second will be from 76 to 79, and so on.

For each interval, you need to count how many values in the data set fall within that interval. For example, there are 6 numbers that are between 72 and 75, inclusive. So my bar for the first interval will be 6 units high.

Our second interval, which goes from 76 to 79, has no values. So we leave a space to show that zero values are in that interval. The next interval goes from 80 to 83, and there are four values that lie in that interval. So, the bar will be 4 units high.

Continuing on for the rest of the intervals, here is our finished product:

This histogram has an interesting shape. If the dataset represents grades, we can assume that there are 6 out of 21 "C" students, while twice that many are in the "B" to "A-" range. I'll be saying more about shapes of distributions in a later post. For now, suffice it to say that the shape of a distribution is an important thing to assess when you're looking at data.
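
The interval counting described above can also be sketched in a few lines of Python. This is only an illustration under the choices made in this post (7 intervals, each 4 units wide, with the first starting at 72); the variable names are made up, and the text "bars" are a rough stand-in for a real chart.

```python
scores = [98, 95, 92, 91, 91, 90, 89,
          86, 86, 86, 85, 82, 82, 81,
          81, 75, 75, 75, 74, 73, 72]

start = 72        # left edge of the first interval
width = 4         # each interval covers 4 whole-number scores
n_intervals = 7   # 7 * 4 = 28, enough to cover the range of 26

# Tally how many scores fall in each interval.
counts = [0] * n_intervals
for score in scores:
    counts[(score - start) // width] += 1

# Print a sideways text histogram, one row per interval.
for i, count in enumerate(counts):
    low = start + i * width
    high = low + width - 1
    print(f"{low}-{high}: {'#' * count} ({count})")
```

Each row's label counts the listed number as the start of the interval, matching the decision made earlier in the post.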

Keeping it Simple -- the Area Principle

2011-08-12T13:37:00.000-07:00

If you made it through the [rather long] post entitled "Picturing Categorical Data," this next post brings out a fine point about making graphical displays. If you look through articles and newspapers, or perhaps at slides that people make in your office for presentations, you might see a tendency for folks to want to make them fancy. While this is admirable -- trying to show potentially dry information in a more splashy way -- it can end up being misleading. Read on.

Take a look at this pie chart, which graphs quantities that are 10%, 20%, 30%, and 40% of the whole.

Now compare it to its flashy 3-D counterpart:

"What's the difference?" you might ask. However, I would argue that the 3-D version makes the green slice (30%) look larger than the purple (40%) slice. Can you see it? This doesn't always happen with 3-D displays, but 3-D displays are prone to this. It's something that you want to look out for, just in case.

When a smaller pie slice or bar (in a bar chart) looks larger than a slice or bar that represents a larger quantity, we say that the display violates the Area Principle. In the first display, each piece was proportionately sized relative to the others.

Why make this point? When trying to communicate something graphically, the main point is to get the information across with as little potential confusion as possible -- not to impress people with fancy pictures. Statistics can be mind-boggling to many, so why not try to make things as straightforward as possible? It's the old K.I.S.S. principle. Statistics-challenged people will thank you! (As I'm sure you're thanking me for the short post!)

Has the Stock Market Hit Bottom?

2011-08-10T11:03:00.000-07:00

Oh, if only there were a nice statistical way to answer that question! In the August 10, 2011 issue of USA Today, the article at http://www.usatoday.com/money/markets/2011-08-09-has-market-hit-bottom_n.htm offered a lot of insights from experts, but the authors were wise enough to know that there’s just no telling whether the stock market will rise or fall, or whether better days are to come yet.

For anyone who’s not aware, the Dow Jones Industrial Average (DJIA) plummeted 535 points on Monday, following the bad news that Standard & Poor’s had dropped the US credit rating from AAA to AA+, and probably a multitude of other events that stripped investors of their confidence. The following day, August 9, the DJIA gained back 435 points, bringing it closer to pre-plummet levels. At this writing (Wednesday, August 10, 2011 at 1:35 p.m.), the Dow is down about 357 points.

The insights offered by experts as to why the market is plummeting range from emotional sell-offs in light of the US credit rating downgrade, to the state of the overall world economy, to the lack of warm fuzzies in Congress, but no one is making the mistake of trying to predict the future. Nothing, not even statistics, can do that reliably.

Theoretically, even if somehow we were able to identify *all* of the factors (“variables” we call them) that make for fluctuations in the stock market, and then we found some realistic way to quantify each of the factors, we could come up with a [probably quite complicated] equation involving all of those variables that we could use to predict future behavior, but it would be just that – a prediction. You simply cannot use statistics to foretell the future. The best one can do is make a fact-based, educated guess.

Similarly, you cannot use the past to predict the future. Most of us fall into this trap pretty easily. For example, a lot of people moved their investments to Treasury-related securities, because Treasury issues have shown a steadily upward trend without the volatility of riskier investments, like stocks. In fact, in the past 10 years, inflation-protected securities (TIPS) have out-performed the Dow, the Nasdaq, and the S&P 500 significantly…at about half the risk (as rated by Vanguard). So are you rushing to your favorite investment site to move your money? If you are, then you’re using past performance to predict the future! Sometimes it works, but just as often it doesn’t.

Has the stock market hit bottom? Read the article, and you be the judge. Just trying to set your expectations about what statistics can and cannot do!

Picturing Categorical Data

2011-08-07T14:51:00.000-07:00

They say a picture is worth a thousand words, and that's certainly true with statistical data! The first thing to note is that different types of data are pictured in different ways. You might recall from an earlier posting that there are two types of data:

Categorical data are sub-levels of variables that you do not combine with arithmetic. You usually count them. Examples: eye colors (blue, brown, green, hazel, gray), education levels (high school, undergrad, graduate, etc.). Some categorical variables are numbers, but not ones that make sense to add; examples are zip codes, numbers on athletic jerseys, and phone numbers.

Quantitative data are numbers for which it makes sense to do arithmetic on them. For example, with test scores, dollars, miles, etc., it is meaningful to, say, find their average. Quantitative data usually have labels.

In this post, we'll picture categorical data only; I'll cover quantitative data in another post.

There are two ways to "graphically" picture categorical data: bar charts and pie charts. You see these all the time in articles. Let's use the following base data that shows eye colors in a group of 115 people:

Bar Charts
The most straightforward type of display is a bar chart. Bar charts can portray the actual numbers...

...or they can portray percents of the whole (115)...

Notice that the bars can be vertical or horizontal. In either case, the bars are arranged in either increasing or decreasing size.

If you have a number of very small bars, you can put them together as a larger combined bar and label it "Other." Such bars don't necessarily need to be placed in order; they usually appear as the last bar. Suppose, for example, you want to put green, hazel, and gray together as an "Other" bar. Then your graph would look like this:

Sometimes you will see a single bar with sections representing each category in proportion. This is called a segmented (or stacked) bar chart. They can appear with the actual counts (height of the single bar is equal to the sum of the counts), or as percentages (height of the single bar represents 100%). Here's how a segmented bar chart with percents would look. Notice that it's good to arrange the bars in decreasing order from bottom to top:

Segmented bar charts are especially helpful in comparing the same categories in two or more different groups. Suppose we had a second group of people whose segmented bar chart of eye colors looked slightly different. We could put the bars side by side in a single display, with the eye colors in the same order.

It is easy to see that there are fewer brown-eyed people in Group 2, but more blue-eyed people, about the same number of green-eyed and gray-eyed people, but fewer with hazel eyes.

One caution with two or more segmented bars: Unless each group is exactly the same size, you should use percents rather than counts. Otherwise it would be nearly impossible to compare the bars.

Pie Charts
A pie chart shows each category as a proportional-sized pie slice. Pie charts always use percents. So, a pie chart for our eye color data would look like this:

Notice that the pieces are arranged in order of size as you go around the pie.

(Self-Test): Suppose there's another group -- this time consisting of 140 people -- whose eye colors are as follows. Make a "Group 3" segmented bar chart to show the differences.

(Answer): See below.

The Interquartile Range and Outliers

2011-07-31T14:10:00.000-07:00

In the previous post, I introduced percentiles and quartiles and said that the Interquartile Range (IQR) is found by subtracting Q1 (the first quartile) from Q3 (the third quartile). The IQR is significant because it tells us where the middle 50% of the numbers in the data set lie. This is especially helpful to know if there are extreme values at the low and/or high end of the data set.

Take the very short data set 40, 80, 86, 88, 100. Notice that the 40 is extremely far away from the other numbers; it is considered an extreme value. It would make the average (mean) unreasonably small with respect to the majority of the other numbers. We call 40 in this set an outlier. Outliers can occur by coincidence, but just as often there is some reason behind them. Statisticians should at least try to see if there's an explanation when they run into an outlier. Then they can analyze the data set with that in mind. Sometimes it's good to do two analyses: one with the outlier, and one without it.

How do we tell if a number is extreme enough to be an outlier? Well, believe it or not, there is a mathematical way and it's pretty straightforward. To test for outliers, we need to know what the 1st and 3rd quartiles are, so we can compute the IQR. Here are the steps:

1. Find the IQR.
2. Multiply the IQR by 1.5.
3. Add the resulting number to Q3 to get an upper boundary for outliers.
4. Subtract the same resulting number (from #2) from Q1 to get a lower boundary for outliers.
5. If a number in the data set lies beyond either boundary, it is considered an outlier.

In the example above (40, 80, 86, 88, 100), Q1 is 80 and Q3 is 88. Going through the steps...

Q3 - Q1 = 88 - 80 = 8 (The IQR)
IQR * 1.5 = 8 * 1.5 = 12
Q3 + 12 = 88 + 12 = 100. This is the upper boundary for outliers. Since our maximum is right at 100, we have no high outliers.
Q1 - 12 = 80 - 12 = 68. This is the lower boundary for outliers. Since our minimum number, 40, is less than 68, our data set has one low outlier: 40.
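
The same steps can be sketched in Python. This is only an illustration: the function name outlier_boundaries is made up, and Q1 and Q3 are typed in as given in this post rather than computed, since different software packages use slightly different quartile rules.

```python
def outlier_boundaries(q1, q3):
    """Return the (lower, upper) boundaries for outliers using the 1.5 * IQR rule."""
    iqr = q3 - q1          # step 1: find the IQR
    step = 1.5 * iqr       # step 2: multiply the IQR by 1.5
    return q1 - step, q3 + step   # steps 3 and 4: the two boundaries

data = [40, 80, 86, 88, 100]
q1, q3 = 80, 88            # quartiles as stated in the example above

lower, upper = outlier_boundaries(q1, q3)

# Step 5: a value is an outlier only if it lies beyond a boundary,
# so a value sitting exactly on the boundary does not count.
outliers = [x for x in data if x < lower or x > upper]

print("Boundaries:", lower, upper)
print("Outliers:", outliers)
```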

Note that there can be any number of outliers: one on each side, two high ones and no low ones, and so forth. If there are outliers, we note them as such, try to explain them if possible, and restate the minimum and maximum at the boundaries we found.

Moral of the Story: How do we try to explain outliers? The best way is to examine the data: where it came from, what it represents, and where and how it was collected. In other words, look at the context. An outlier could turn out to be a simple mis-keying error, or might indicate something more serious. Don't make something up! But do have a look to see if there's something obvious going on.

Connection or Causation?

2011-07-30T08:04:00.000-07:00

I was reading an article the other day about a possible connection between mental illness and nose jobs, and thought, "Aha! A perfect possible example of crazy statistics!" But I was not only disappointed on that count, but also pleasantly surprised that the article showed valid statistical practices. What I can focus on in this post are two things:

What valid statistical practices were used?
How to avoid common pitfalls when reading an article that involves statistics.

First, you might want to quickly read the article:
http://well.blogs.nytimes.com/2011/07/27/some-nose-job-patients-may-have-mental-illness/
Go ahead; I'll wait.

There are enough bad statistical articles out there, which I'm sure I'll happen upon and grab for another post, but as I said, this article is not one of them. First, the author was careful to qualify the difference between people who have valid medical reasons to seek plastic surgery on their noses and those who appear to be negatively obsessed with a perfectly normal-looking (from a plastic surgeon's standpoint) nose. The author was careful to separate the two.

This was a "controlled" study, as statisticians say. First, there was a rather large sample (266) of patients seeking nose jobs in Belgium, who filled out a diagnostic questionnaire that was intended to uncover the condition, called Body Dysmorphic Disorder (BDD). The author included all of the vital information -- number of people, location of the study, duration of the study, and a link to the full journal article so that the study could be reproduced if desired. Rather than worrying that a duplicate study will refute the findings of the original study, most statisticians welcome "do-overs" of the study, because it can either strengthen their findings or point out something they might have overlooked. After all, from an ethical standpoint, most researchers are after the truth, so they try to make the specifics as transparent as possible.

Separating those patients who had a "valid" complaint about their nose (for instance, a breathing problem) from those who showed signs of BDD was an excellent example of controlling the study. For example, if they hadn't done this and went on conducting the study, they couldn't have separated the "valid" patients from the BDD patients later. By controlling the study, they were able to determine that only 2% of the "valid" patients showed evidence of BDD while 43% of the "invalid" patients did. If everyone had been lumped together, the results might have been diluted. Researchers need to think of possible variables ahead of time that could muddy the waters later. The variable of "valid" versus "invalid" was one of these. We call these variables lurking variables, because they can work under the surface, making it look like one thing (seeking a nose job) is causing the other (BDD)... or vice versa.

Another interesting and admirable thing the article did was volunteer extra information to answer a question that the reader might have had upon reading the article; that is, they singled out nose jobs from other plastic surgery as being notable with respect to this connection. They seem to have thought things through very well.

Now, on to you as a reader of such articles. My only caution in this case is that you don't read any type of cause and effect situation into the article. Having BDD doesn't necessarily cause people to seek a nose job, although there seems to be a very strong connection between the two. Further study and repetition of the study would be needed to prove cause and effect. To the author's credit, the article was careful not to lead you in the causation direction.

Simply put: Just because two variables appear to be connected doesn't prove that one variable causes another. For those readers who recall the "old days" when people smoked cigarettes and didn't know about the dangers of lung cancer, remember how it took years to get even the mildest cautionary note placed on cigarette packs? The first such caution cited a connection between cigarette smoking and lung cancer. After years of further controlled study, researchers were finally able to put stronger cautions on cigarette packs: that cigarette smoking causes lung cancer.

Moral of the Story: This was an example of an article in which the author was very careful not to imply cause and effect. Other articles aren't so clear. As you read articles that involve numbers and statistics, be aware of this and don't jump to causation conclusions!

Self-Test: Name a lurking variable in this statement: "The more firefighters sent to a fire, the more damage is done to the structure."

Answer: Size of the fire is the lurking variable. It influences both variables: the amount of damage done, and the number of firefighters sent to the scene.

(Credit: Stats: Modeling the World, Bock, Velleman, De Veaux. Thanks!)

Percentiles and Quartiles

2011-07-29T06:10:00.000-07:00

Suppose you just received your GMAT scores and found that your score was at the 85th percentile. A percentile is a number that tells you what percent of scores are below yours. So, in this example, being at the 85th percentile means that 85% of all the GMAT scores were below yours. Good job!

Suppose you take your baby in for a checkup with her pediatrician, who tells you that your baby's weight is at the 70th percentile. This means that 70% of all babies weigh less than yours. Not too hard, right?

"Percentile" contains the word percent, which is a number out of 100. This means that the list of the numbers are batched into one hundred groups: each group contains 1/100 of the values.

There are other "-iles" that use numbers other than 100. For example, think of the word decile. This implies, as with decimals, the number 10. Regarding baby weights, the list of all weights is batched into ten groups of weights. Your baby (who was at the 70th percentile) would be at the 7th decile.

In statistics, we work a lot with quartiles. This implies the number 4 (as in quarter, quartet, etc.). Imagine the numbers grouped into 4 batches (in numerical order, of course). Each quartile is the number marking off each batch. The first quartile marks the end of the lowest 25% of the numbers in the set, the second quartile marks the end of the 2nd 25% of the numbers in the set, and so on. By the way, the 2nd quartile is also known as the median, which appeared in an earlier post. The median marks the point at which half the numbers are below and half are above.

An example using real data might help out. Think of the following scores, which I've arranged in increasing order:
52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93.

There are 12 scores, so divide them equally into 4 groups:
52 54 61 / 63 68 68 / 72 75 82 / 82 84 93.

The numbers dividing each of these groups -- 62, 70, 82 -- are the 1st, 2nd (median), and 3rd quartiles. (The 4th quartile is also known as the maximum.)

Or, you could first find the median of the whole list, as illustrated in my last post. That's the 2nd quartile, also known as Q2. To find Q1, the first quartile, find the median of the lower half (i.e., from the minimum to the median). To find Q3, find the median of the upper half of the scores (i.e., from median to maximum).
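
Here is a minimal Python sketch of that median-of-halves approach, applied to the 12 scores above. It assumes an even number of scores, so the list splits cleanly into a lower half and an upper half; the variable names are made up for this illustration.

```python
from statistics import median

# The 12 scores, already in increasing order.
scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]

half = len(scores) // 2        # 12 scores, so each half holds 6
lower_half = scores[:half]     # minimum up to the middle
upper_half = scores[half:]     # middle up to the maximum

q1 = median(lower_half)        # midway between the 3rd and 4th scores
q2 = median(scores)            # the median of the whole list
q3 = median(upper_half)        # midway between the 9th and 10th scores

print("Q1:", q1, "Median:", q2, "Q3:", q3)
```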

Just one more thing: Even though each quartile contains the same number of scores, that doesn't mean the spans of the scores in each quartile are equal if you were to mark them on a number line. Look at the 12 scores above. The lowest fourth ranges from 52 to 61: a span of 9 units. The next fourth is only 5 units wide (63 to 68). The one after that is 10 units wide: from 72 to 82. The last group ranges from 82 to 93, 11 units wide. So if we were to portray the quartiles on a straight line they would appear at the slash (/) marks:

- - - - - - - - - / - - - - - / - - - - - - - - - - - / - - - - - - - - - - -
52                       62              70                              82                               93
min                    Q1              median                      Q3                              max

If I surround Q1, the median, and Q3 with a box, I get a picture similar to what's called a boxplot in statistics:
                    ___________________
|- - - - - - - - | - - - - - | - - - - - - - - - - - | - - - - - - - - - - -|
                    |______|_____________|
52                   62           70                            82                           93
min                 Q1          median                    Q3                          max
      (Excuse the crudeness of the diagram -- I confess I have a lot to learn about HTML!)

By the way, the minimum, Q1, the median, Q3, and the maximum are also known as the 5-Number Summary and are considered a great way to describe any set of numerical (quantitative) data. Here's another measure that is important: the Interquartile Range (or IQR), which is found by subtracting Q1 from Q3. The IQR gives the span of the middle 50% of the data set. This comes in handy when we're trying to describe a set of numbers that has extreme values (outliers).

Moral of the Story: In a boxplot, each of the 4 regions bounded by the 5 numbers contains 25% of the numbers in the data set. This does not mean that each region has the same width.

Self-Test: The median represents what percentile?

Answer: Because the median marks the point at which half the numbers are below and half are above, the median represents the 50th percentile.

The Center: Mean, Median, and Mode

2011-07-27T12:04:00.000-07:00

Let's begin with what might be familiar territory: how people describe a list of numbers (data!) using a single central measure. There are three of these numbers -- mean, median, and mode -- and which one is best heavily depends on the data you're describing. Let's go over how to determine each of these measures, and discuss the pros and cons of each.

Consider the following very short list of test grades one of my students received last semester: 74, 84, 88, 71, and 88.

To find the mean (also known as the average): just add up all of the numbers and divide by how many numbers are in the set. That is: (74+84+88+71+88) divided by 5 = 405 / 5 = 81. The symbol we'll use for the mean is

         The mean is good to use when there are no extreme values (numbers that lie far
         outside the span of the other numbers). There are none in this set of numbers, so
         we say there are no outliers and so the mean is just fine to use.

To find the median (also known as the midpoint), arrange the grades in numerical order: 71, 74, 84, 88, 88. Note that we list duplicates as many times as they occur. Once the numbers are arranged, find the number that's in the middle of the list. In this short list, it's 84.

         Now, what if there is no middle number? For example, suppose we add a sixth
         score: 94. We now have the list, in order: 71, 74, 84, 88, 88, 94. When we look
         for the middle, there isn't a single score but two: 84 and 88. In this case, find the
         average of these two numbers: (84+88)/2 = 86.

         The median is good to use almost anytime, but is especially important to use when
         there are outliers. To see why the median is more accurate than the mean as a
         central measure when there are outliers, think about the salaries of a very small
         company, from the line workers to the CEO:

                 Line worker 1:       $ 28,000
                 Line worker 2:       $ 32,500
                 Line worker 3:       $ 33,100
                 Supervisor:           $ 45,000
                 Marketing person: $ 62,300
                 Sales person:        $ 70,000
                 CEO:                    $175,000

         Compare the mean ($63,700) to the median ($45,000). The mean isn't realistic
         because 5 out of 7 of the workers are making less! This is because the outlier,
         $175,000, inflates the calculation of the mean. On the other hand, the median is
         much more reflective of the central salary: 3 employees make more, 3 make less.

Finding the mode is easy, but it exists only if there are duplicates in the list. Simply find the number that occurs the most. In our original list of 5 test scores, 88 occurs twice, so the mode is 88. In our salary example directly above, there is no mode because no salary occurs more than once. If one number occurs twice and another number occurs four times, the latter is the mode because it occurs the most.

        The mode is probably the least useful as a measure of center. The only time it
        makes sense is if there are many occurrences of the same number, compared to
        the number of other values. Example: 57, 66, 75, 75, 75, 75, 75, 75, 82.
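
To check these three measures by computer, here is a short Python sketch using the test grades and the salaries from this post. The mean and median come from Python's built-in statistics module; the modes helper is a made-up function written to follow this post's rule that a list with no repeated value has no mode.

```python
from collections import Counter
from statistics import mean, median

def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats.

    Hypothetical helper that follows the rule in this post: the mode
    exists only if at least one value occurs more than once.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]

grades = [74, 84, 88, 71, 88]
salaries = [28000, 32500, 33100, 45000, 62300, 70000, 175000]

print("Grades:   mean", mean(grades), "median", median(grades), "mode", modes(grades))
print("Salaries: mean", mean(salaries), "median", median(salaries), "mode", modes(salaries))
```

Running it on the salaries shows how far the CEO's salary pulls the mean away from the median.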

Moral of the Story: When you see the terms "mean" and "median" used in articles, do not assume that the writer is always using the right term. If you see the data, you can check this. Otherwise, the author might be confusing one measure for the other. Not everyone understands the difference, but now (hopefully) you do!

Self quiz: What are the measures of center (mean, median, and mode) for the following list of 8 student scores? 43, 77, 66, 73, 85, 75, 92, 81.

(Answer): Mean = 74; Median = 76; Mode = none. Which is the better measure? Technically, the median would be better because the low score of 43 is dragging the mean down a bit. However, it's pretty much a wash since the mean and median are so close to each other.

What is Statistics?

2011-07-26T13:52:00.000-07:00

In one sense, statistics is a mindset -- a way of looking at things that occur in the world. These "things" are usually called data or variables. Data can be numeric in ways that you can measure and label, like miles, dollars, feet, etc.; these are called quantitative variables. The other type of data is called categorical in that it is dividable into categories, like eye colors, levels of education, income ranges, etc.

Besides being a mindset, statistics is a set of methods for dealing with data. For example, we might have a list of stock prices for our favorite stock over the past month. We can find the average stock price, the maximum, the minimum, or the price that's smack in the middle (which is called the median). Don't worry about the meanings of all these terms...in time we will work with them all. Right now the important thing to know is that statistics consists of a Mindset and Methods.

Sometimes you have all the data in front of you and you want to analyze it, much like the stock prices above. This branch of statistics is called Descriptive Statistics. Other times, you only have some of the data and you want to draw reliable conclusions about all of the data. This branch of statistics is called Inferential Statistics. An example of inference is the poll that tracks how many people watch various TV shows each week. Polling companies contact a subset of TV viewers using valid statistical methods that allow them to state with some confidence what TV shows we all are watching. You'd be surprised at how few viewers are needed (compared to all the viewers in the US)!

I am going to keep my posts short and sweet. My purpose, at the least, is to teach you a little bit about statistics in each post. My hopes are that I will demystify statistics for you and convince you that statistics doesn't have to be difficult. I will make this as fun and interactive as I can with real-life examples and self-quizzes. I look forward to working with you and getting your feedback to guide future posts!

Self Test: For now, select the word at the top of the page that best describes your feeling about statistics.