Monday, September 5, 2011
For the purposes of this post, we will refer to the data values in a data set as "scores."
In the last post, we used the example N(14, 2) to illustrate the 68-95-99.7 Rule, which gives the percentages of scores lying within 1, 2, and 3 standard deviations of the mean. We can generalize the diagram we used there to N(0, 1), where 0 is the mean and 1 is the standard deviation. This makes the model easier to apply, because the units we're most accustomed to seeing -- -1, 0, 1, 2, and so on -- appear as standard deviation units. Take a look:
We call N(0, 1) the Standard Normal model. We can now use the number line to locate points that are any number of standard deviations from the mean...even fractional numbers.
In any Normal model, we're going to want to know how many standard deviations a particular score in the data set is from its mean. We can do this for any score, and it amounts to converting a "raw" score (a score from our data) to a "standardized" score (a score on the Standard Normal model). How do we do this? There's a formula, and it's really an easy one:

z = (X - µ) / σ

...where
X stands for the score you're trying to convert,
µ stands for the mean, and
σ stands for the standard deviation.
Suppose we're back in the Normal model N(14, 2) and we want to see how many standard deviations a score of 15 is from the mean. We would subtract our mean from 15, then divide by the standard deviation, 2. That is: (15 - 14) / 2 = 0.5. This means that our score of 15 is 0.5 standard deviations from the mean. 0.5 is the score on the Standard Normal model that represents our score from N(14, 2). We call 0.5 our standardized score, also known as a z-score. Z-scores tell us how many standard deviations a given "raw" score is from the mean.
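If you'd like to check conversions like this with a bit of code, here's a minimal Python sketch (the function name z_score is just for illustration, not a standard library routine):

# Convert a raw score to a z-score: the number of standard
# deviations the score lies from the mean.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

print(z_score(15, 14, 2))   # prints 0.5, matching the example above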
(Self-Test): In the Normal model N(50, 4), standardize a score of 55.
(Answer): Find the z-score using the formula:
z = (55 - 50) / 4 = 5 / 4 = 1.25. The score is 1.25 standard deviations above the mean.
(Self-Test): In the Normal model N(50, 4), find the z-score for 42.
(Answer): z-scores can be negative, too. In this problem, 42 is less than the mean, so it will lie to the left of 50, and its z-score will be negative. z = (42 - 50) / 4 = -8 / 4 = -2. The score is 2 standard deviations below the mean.
Why would we want to standardize our scores? There are actually two reasons.
1. It can help us see how unusual a score might be.
How? Well, the percents we have been talking about can also be thought of as probabilities. For example, the probability that a score is greater than the mean is 50%, the same as the probability that a score is less than the mean. In the last post, we marked off regions and computed percents. In statistics, we consider any z-score of 3 or more, or -3 or less, as unusual, because (as we saw in the last post) only 0.15% of scores are in each of those regions. In other words, the probability of seeing a score in one of these regions is at most 0.15%, less than even a quarter-percent. That's unusual.
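If you happen to have SciPy available, you can check that tail probability directly against the Standard Normal model (the exact value is about 0.135%, which the 68-95-99.7 Rule rounds to 0.15%):

from scipy.stats import norm

# Probability that a score lies 3 or more standard deviations
# above the mean -- the upper tail of N(0, 1).
print(norm.sf(3))   # about 0.00135, i.e. roughly 0.135%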
2. It can allow us to compare apples to oranges.
How? Well, suppose you have just gotten back 2 tests you took: one in algebra and one in earth science. Suppose further that each test's score distribution follows its own Normal model. The algebra test's scores follow N(80, 5) and the earth science test's scores follow N(85, 8). Now imagine that you got a 90 on the algebra test and a 93 on the earth science test. You can easily see that percentage-wise, your score on the earth science test is higher than your algebra score. But relative to the distributions, on which test did you perform better?
To find out, figure out the z-score for each test score. For your algebra test, your z-score is (90 - 80) / 5 = 10 / 5 = 2 (2 standard deviations above the mean). For your earth science test, your z-score is (93 - 85) / 8 = 8 / 8 = 1 (1 standard deviation above the mean). Relatively speaking, your performance was better on the algebra test; that is, your score was more exceptional. Think of the probabilities. The probability of a score that's 2 standard deviations or more above the mean is 2.5%, whereas the probability of a score that's 1 standard deviation or more above the mean is 16%. Get the idea?
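Using the z_score sketch from earlier, the comparison is two one-liners:

print(z_score(90, 80, 5))   # algebra: 2.0
print(z_score(93, 85, 8))   # earth science: 1.0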
(Self-Test) Suppose Tom's algebra test score was 86 and his earth science test score was 86. In which test did Tom perform better, given that the test scores follow the Normal models we used above?
(Answer): Tom's z-score on the algebra test was z = (86 - 80) / 5 = 6/5 = 1.20. His z-score on the earth science test was z = (86 - 85) / 8 = 1/8 = 0.125. Because Tom's z-score on his algebra test (1.20) is higher than his earth science test z-score (0.125), his algebra performance was better than his earth science performance.
One more thing...let's use the z-score formula to go backwards: to convert a standardized score back to a raw score. Suppose Nancy earned a score on the algebra test that was 1.6 standard deviations below the mean. (In other words, her z-score was -1.6.) What actual percentage score would that represent for her, assuming N(80, 5)? In this case, we would start with the z-score formula and fill in what we know. Then, using algebra (!) we would solve for the "raw" score...
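Filling in what we know (z = -1.6, µ = 80, σ = 5):

-1.6 = (X - 80) / 5
X - 80 = -1.6 × 5 = -8
X = 80 - 8 = 72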
Nancy's algebra test score was 72.
In the next post, we'll expand on the probability side of the Standard Normal model.
Wednesday, August 31, 2011
More About the Normal Distribution: the 68-95-99.7 Rule
In the last post, we covered the Normal distribution as it relates to the standard deviation. We said that the Normal distribution is really a family of unimodal, symmetric distributions that differ only by their means and standard deviations. Now is a good time to introduce a new term: parameter. When dealing with perfect-world models like the Normal model, their major measures -- in this case, the mean and standard deviation -- are called parameters. The mean is denoted by the Greek letter mu (pronounced "mew"), µ, and the standard deviation by the Greek letter sigma, σ. We can refer to a particular Normal model by identifying µ and σ and using the letter N, for "Normal": N(µ, σ). For example, if I want to describe a Normal model whose mean is 14 and whose standard deviation is 2, I use the notation N(14, 2).
Every Normal model has some pretty interesting properties, which we will now cover. Take the above model N(14,2). We'll draw it on a number line centered at 14, with units of 2 marked off in either direction. Each unit of 2 is one standard deviation in length. It would look like this...
Now let's mark off the area that's between 12 and 16; that is, the area that's within one standard deviation of the mean. In a Normal model, this region will contain 68% of the data values:
If you've ever heard of "grading on a curve," it's based on Normal models. Scores within one standard deviation of the mean would generally be considered in the "C" grade range.
Now, if you consider the region that lies within two standard deviations from the mean; that is, between 10 and 18 in this model, this area would encompass 95% of all the data values in the data set:
From a grading curve standpoint, 95% of the values would be Bs or Cs. Finally, if you mark off the area within 3 standard deviations from the mean, this region will contain about 99.7% of the values in the data set. The extreme regions beyond two standard deviations would be the As and Fs in our grading curve interpretation.
In statistics, this percentage breakdown is called the "Empirical Rule," or the "68-95-99.7 Rule."
What about the regions below 8 and beyond 20? These areas account for the remaining 0.3% of the data values, which works out to 0.15% on each side.
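By the way, the rule's percentages are rounded. If you'd like to verify them yourself and happen to have SciPy available, a few lines of Python will do it:

from scipy.stats import norm

# Percent of values within k standard deviations of the mean
# in any Normal model.
for k in (1, 2, 3):
    pct = (norm.cdf(k) - norm.cdf(-k)) * 100
    print(f"within {k} SD: {pct:.1f}%")   # 68.3%, 95.4%, 99.7%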
(Self-Test): What percent of data values lie between 12 and 14 in N(14,2) above?
(Answer): If 68% represents the full area between 12 and 16 and given that the Normal model is perfectly symmetric, there must be half of 68%, or 34% between 12 and 14. Likewise, there would be 34% between 14 and 16.
(Self-Test): What percent of data values lie between 16 and 18 in N(14,2) above?
(Answer): Subtract: 95% - 68% = 27%. This represents the percentage of data values between 10 and 12 and between 16 and 18 combined. Divide by 2 and you get 13.5% in each of these regions.
If you do similar operations, the various areas break down like the following: 0.15% below 8, 2.35% between 8 and 10, 13.5% between 10 and 12, 34% between 12 and 14, 34% between 14 and 16, 13.5% between 16 and 18, 2.35% between 18 and 20, and 0.15% above 20.
(Self-Test): In any Normal model, what percent of data values are more than 2 standard deviations above the mean?
(Answer): Using the above model, the question is asking for the percent of data values that are greater than 18. You would add the 2.35% and the 0.15% to get the answer: 2.50%.
This is all well and good if you're looking at areas that involve a whole number of standard deviations, but what about all the in-between numbers? For example, what if you wanted to know what percent of data values are within 1.5 standard deviations of the mean? This will be the subject of a future post.
Saturday, August 27, 2011
More About Standard Deviation: The Normal Distribution
I haven't said a lot about standard deviation up to now: not much more than the fact that it's a measure of spread for a symmetric dataset. However, we can also think of the standard deviation as a unit of measure of relative distance from the mean in a unimodal, symmetric distribution.
Suppose you had a perfectly symmetric, unimodal distribution. It would look like the well-known bell curve. Of course, in the real world, nothing is perfect. But in statistics, we talk about ideal distributions, known as "models." Real-life datasets can only approximate the ideal model...but we can apply many of the traits of models to them.
So let's talk about a perfectly symmetric, bell-shaped distribution for a bit. We call this model a Normal distribution, or Normal model. Because we're dealing with perfection, the mean and median are at the same point. In fact, there are an infinite number of Normal distributions with a particular mean. They only differ in width. Below are some examples of Normal models.
Notice that their widths differ. Another word for "width" is "spread"...which brings us back to standard deviation! Take a look at the curves above. In the center section, the shape looks like an upside-down bowl, whereas the outer "legs" look like part of a right-side-up bowl. Now imagine the points at which the right-side-up parts meet the upside-down part. Look below for the two blue dots in the diagram. (P.S. They are called "points of inflection," in case you were wondering.)
As shown above, if a line is drawn down the center, the distance from that line to a blue point is the length of one standard deviation. Can you see which lengths of standard deviations in the earlier examples are larger? Smaller?
So, there are two measures that define how a particular Normal model will look: the mean and the standard deviation.
I'd be remiss if I didn't tell you that there is a formula for numerically finding the standard deviation. Luckily, there's a lot of technology out there that automatically computes this for you. (I showed you how to do this in MS-Excel in an earlier post.)
Suppose you have a list of n data values, and when you look at a histogram of these values, you see that the distribution is unimodal and roughly symmetric. If we call the values x1, x2, x3, and so on, we compute the mean (average, remember?) and note it as x̄ (x with a bar over it). Then the formula is:
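s = √( ∑(x - x̄)² / (n - 1) )

...where the sum runs over every value x in the dataset.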
What does the ∑ mean? Let's take the formula apart. First, you find the difference between each data value and the mean of the whole dataset. You square it to make sure you're dealing only with positive values. The ∑ means you should add up all those positive squared answers, one for each value in your dataset. Once you have the sum, you divide by (n - 1), which gives you an average of all the squared differences from the mean. This measure (before taking the square root) is called the variance. When you take the square root, you have the value of the standard deviation. So, you see, the standard deviation is the square root of the average of the squared differences from the mean.
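If you'd rather see the formula as code, here's a minimal Python sketch (the dataset is made up purely for illustration):

import math
import statistics

data = [10, 12, 14, 14, 16, 18]

# Follow the formula step by step.
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / (len(data) - 1)   # divide by n - 1
sd = math.sqrt(variance)

print(sd)                       # about 2.83
print(statistics.stdev(data))   # the standard library agrees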
Well, that's a lot to digest, so I'll continue with the properties of the Normal model in my next post.
Saturday, August 20, 2011
Analyzing Quantitative Distributions
This post deals only with quantitative data.
When you have a quantitative dataset, it is always a good idea to look at a graphical display of it: usually a histogram or boxplot, although there are others. What you are looking to describe is the shape of the data (the subject of a recent post), the approximate center of the dataset, the spread of the values, and any unusual features (such as extremely low or high values -- outliers). We'll take them one at a time.
Shape
The shape of the dataset helps us determine how to report on the other features. We went into detail about shape in a very recent post ("The Shape of a Quantitative Distribution"). If the display is basically symmetric, you will use the mean to describe the center and a measure called the standard deviation to describe the spread. If the display is non-symmetric, you will use the median to describe the center and the interquartile range to describe the spread. Here's a handy chart to summarize this:

Symmetric: center = mean, spread = standard deviation
Non-symmetric: center = median, spread = interquartile range (IQR)
Center
In a symmetric distribution the mean and median are at approximately the same place; however, statisticians use the mean. In a non-symmetric distribution, the median is used because by its very definition, it is not calculated using any extreme points or points that skew the calculation.
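To see why the median holds up better, here's a tiny Python illustration with one made-up extreme value:

import statistics

data = [3, 4, 5, 6, 50]            # 50 is an outlier
print(statistics.mean(data))       # 13.6 -- pulled way up by the outlier
print(statistics.median(data))     # 5 -- unaffected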
Spread
Notice that the range (which is the difference between the largest and smallest data values in the set) is not generally used to describe the spread. In a symmetric distribution we use the standard deviation, which has a complicated formula but a simple description: the standard deviation is the square root of the average squared difference between each value in the dataset and the mean of the dataset. You can find the standard deviation using a graphing calculator or a tool like MS-Excel. The symbol for the standard deviation of a dataset is s.
Standard deviations are relative. By that I mean that you can't tell just by looking at the value whether it's large or small -- it depends on the values in the dataset. Standard deviations are good for comparing spreads if you have two distributions. And we'll soon find out that the standard deviation has an extremely important use in statistics. In MS-Excel, for example, =AVERAGE(A1:A20) gives the mean of the values in cells A1 through A20, and =STDEV(A1:A20) gives their standard deviation.
In a non-symmetric distribution we use the interquartile range (IQR) because this tells the spread of the central 50% of the data values. Like the median, the IQR isn't influenced by outliers or skewness. In MS-Excel, =MEDIAN(A1:A20) gives the median, and =QUARTILE(A1:A20,3) - QUARTILE(A1:A20,1) gives the IQR.
Unusual Features
As mentioned above, unusual features include extreme points (if any), also known as outliers. In an earlier post, we covered how to determine the boundaries for outliers. If a data value lies outside the boundaries, we call it an outlier. If a value isn't quite an outlier but close, it's worth mentioning in a description of a distribution. Just call it an "extreme point."