Statistics Without Tears: IQR

Saturday, August 20, 2011

Analyzing Quantitative Distributions

This post deals only with quantitative data.

When you have a quantitative dataset, it is always a good idea to look at a graphical display of it: usually a histogram or boxplot, although there are others. What you are looking to describe is the shape of the data (the subject of a recent post), the approximate center of the dataset, the spread of the values, and any unusual features (such as extremely low or high values -- outliers). We'll take them one at a time.

Shape
The shape of the dataset helps us determine how to report on the other features. We went into detail about shape in a very recent post ("The Shape of a Quantitative Distribution"). If the display is basically symmetric, you will use the mean to describe the center and a measure called the standard deviation to describe the spread. If the display is non-symmetric, you will use the median to describe the center and the interquartile range to describe the spread. Here's a handy chart to summarize this:

Shape of display    Center    Spread
Symmetric           Mean      Standard deviation
Non-symmetric       Median    Interquartile range (IQR)

Center
In a symmetric distribution the mean and median are at approximately the same place; however, statisticians use the mean. In a non-symmetric distribution, the median is used because by its very definition, it is not calculated using any extreme points or points that skew the calculation.

Spread
Notice that the range (which is the difference between the largest and smallest data values in the set) is not generally used to describe the spread. In a symmetric distribution we use the standard deviation, which has a complicated formula but a simple description: it is the square root of the average squared difference between each value in the dataset and the mean of the dataset. (Without the square root, that average is called the variance.) Loosely speaking, the standard deviation tells you the typical distance of a value from the mean. You can find the standard deviation using a graphing calculator or a tool like MS-Excel. The symbol for the standard deviation is s for a sample (σ, the Greek letter sigma, for a whole population).

Standard deviations are relative. By that I mean that you can't tell just by looking at its value whether it's large or small -- it depends on the values in the dataset. Standard deviations are good for comparing spreads if you have two distributions. And, we'll soon find out that the standard deviation has an extremely important use in statistics. Here's an example for finding the mean and standard deviation using MS-Excel:
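If you don't have Excel handy, the same two numbers can be found with a few lines of Python. Treat this as a rough sketch rather than the official recipe: the data are the twelve scores from the "Percentiles and Quartiles" post further down this page, and I'm assuming you want the sample standard deviation (the one that divides by n - 1, which is what Excel's STDEV function reports).

```python
from math import sqrt
from statistics import mean, stdev

# Twelve scores borrowed from the "Percentiles and Quartiles" post.
scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]

center = mean(scores)   # the arithmetic mean
spread = stdev(scores)  # the sample standard deviation (divides by n - 1)

# The same standard deviation worked out by hand:
# square each distance from the mean, add them up,
# divide by n - 1, and take the square root.
squared_distances = [(x - center) ** 2 for x in scores]
by_hand = sqrt(sum(squared_distances) / (len(scores) - 1))

print(f"mean = {center:.2f}")
print(f"standard deviation = {spread:.2f}")
print(f"standard deviation by hand = {by_hand:.2f}")
```

The last two lines should print the same value, which is a handy check that the library function and the hand calculation agree.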

In a non-symmetric distribution we use the Interquartile Range (IQR) because this tells the spread of the central 50% of the data values. Like the median, the IQR isn't influenced by outliers or skewness. Here's an example of finding the median and IQR using MS-Excel:
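Here is a comparable sketch in Python for the median and IQR. The five values come from the outlier post further down this page. One assumption to flag: there are several slightly different rules for computing quartiles, and I've picked Python's "inclusive" method, which, as far as I can tell, is the same rule Excel's QUARTILE function uses. Other tools may give slightly different quartiles for the same data.

```python
from statistics import median, quantiles

# A small non-symmetric data set with one very low value.
data = [40, 80, 86, 88, 100]

center = median(data)

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"median = {center}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Notice that the 40 has no effect on the IQR here: the middle 50% of the data sits between 80 and 88 no matter how low the minimum goes.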

Unusual Features
As mentioned above, unusual features include extreme points (if any), also known as outliers. In an earlier post, we covered how to determine the boundaries for outliers. If a data value lies outside the boundaries, we call it an outlier. If a value isn't quite an outlier but close, it's worth mentioning in a description of a distribution. Just call it an "extreme point."

Sunday, July 31, 2011

The Interquartile Range and Outliers

In the previous post, I introduced percentiles and quartiles and said that the Interquartile Range (IQR) is found by subtracting Q1 (the first quartile) from Q3 (the third quartile). The IQR is significant because it tells us where the middle 50% of the numbers in the data set lie. This is especially helpful to know if there are extreme values at the low and/or high end of the data set.

Take the very short data set 40, 80, 86, 88, 100. Notice that the 40 is extremely far away from the other numbers; it is considered an extreme value. It would make the average (mean) unreasonably small with respect to the majority of the other numbers. We call 40 in this set an outlier. Outliers can occur by coincidence but just as often there is some reason behind them. Statisticians should at least try to see if there's an explanation when they run into an outlier. Then they can analyze the data set with that in mind. Sometimes it's good to do two analyses: one with the outlier, and one without it.
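To see how much one extreme value can move things, here is a small Python sketch of the "two analyses" idea, using the data set as it appears in the worked example below (40, 80, 86, 88, 100). It is only an illustration; which value counts as the outlier is typed in by hand here rather than detected.

```python
from statistics import mean, median

with_outlier = [40, 80, 86, 88, 100]
# Drop the suspected outlier (40) for the second analysis.
without_outlier = [x for x in with_outlier if x != 40]

mean_with = mean(with_outlier)
mean_without = mean(without_outlier)
median_with = median(with_outlier)
median_without = median(without_outlier)

print(f"mean:   {mean_with} with the 40, {mean_without} without it")
print(f"median: {median_with} with the 40, {median_without} without it")
```

The mean jumps by almost 10 points when the 40 is removed, while the median moves by just 1. That is the reason the median is preferred when a data set has extreme values.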

How do we tell if a number is extreme enough to be an outlier? Well, believe it or not, there is a mathematical way and it's pretty straightforward. To test for outliers, we need to know what the 1st and 3rd quartiles are, so we can compute the IQR. Here are the steps:

1. Find the IQR.
2. Multiply the IQR by 1.5.
3. Add the resulting number to Q3 to get an upper boundary for outliers.
4. Subtract the same resulting number (from step 2) from Q1 to get a lower boundary for outliers.
5. If a number in the data set lies beyond either boundary, it is considered an outlier.

In the example above (40, 80, 86, 88, 100), Q1 is 80 and Q3 is 88. Going through the steps...

Q3 - Q1 = 88 - 80 = 8 (The IQR)
IQR * 1.5 = 8 * 1.5 = 12
Q3 + 12 = 88 + 12 = 100. This is the upper boundary for outliers. Since our maximum is right at 100, we have no high outliers.
Q1 - 12 = 80 - 12 = 68. This is the lower boundary for outliers. Since our minimum number, 40, is less than 68, our data set has one low outlier: 40.
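The steps above translate almost word for word into code. Here is a minimal Python sketch; the function names outlier_bounds and find_outliers are my own inventions for this example, and the quartiles are passed in by hand (80 and 88, as in the example) rather than computed.

```python
def outlier_bounds(q1, q3, multiplier=1.5):
    """Return the (lower, upper) boundaries for outliers."""
    iqr = q3 - q1            # find the IQR
    step = multiplier * iqr  # multiply the IQR by 1.5
    return q1 - step, q3 + step


def find_outliers(values, q1, q3):
    """Return every value that lies beyond either boundary."""
    lower, upper = outlier_bounds(q1, q3)
    # A value sitting exactly on a boundary is not beyond it,
    # so strict comparisons are used.
    return [x for x in values if x < lower or x > upper]


data = [40, 80, 86, 88, 100]
lower, upper = outlier_bounds(q1=80, q3=88)
outliers = find_outliers(data, q1=80, q3=88)

print(f"lower boundary = {lower}, upper boundary = {upper}")
print(f"outliers: {outliers}")
```

Run on the example data, this flags only the 40. The 100 sits right on the upper boundary, so it is not counted as an outlier.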

Note that there can be any number of outliers: one on each side, two high ones and no low ones, and so forth. If there are outliers, we note them as such, try to explain them if possible, and restate the minimum and maximum at the boundaries we found.

Moral of the Story: How do we try to explain outliers? The best way is to examine the data: where it came from, what it represents, and where and how it was collected. In other words, look at the context. An outlier could turn out to be a simple mis-keying error, or might indicate something more serious. Don't make something up! But do have a look to see if there's something obvious going on.

Friday, July 29, 2011

Percentiles and Quartiles

Suppose you just received your GMAT scores and found that your score was at the 85th percentile. A percentile is a number that tells you what percent of scores are below yours. So, in this example, being at the 85th percentile means that 85% of all the GMAT scores were below yours. Good job!

Suppose you take your baby in for a checkup with her pediatrician, who tells you that your baby's weight is at the 70th percentile. This means that 70% of all babies weigh less than yours. Not too hard, right?

"Percentile" contains the word percent, which is a number out of 100. This means that the list of numbers is batched into one hundred groups: each group contains 1/100 of the values.

There are other "-iles" that use numbers other than 100. For example, think of the word decile. This implies, as with decimals, the number 10. Regarding baby weights, the list of all weights is batched into ten groups of weights. Your baby (who was at the 70th percentile) would be at the 7th decile.
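The "percent of scores below yours" idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: percentile_rank is a name made up for this sketch, the scores are the twelve used later in this post, and real testing agencies may handle ties or rounding differently.

```python
def percentile_rank(scores, value):
    """Percent of the scores that are strictly below the given value."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)


scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]

# Ten of the twelve scores are below 84.
rank_of_84 = percentile_rank(scores, 84)
print(f"A score of 84 is at about the {rank_of_84:.0f}th percentile")
```

With only twelve scores the percentiles are coarse, of course. The GMAT and baby-weight examples rest on thousands of values, which is what makes a statement like "the 85th percentile" meaningful.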

In statistics, we work a lot with quartiles. This implies the number 4 (as in quarter, quartet, etc.). Imagine the numbers grouped into 4 batches (in numerical order, of course). Each quartile is the number marking off the end of a batch. The first quartile marks the end of the lowest 25% of the numbers in the set, the second quartile marks the end of the 2nd 25% of the numbers in the set, and so on. By the way, the 2nd quartile is also known as the median, which appeared in an earlier post. The median marks the point at which half the numbers are below and half are above.

An example using real data might help out. Think of the following scores, which I've arranged in increasing order:
52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93.

There are 12 scores, so divide them equally into 4 groups:
52 54 61 / 63 68 68 / 72 75 82 / 82 84 93.

The numbers dividing each of these groups -- 62, 70, 82 -- are the 1st, 2nd (median), and 3rd quartiles. (The 4th quartile is also known as the maximum.)

Or, you could first find the median of the whole list, as illustrated in my last post. That's the 2nd quartile, also known as Q2. To find Q1, the first quartile, find the median of the lower half (i.e., from the minimum to the median). To find Q3, find the median of the upper half of the scores (i.e., from median to maximum).
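The "median of each half" method is easy to try out in Python. A caution before the code: this is one sketch of the method, not the only convention. The function name quartiles is my own, and I am assuming that when the list has an odd number of values the median itself belongs to both halves (that reading of "from the minimum to the median" is the one that reproduces Q1 = 80 and Q3 = 88 for the five-number example in the outlier post). Calculators and software packages do not all agree on this detail.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-each-half method."""
    data = sorted(values)
    n = len(data)
    # For an odd count, the middle value is shared by both halves.
    half = (n + 1) // 2
    lower_half = data[:half]
    upper_half = data[n - half:]
    return median(lower_half), median(data), median(upper_half)


scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]
q1, q2, q3 = quartiles(scores)
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
```

For the twelve scores this gives the same 62, 70, and 82 found above by splitting the list into four groups.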

Just one more thing: even though each quarter contains the same number of scores, that doesn't mean the spans of the quarters are equal if you were to mark each on a number line. Look at the 12 scores above. The lowest fourth ranges from 52 to 61: a span of 9 units. The next fourth is only 5 units wide (63 to 68). The one after that is 10 units wide: from 72 to 82. The last group ranges from 82 to 93, 11 units wide. So if we were to portray the quartiles on a straight line they would appear at the slash (/) marks:

----------/-------/-----------/-----------
52        62      70          82        93
min       Q1      median      Q3       max

If I surround Q1, the median, and Q3 with a box, I get a picture similar to what's called a boxplot in statistics:
          +-------+-----------+
|---------|       |           |----------|
          +-------+-----------+
52        62      70          82        93
min       Q1      median      Q3       max
      (Excuse the crudeness of the diagram -- I confess I have a lot to learn about HTML!)

By the way, the minimum, Q1, the median, Q3, and the maximum are also known as the 5-Number Summary and are considered a great way to describe any set of numerical (quantitative) data. Here's another measure that is important: the Interquartile Range (or IQR), which is found by subtracting Q1 from Q3. The IQR gives the span of the middle 50% of the data set. This comes in handy when we're trying to describe a set of numbers that has extreme values (outliers).
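For anyone who wants to check the numbers, here is a short Python sketch that pulls together the 5-Number Summary and the IQR for the twelve scores. The function name five_number_summary is made up for this example, and it assumes the median-of-each-half rule for quartiles described earlier in this post; other quartile rules can give slightly different values.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # an odd-count median is shared by both halves
    return (
        data[0],
        median(data[:half]),
        median(data),
        median(data[n - half:]),
        data[-1],
    )


scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]
minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1

# Width of each of the four regions of the boxplot.
widths = [q1 - minimum, med - q1, q3 - med, maximum - q3]

print(f"5-Number Summary: {minimum}, {q1}, {med}, {q3}, {maximum}")
print(f"IQR = {iqr}")
print(f"region widths = {widths}")
```

The four widths come out different from one another even though each region holds a quarter of the scores.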

Moral of the Story: In a boxplot, each of the 4 regions bounded by the 5 numbers contains 25% of the numbers in the data set. This does not mean that each region has the same width.

Self-Test: The median represents what percentile?

Answer: Because the median marks the point at which half the numbers are below and half are above, the median represents the 50th percentile.

How's it going so far?