How's it going so far?

Sunday, July 31, 2011

The Interquartile Range and Outliers

In the previous post, I introduced percentiles and quartiles and said that the Interquartile Range (IQR) is found by subtracting Q1 (the first quartile) from Q3 (the third quartile). The IQR is significant because it tells us where the middle 50% of the numbers in the data set lie. This is especially helpful to know if there are extreme values at the low and/or high end of the data set.

Take the very short data set 40, 80, 86, 88, 100. Notice that the 40 is extremely far away from the other numbers; it is considered an extreme value. It makes the average (mean) unreasonably small with respect to the majority of the other numbers. We call 40 in this set an outlier. Outliers can occur by coincidence, but just as often there is some reason behind them. Statisticians should at least try to see if there's an explanation when they run into an outlier. Then they can analyze the data set with that in mind. Sometimes it's good to do two analyses: one with the outlier, and one without it.
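
To see how much one extreme value can pull the mean, here is a minimal Python sketch that runs the "two analyses" idea on the values 40, 80, 86, 88, 100. The variable names are just for illustration, and treating 40 as the outlier is taken as given here rather than tested.

```python
from statistics import mean

# The five values from the example; 40 is the suspected outlier.
data = [40, 80, 86, 88, 100]

# Second analysis: the same data with the suspected outlier left out.
without_outlier = [x for x in data if x != 40]

mean_with = mean(data)                # 394 / 5
mean_without = mean(without_outlier)  # 354 / 4

print("Mean with the outlier:   ", mean_with)
print("Mean without the outlier:", mean_without)
```

With the 40 included the mean comes out to 78.8, which is lower than four of the five values; without it the mean is 88.5.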

How do we tell if a number is extreme enough to be an outlier? Well, believe it or not, there is a mathematical way and it's pretty straightforward. To test for outliers, we need to know what the 1st and 3rd quartiles are, so we can compute the IQR. Here are the steps:
  1. Find the IQR.
  2. Multiply the IQR by 1.5.
  3. Add the resulting number to Q3 to get an upper boundary for outliers.
  4. Subtract the same resulting number (from #2) from Q1 to get a lower boundary for outliers.
  5. If a number in the data set lies beyond either boundary, it is considered an outlier.

In the example above (40, 80, 86, 88, 100), Q1 is 80 and Q3 is 88. Going through the steps...

  1. Q3 - Q1 = 88 - 80 = 8 (The IQR)
  2. IQR * 1.5 = 8 * 1.5 = 12
  3. Q3 + 12 = 88 + 12 = 100. This is the upper boundary for outliers. Since our maximum, 100, sits right at the boundary rather than beyond it, we have no high outliers.
  4. Q1 - 12 = 80 - 12 = 68. This is the lower boundary for outliers. Since our minimum number, 40, is less than 68, our data set has one low outlier: 40.

Note that there can be any number of outliers: one on each side, two high ones and no low ones, and so forth. If there are outliers, we note them as such, try to explain them if possible, and restate the minimum and maximum at the boundaries we found.
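
The steps above can be sketched in a few lines of Python. This is only an illustration: the function names `outlier_bounds` and `find_outliers` are made up for this example, and the quartiles are computed with `statistics.quantiles` using `method="inclusive"`, which is an assumption chosen because it reproduces Q1 = 80 and Q3 = 88 for this data set. Other quartile conventions can give different values.

```python
from statistics import quantiles

def outlier_bounds(q1, q3, multiplier=1.5):
    """Return (lower, upper) outlier boundaries from the two quartiles."""
    iqr = q3 - q1                # step 1: find the IQR
    step = multiplier * iqr      # step 2: multiply the IQR by 1.5
    return q1 - step, q3 + step  # steps 4 and 3: lower and upper boundary

def find_outliers(data, q1, q3):
    """Step 5: keep only the values lying beyond either boundary."""
    lower, upper = outlier_bounds(q1, q3)
    return [x for x in data if x < lower or x > upper]

data = [40, 80, 86, 88, 100]

# Assumption: the "inclusive" method matches the quartiles used in the post.
q1, _, q3 = quantiles(data, n=4, method="inclusive")

lower, upper = outlier_bounds(q1, q3)
outliers = find_outliers(data, q1, q3)

print("Q1 =", q1, "Q3 =", q3)
print("Boundaries:", lower, "to", upper)
print("Outliers:", outliers)
```

For this data the boundaries come out to 68 and 100, and the only value beyond them is 40. A value sitting exactly on a boundary, like the 100 here, is not flagged because the comparison is strict.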

Moral of the Story: How do we try to explain outliers? The best way is to examine the data: where it came from, what it represents, and where and how it was collected. In other words, look at the context. An outlier could turn out to be a simple mis-keying error, or it might indicate something more serious. Don't make something up! But do have a look to see if there's something obvious going on.

4 comments:

  1. Hi, I think now I have a strong hold over the topic after going through the post. The subject that you have discussed in the post is really amazing; I will surely come back for more information.

    Sorn
    www.gofastek.com

  2. This comment has been removed by the author.

  3. Good explanation. Does it apply to data sets with a non-normal distribution? What sample size does it apply to? Please advise.
    www.foolyourenemy.com

  4. blah blah blah , blah blah blah math blah blah lunch blah blah
