Take the very short data set 40, 80, 86, 88, 90. Notice that 40 is extremely far from the other numbers; it is an extreme value. It pulls the average (mean) down so that it is unreasonably small relative to the majority of the other numbers. We call 40 in this set an outlier. Outliers can occur by coincidence, but just as often there is some reason behind them. Statisticians should at least try to see if there's an explanation when they run into an outlier. Then they can analyze the data set with that in mind. Sometimes it's good to do two analyses: one with the outlier, and one without it.
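To see how much the 40 pulls the mean down, here is a small Python sketch of the "two analyses" idea. The variable names are made up for illustration, and dropping the value 40 by hand is only a stand-in for whatever review a real analysis would do first.

```python
# The short data set from the text.
data = [40, 80, 86, 88, 90]

# Second analysis: the same data with the suspected outlier (40) left out.
without_outlier = [x for x in data if x != 40]

mean_all = sum(data) / len(data)
mean_trimmed = sum(without_outlier) / len(without_outlier)

print(mean_all, mean_trimmed)  # 76.8 86.0
```

With the 40 included, the mean (76.8) is lower than four of the five values; without it, the mean (86.0) sits in the middle of the remaining numbers.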
How do we tell if a number is extreme enough to be an outlier? Well, believe it or not, there is a mathematical way and it's pretty straightforward. To test for outliers, we need to know what the 1st and 3rd quartiles are, so we can compute the IQR. Here are the steps:
- Find the IQR.
- Multiply the IQR by 1.5.
- Add the resulting number to Q3 to get an upper boundary for outliers.
- Subtract the same number (the IQR times 1.5) from Q1 to get a lower boundary for outliers.
- If a number in the data set lies beyond either boundary, it is considered an outlier.
For our data set, Q1 = 80 and Q3 = 88, so the steps work out as follows:
- Q3 - Q1 = 88 - 80 = 8 (the IQR)
- IQR * 1.5 = 8 * 1.5 = 12
- Q3 + 12 = 88 + 12 = 100. This is the upper boundary for outliers. Since our maximum, 90, is below 100, we have no high outliers.
- Q1 - 12 = 80 - 12 = 68. This is the lower boundary for outliers. Since our minimum number, 40, is less than 68, our data set has one low outlier: 40.
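The same check can be sketched in Python. This is a minimal sketch, not the only way to do it: the function name `find_outliers` and the parameter `k` are made up for illustration, and it assumes the "inclusive" quartile method of Python's `statistics.quantiles`, chosen because it reproduces Q1 = 80 and Q3 = 88 for this data set. Other quartile conventions can give different boundaries.

```python
from statistics import quantiles


def find_outliers(data, k=1.5):
    """Return (lower_boundary, upper_boundary, outliers) using the k * IQR rule."""
    # Assumption: "inclusive" quartiles, which give Q1 = 80 and Q3 = 88 here.
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1          # interquartile range
    step = k * iqr         # the IQR multiplied by 1.5
    lower = q1 - step      # lower boundary for outliers
    upper = q3 + step      # upper boundary for outliers
    # A value is an outlier only if it lies beyond a boundary.
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers


data = [40, 80, 86, 88, 90]
lower, upper, outliers = find_outliers(data)
print(lower, upper, outliers)  # 68.0 100.0 [40]
```

Keeping the multiplier as a parameter makes it easy to try a stricter or looser rule, though 1.5 is the value used in the steps above.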
Moral of the Story: How do we try to explain outliers? The best way is to examine the data: where it came from, what it represents, and where and how it was collected. In other words, look at the context. An outlier could turn out to be a simple mis-keying error, or it might indicate something more serious. Don't make something up! But do have a look to see if there's something obvious going on.