## Friday, July 29, 2011

### Percentiles and Quartiles

Suppose you just received your GMAT scores and found that your score was at the 85th percentile. A percentile is a number that tells you what percent of scores are below yours. So, in this example, being at the 85th percentile means that 85% of all the GMAT scores were below yours. Good job!
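To make the definition concrete, here is a minimal sketch that counts what percent of a list of scores fall below a given score. The function name `percent_below` and the twenty scores are made up for illustration; they are not real GMAT data.

```python
def percent_below(values, x):
    """Return the percent of values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical exam scores, invented for this example
all_scores = [400, 420, 450, 470, 480, 500, 510, 520, 540, 550,
              560, 580, 590, 600, 610, 620, 640, 660, 700, 740]
my_score = 660

# 17 of the 20 scores are below 660
print(percent_below(all_scores, my_score))  # prints 85.0
```

With these numbers, a score of 660 sits at the 85th percentile because 17 of the 20 scores are lower.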

Suppose you take your baby in for a checkup with her pediatrician, who tells you that your baby's weight is at the 70th percentile. This means that 70% of babies her age weigh less than she does. Not too hard, right?

"Percentile" contains the word percent, which means a number out of 100. So the ordered list of values is batched into one hundred groups, and each group contains 1/100 of the values.

There are other "-iles" that use numbers other than 100. For example, think of the word decile. This implies, as with decimals, the number 10. Regarding baby weights, the list of all weights is batched into ten groups of weights. Your baby (who was at the 70th percentile) would be at the 7th decile.

In statistics, we work a lot with quartiles. This implies the number 4 (as in quarter, quartet, etc.). Imagine the numbers grouped into 4 batches (in numerical order, of course). Each quartile is the number marking the end of a batch. The first quartile marks the end of the lowest 25% of the numbers in the set, the second quartile marks the end of the second 25%, and so on. By the way, the 2nd quartile is also known as the median, which appeared in an earlier post. The median marks the point at which half the numbers are below and half are above.

An example using real data might help out. Think of the following scores, which I've arranged in increasing order:
52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93.

There are 12 scores, so divide them equally into 4 groups:
52 54 61 / 63 68 68 / 72 75 82 / 82 84 93.

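The numbers dividing each of these groups -- 62, 70, 82 -- are the 1st, 2nd (median), and 3rd quartiles. Each one is the midpoint of the two scores on either side of a slash: 62 is halfway between 61 and 63, 70 is halfway between 68 and 72, and 82 is halfway between 82 and 82. (The 4th quartile is also known as the maximum.)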

Or, you could first find the median of the whole list, as illustrated in my last post. That's the 2nd quartile, also known as Q2. To find Q1, the first quartile, find the median of the lower half (i.e., from the minimum to the median). To find Q3, find the median of the upper half of the scores (i.e., from median to maximum).
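Here is a short sketch of that "median of each half" approach, using Python's built-in `statistics.median`. The helper name `quartiles` is mine. For a list with an odd number of values I've assumed the middle value is left out of both halves; textbooks differ on that convention, so treat it as one choice among several.

```python
from statistics import median

scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # Assumption: with an odd count, skip the middle value in both halves
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

q1, q2, q3 = quartiles(scores)
print(q1, q2, q3)  # prints 62.0 70.0 82.0
```

These match the 62, 70, and 82 found by splitting the 12 scores into four groups.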

Just one more thing: Even though each fourth of the data contains the same number of scores, that doesn't mean the spans of the fourths are equal when you mark them on a number line. Look at the 12 scores above. The lowest fourth ranges from 52 to 61: a span of 9 units. The next fourth is only 5 units wide (63 to 68). The one after that is 10 units wide: from 72 to 82. The last group ranges from 82 to 93, 11 units wide. So if we were to portray the quartiles on a straight line, they would appear at the slash (/) marks:
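If you'd like to check those widths yourself, this small sketch splits the 12 scores into four equal groups and subtracts the smallest score in each group from the largest. The variable names are just for illustration, and the even split only works this neatly because 12 divides by 4.

```python
scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]

# 12 scores split evenly into 4 groups of 3 (already in increasing order)
size = len(scores) // 4
groups = [scores[i:i + size] for i in range(0, len(scores), size)]

# Width of each group: largest score minus smallest score
spans = [group[-1] - group[0] for group in groups]

print(groups)  # prints [[52, 54, 61], [63, 68, 68], [72, 75, 82], [82, 84, 93]]
print(spans)   # prints [9, 5, 10, 11]
```

Same number of scores in every group, but four different widths.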

- - - - - - - - - / - - - - - / - - - - - - - - - - - / - - - - - - - - - - -
52                       62              70                              82                               93
min                    Q1              median                      Q3                              max

If I surround Q1, the median, and Q3 with a box, I get a picture similar to what's called a boxplot in statistics:
___________________
|- - - - - - - - | - - - - - | - - - - - - - - - - - | - - - - - - - - - - -|
|______|_____________|
52                   62           70                            82                           93
min                 Q1          median                    Q3                          max

(Excuse the crudeness of the diagram -- I confess I have a lot to learn about HTML!)

By the way, the minimum, Q1, the median, Q3, and the maximum are together known as the 5-Number Summary and are considered a great way to describe any set of numerical (quantitative) data. Here's another important measure: the Interquartile Range (or IQR), which is found by subtracting Q1 from Q3 (that is, IQR = Q3 - Q1). The IQR gives the span of the middle 50% of the data set. This comes in handy when we're trying to describe a set of numbers that has extreme values (outliers), because those extremes lie outside the middle 50% and so don't affect the IQR.
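As a rough sketch, here is the 5-Number Summary and the IQR for the 12 scores from earlier. It relies on the list having an even number of values so that it splits cleanly into a lower and an upper half; the variable names are my own.

```python
from statistics import median

scores = sorted([52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93])

# Even number of scores, so the list splits cleanly into two halves
half = len(scores) // 2
q1 = median(scores[:half])   # median of the lower half
q3 = median(scores[half:])   # median of the upper half

five_number_summary = (min(scores), q1, median(scores), q3, max(scores))
iqr = q3 - q1

print(five_number_summary)  # prints (52, 62.0, 70.0, 82.0, 93)
print(iqr)                  # prints 20.0
```

So the middle 50% of these scores spans 20 points, from 62 to 82.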

Moral of the Story: In a boxplot, each of the 4 regions bounded by the 5 numbers contains 25% of the numbers in the data set. This does not mean that each region has the same width.

Self-Test: The median represents what percentile?

Answer: Because the median marks the point at which half the numbers are below and half are above, the median represents the 50th percentile.