Statistics Without Tears: mode

When you graph quantitative data, you can often see some kind of shape emerge. Here are some typical block structures that illustrate the shapes a display might take. We call such a set of values a "distribution."

The first thing you need to determine is if there is any symmetry to the graph. If you were to visualize a vertical line going down the center, does each side look like a mirror image of the other? No real-life distribution will be perfectly symmetrical, but if it's close, it's worth mentioning.

(Self-Test): Which three of the above graphs look symmetric?
(Answer): B, D, and E.

The next thing you might notice is that some graphs have peaks whereas others look pretty level. We call the peaks "modes." If there's one peak, we say the graph has a unimodal distribution. If there are two peaks, the graph has a bimodal distribution.

(Self-Test): Which three of the above graphs look unimodal? Which is bimodal?
(Answer): A, B, and C are unimodal; D is bimodal.

FYI, a graph that's mostly level-looking, like graph E, is called uniform.

Now take a look at distributions A and C. Do you see that each has a "tail" at one end? When a distribution is off-center (compared to graph B), we say it is skewed. The direction of the tail is the direction of the skewness. For example, graph A is skewed left because the tail is on the left side of the graph. Graph C is skewed right because its tail is on the right side.

When we describe a distribution we try to describe its shape, center, and spread, plus anything unusual about it, such as outliers. This will be the subject of my next post.

(Self-Test): Which of the distributions pictured above might have outliers?
(Answer): Skewed graphs, like A and C, depending on the length of their tails. The longer the tail, the more likely there are outliers.

Let's begin with what might be familiar territory: how people describe a list of numbers (data!) using a single central measure. There are three of these numbers -- mean, median, and mode -- and which one is best heavily depends on the data you're describing. Let's go over how to determine each of these measures, and discuss the pros and cons of each.

Consider the following very short list of test grades one of my students received last semester: 74, 84, 88, 71, and 88.

To find the mean (also known as the average), just add up all of the numbers and divide by how many numbers are in the set. That is: (74+84+88+71+88) divided by 5 = 405 / 5 = 81. The symbol we'll use for the mean is

         The mean is good to use when there are no extreme values (numbers that lie far
         outside the span of the other numbers). There are none in this set of numbers, so
         we say there are no outliers and the mean is just fine to use.
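
If you'd like to check the arithmetic yourself, here is a minimal Python sketch of the add-then-divide recipe above. The variable names (grades, total, mean) are my own choices for illustration; only the five grades come from the example.

```python
# The five test grades from the example.
grades = [74, 84, 88, 71, 88]

# Add up all of the numbers, then divide by how many there are.
total = sum(grades)
mean = total / len(grades)

print(total, mean)  # 405 81.0
```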

To find the median (also known as the midpoint), arrange the grades in numerical order: 71, 74, 84, 88, 88. Note that we list duplicates as many times as they occur. Once the numbers are arranged, find the number that's in the middle of the list. In this short list, it's 84.

         Now, what if there is no middle number? For example, suppose we add a sixth
         score: 94. We now have the list, in order: 71, 74, 84, 88, 88, 94. When we look
         for the middle, there isn't a single score but two: 84 and 88. In this case, find the
         average of these two numbers: (84+88)/2 = 86.
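
The two cases (a single middle number, or two middle numbers that get averaged) can be sketched in Python like this. The function name median and the list names five and six are hypothetical, picked just for this illustration.

```python
def median(values):
    """Return the middle value of a list, averaging the two middle
    values when the list has an even number of entries."""
    ordered = sorted(values)      # arrange in numerical order, duplicates kept
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: one number sits in the middle
        return ordered[mid]
    # even count: average the two numbers that share the middle
    return (ordered[mid - 1] + ordered[mid]) / 2


five = [74, 84, 88, 71, 88]
six = five + [94]                 # add the sixth score

print(median(five))  # 84
print(median(six))   # 86.0
```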

         The median is good to use almost anytime, but is especially important to use when
         there are outliers. To see why the median is more accurate than the mean as a
         central measure when there are outliers, think about the salaries of a very small
         company, from the line workers to the CEO:

                 Line worker 1:       $ 28,000
                 Line worker 2:       $ 32,500
                 Line worker 3:       $ 33,100
                 Supervisor:           $ 45,000
                 Marketing person: $ 62,300
                 Sales person:        $ 70,000
                 CEO:                    $175,000

         Compare the mean ($63,700) to the median ($45,000). The mean isn't realistic
         because 5 out of 7 of the workers are making less! This is because the outlier,
         $175,000, inflates the calculation of the mean. On the other hand, the median is
         much more reflective of the central salary: 3 employees make more, 3 make less.
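
Here is a small sketch of the salary comparison using Python's built-in statistics module. The variable names are mine, and the last step (recomputing the mean with the CEO left out) is an extra check I've added to show how much one outlier moves the mean; it is not part of the original example.

```python
import statistics

# Salaries from the list above, lowest to highest (CEO last).
salaries = [28_000, 32_500, 33_100, 45_000, 62_300, 70_000, 175_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# How many people earn less than the mean?
below_mean = sum(1 for s in salaries if s < mean_salary)

# Extra check (my addition): the mean without the CEO's salary.
mean_without_ceo = statistics.mean(salaries[:-1])

print(f"mean:   ${mean_salary:,.0f}")             # mean:   $63,700
print(f"median: ${median_salary:,.0f}")           # median: $45,000
print(f"earning less than the mean: {below_mean} of {len(salaries)}")
print(f"mean without the CEO: ${mean_without_ceo:,.0f}")
```

With the CEO left out, the mean drops to $45,150, which is close to the median. That is the inflation effect in action.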

Finding the mode is easy, but it exists only if there are duplicates in the list. Simply find the number that occurs the most. In our original list of 5 test scores, 88 occurs twice, so the mode is 88. In our salary example directly above, there is no mode because no salary occurs more than once. If one number occurs twice and another number occurs four times, the latter is the mode because it occurs the most.

        The mode is probably the least useful as a measure of center. The only time it
        makes sense is if there are many occurrences of the same number, compared to
        the number of other values. Example: 57, 66, 75, 75, 75, 75, 75, 75, 82.
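
Counting occurrences by hand gets tedious, so here is a rough Python sketch that follows the convention used in this post: if nothing repeats, there is no mode. The function name modes is hypothetical, and it returns a list so that ties (as in a bimodal set) are all reported.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or an empty list when no
    value occurs more than once (the 'no mode' case in this post)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:                  # no duplicates at all, so no mode
        return []
    return [value for value, count in counts.items() if count == top]


print(modes([74, 84, 88, 71, 88]))                      # [88]
print(modes([28000, 32500, 33100, 45000, 62300, 70000, 175000]))  # []
print(modes([57, 66, 75, 75, 75, 75, 75, 75, 82]))      # [75]
```

A note on the design choice: Python's own statistics.mode (version 3.8 and later) returns the first value when nothing repeats, which does not match the "no mode" convention here. That is why this sketch counts with Counter instead.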

Moral of the Story: When you see the terms "mean" and "median" used in articles, do not assume that the writer is always using the right term. If you see the data, you can check this. Otherwise, the author might be confusing one measure for the other. Not everyone understands the difference, but now (hopefully) you do!

Self quiz: What are the measures of center (mean, median, and mode) for the following list of 8 student scores? 43, 77, 66, 73, 85, 75, 92, 81.

(Answer): Mean = 74; Median = 76; Mode = none. Which is the better measure? Technically, the median would be better because the low score of 43 is dragging the mean down a bit. However, it's pretty much a wash since the mean and median are so close to each other.
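
If you want to double-check the quiz answer, this short Python sketch recomputes all three measures. The variable names are my own, and "repeated" simply lists any score that occurs more than once (an empty list means there is no mode).

```python
import statistics
from collections import Counter

scores = [43, 77, 66, 73, 85, 75, 92, 81]

quiz_mean = statistics.mean(scores)       # sum is 592, divided by 8
quiz_median = statistics.median(scores)   # average of the two middle scores, 75 and 77

# Any score that appears more than once would be a candidate for the mode.
repeated = [score for score, count in Counter(scores).items() if count > 1]

print(quiz_mean, quiz_median, repeated)
```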


Tuesday, August 16, 2011

The Shape of a Quantitative Distribution

Wednesday, July 27, 2011

The Center: Mean, Median, and Mode