Sunday, August 14, 2011

Picturing Quantitative Data

Just in case you didn't read the post "Picturing Categorical Data," here's a tiny bit of rehash: The first thing to note is that different types of data are pictured in different ways. You might recall from an earlier posting that there are two types of data:
• Quantitative data are numbers on which it makes sense to do arithmetic. For example, with test scores, dollars, miles, etc., it is meaningful to, say, find their average. Quantitative data usually have labels (units, such as dollars or miles).
• Categorical data are the levels (categories) of a variable that you do not combine with arithmetic. You usually count them. Examples: eye colors (blue, brown, green, hazel), education levels (high school, undergrad, graduate, etc.). Some categorical variables are numbers, but not ones that make sense to add; examples are zip codes, numbers on athletic jerseys, phone numbers.
This posting deals with quantitative data only. We dealt with categorical data in an earlier post.

There are many ways to picture quantitative data, but let me introduce two that are most used: the boxplot and the histogram. In both cases, let's use the following set of scores, which are arranged in descending order.

 98  95  92  91  91  90  89
 86  86  86  85  82  82  81
 81  75  75  75  74  73  72

Boxplots

(Self-Test): In the earlier post, entitled "Percentiles and Quartiles," we discussed the 5-number summary, consisting of which 5 measures?
(Answer): the minimum, the 1st quartile (Q1), the median, the 3rd quartile (Q3), and the maximum.

A boxplot uses these 5 numbers, arranged on a number line according to their values. As we discussed, each of the four regions marked off by these 5 numbers contains 1/4 of the values in the data set, even if the regions don't have the same width.

There are 21 numbers shown above. The minimum is 72 and the maximum is 98. Since the numbers are arranged in order, we can count to the 11th number to find the median: 85.

We now have 10 numbers on each side of the median. The 1st and 3rd quartiles are the medians of the lower 10 and upper 10 numbers, respectively. Remember: just look between the 5th and 6th numbers, counting from the bottom for Q1 and from the top for Q3. From the bottom, the 5th and 6th numbers are both 75, so Q1 is 75. From the top, they are 91 and 90, so Q3 is midway between 90 and 91.
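The counting steps above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `five_number_summary` is made up for illustration, and it assumes the "median of each half" method described in this post. Other software may use a different quartile convention and report slightly different values for Q1 and Q3.

```python
from statistics import median

# The 21 scores from the post, in descending order.
scores = [98, 95, 92, 91, 91, 90, 89,
          86, 86, 86, 85, 82, 82, 81,
          81, 75, 75, 75, 74, 73, 72]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Hypothetical helper for illustration; the middle value itself is left
    out of both halves when the number of values is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # the values below the median position
    upper = ordered[n - half:]    # the values above the median position
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

print(five_number_summary(scores))
```

For the scores above, the result agrees with the hand count: 72, 75, 85, 90.5, and 98.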

So, our 5-number summary is: 72 (min), 75 (Q1), 85 (median), 90.5 (Q3), and 98 (max). To make a boxplot, draw a number line that stretches from, say, 70 to 100 and place the 5 numbers in their proper places. Then draw a box around the middle 3 numbers to show the span of the middle 50% of the number set:
A boxplot is helpful not only to see the spread of the data set, but also to see its symmetry. For example, the stretch between the minimum and the median is about the same size as the stretch between the median and the maximum, which suggests that the mean lines up pretty well with the median. (Recall that it doesn't always end up this way!) The outer regions (which we call the "whiskers"), however, are uneven: the lower whisker is much shorter than the upper one, telling us that the data are bunched more heavily toward the low side of the number line. If the boxplot looked like the following, the data would be roughly symmetric:

Symmetry in a data set is important to recognize, because when a data set is symmetric we can assume other important traits. More about this in a future post.

Now, if a data set has outliers, they are shown with a symbol; for instance, a star:

In cases like this, the whisker is drawn to a new maximum: the largest value that still lies within the boundary for outliers. As I said in a previous post, you should look for possible reasons for outliers.
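Here is a rough Python sketch of how the "new maximum" could be found. This post doesn't restate where the boundary for outliers lies, so the sketch assumes the common convention of 1.5 times the interquartile range beyond Q1 and Q3. The function name `whisker_ends` and the extra score of 130 are hypothetical, added only to show an outlier being split off.

```python
from statistics import median

scores = [98, 95, 92, 91, 91, 90, 89,
          86, 86, 86, 85, 82, 82, 81,
          81, 75, 75, 75, 74, 73, 72]

def whisker_ends(values, k=1.5):
    """Return (low_end, high_end, outliers).

    Assumes the k * IQR rule for the outlier boundary (k = 1.5 is a common
    convention, not something stated in this post) and median-of-halves
    quartiles.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    # The whiskers stop at the most extreme values that are not outliers.
    return inside[0], inside[-1], outliers

print(whisker_ends(scores))           # the original scores
print(whisker_ends(scores + [130]))   # with one made-up extreme score
```

Under that assumption the original scores have no outliers, while the made-up 130 is flagged and the upper whisker still ends at 98.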

Histograms

A histogram tracks the number of observations (like the values in the above table) that lie within consecutive, same-width intervals on a number line. While most people do not make histograms by hand, we'll go through how it's done so you can understand what a histogram shows. Histograms are used heavily in statistics, usually to see the shape of the data distribution.

Start by examining the data values. They range from 72 to 98, and the range (as defined in statistics) is 98 - 72 or 26. We want to divide 26 into a number of equal-sized intervals, then find out how many of the numbers in the data set lie in each interval. We will represent that number as the height of that interval's bar.

How many intervals do we need? Well, to me, that's a "squishy" question. You want enough to be able to see a shape, but not so many that you hardly have any values in each interval. A rule of thumb that I use is between 6 and 8. Because our range is 26, I'm going to use 7 intervals, because 7*4 is 28, which covers the range of 26 more snugly than 6 or 8 intervals would (6*5 is 30 and 8*4 is 32). So we have the following number line. Each interval will be 4 units wide (again, because 7*4 is 28), so I've numbered the line accordingly.
(Self-Test): What's with the // on the number line above?
(Answer): Number lines should always be to scale, but in this case a number line going from 0 to 100 probably wouldn't fit on the page. So, I use the // to indicate a break in the numbering.
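The reasoning about how many intervals to use can be checked with a short Python sketch. It assumes whole-number interval widths and the 6-to-8 rule of thumb from this post; the function name `candidate_layouts` is made up for illustration.

```python
import math

def candidate_layouts(data_range, counts=(6, 7, 8)):
    """For each interval count, return (count, width, total span covered).

    Assumes whole-number widths: the width is the smallest integer that
    lets `count` intervals cover the range.
    """
    layouts = []
    for count in counts:
        width = math.ceil(data_range / count)
        layouts.append((count, width, count * width))
    return layouts

data_range = 98 - 72  # 26, as computed in the post
for count, width, covered in candidate_layouts(data_range):
    print(f"{count} intervals x width {width} covers {covered} "
          f"(overshoots the range by {covered - data_range})")
```

Of the three choices, 7 intervals of width 4 overshoot the range by the least, which matches the choice made above.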

Now I need to make a small decision: Shall I count the labelled numbers as the start or the end of the interval? It doesn't matter, just as long as you're consistent. I'm going to count the numbers as the start of each interval. So, my first interval will go from 72 to 75; my second will be from 76 to 79, and so on.

For each interval, you need to count how many values in the data set fall within that interval. For example, there are 6 numbers that are between 72 and 75, inclusive. So the bar for the first interval will be 6 units high.
Our second interval, which goes from 76 to 79, has no values. So we leave a space to show that zero values are in that interval. The next interval goes from 80 to 83, and there are four values that lie in that interval. So, the bar will be 4 units high.
Continuing on for the rest of the intervals, here is our finished product:

This histogram has an interesting shape. If the dataset represents grades, we can assume that there are 6 out of 21 "C" students, while twice that many are in the "B" to "A-" range. I'll be saying more about shapes of distributions in a later post. For now, suffice it to say that the shape of a distribution is an important thing to assess when you're looking at data.
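For readers who would rather not count by hand, the interval counts behind the histogram can be sketched in Python. This is a minimal sketch, not a plotting routine: the function name `interval_counts` is hypothetical, it assumes whole-number scores, and it uses the same seven intervals of width 4 starting at 72.

```python
scores = [98, 95, 92, 91, 91, 90, 89,
          86, 86, 86, 85, 82, 82, 81,
          81, 75, 75, 75, 74, 73, 72]

START, WIDTH, COUNT = 72, 4, 7   # intervals 72-75, 76-79, ..., 96-99

def interval_counts(values, start=START, width=WIDTH, count=COUNT):
    """Count how many values fall in each same-width interval."""
    counts = [0] * count
    for v in values:
        index = (v - start) // width
        if not 0 <= index < count:
            raise ValueError(f"{v} lies outside the intervals")
        counts[index] += 1
    return counts

counts = interval_counts(scores)

# Print a sideways text histogram: one '#' per value in the interval.
for i, c in enumerate(counts):
    low = START + i * WIDTH
    print(f"{low}-{low + WIDTH - 1}: {'#' * c}")
```

The first count is 6 and the second is 0, as found above. The three intervals from 80 to 91 together hold 12 values, which is the "twice that many" mentioned in the last paragraph.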