Statistics Without Tears: shape

Saturday, August 20, 2011

Analyzing Quantitative Distributions

This post deals only with quantitative data.

When you have a quantitative dataset, it is always a good idea to look at a graphical display of it: usually a histogram or boxplot, although there are others. What you are looking to describe is the shape of the data (the subject of a recent post), the approximate center of the dataset, the spread of the values, and any unusual features (such as extremely low or high values -- outliers). We'll take them one at a time.

Shape
The shape of the dataset helps us determine how to report on the other features. We went into detail about shape in a very recent post ("The Shape of a Quantitative Distribution"). If the display is basically symmetric, you will use the mean to describe the center and a measure called the standard deviation to describe the spread. If the display is non-symmetric, you will use the median to describe the center and the interquartile range to describe the spread. Here's a handy chart to summarize this.

Center
In a symmetric distribution the mean and median are at approximately the same place; by convention, statisticians report the mean. In a non-symmetric distribution, the median is used because, by its very definition, it depends only on the middle of the ordered data, so extreme points and long tails do not pull it the way they pull the mean.
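To see why the median is preferred when a distribution is not symmetric, here is a minimal Python sketch. The dataset is made up for illustration, and the variable names are my own; it simply adds one extreme value and compares how far the mean and the median move.

```python
from statistics import mean, median

# Hypothetical dataset, made up for illustration.
values = [4, 5, 5, 6, 7]
# The same data plus one extreme high value.
with_extreme = values + [40]

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_extreme), median(with_extreme)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean roughly doubles, while the median barely moves.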

Spread
Notice that the range (the difference between the largest and smallest data values in the set) is not generally used to describe the spread. In a symmetric distribution we use the standard deviation, which has a complicated formula but a simple description: the standard deviation is the square root of the average squared difference between each value in the dataset and the mean of the dataset. Put another way, it is roughly the typical distance of a value from the mean. You can find the standard deviation using a graphing calculator or a tool like MS-Excel. The symbol for standard deviation is s for a sample (the Greek letter σ, sigma, is used for a whole population).

Standard deviations are relative. By that I mean that you can't tell just by looking at its value whether it's large or small -- it depends on the values in the dataset. Standard deviations are good for comparing spreads if you have two distributions. And, we'll soon find out that the standard deviation has an extremely important use in statistics. Here's an example for finding the mean and standard deviation using MS-Excel:
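For readers working outside Excel, the same calculation can be sketched in Python. The dataset below is made up for illustration, and the sketch assumes you want the sample standard deviation, which divides by n - 1 rather than n (the same quantity Excel's STDEV function returns, with AVERAGE giving the mean).

```python
import math
from statistics import mean, stdev

# Hypothetical, roughly symmetric dataset (made up for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]

center = mean(data)

# Spell out the calculation: squared differences from the mean,
# averaged (dividing by n - 1 for a sample), then square-rooted.
squared_diffs = [(x - center) ** 2 for x in data]
sd_by_hand = math.sqrt(sum(squared_diffs) / (len(data) - 1))

# The standard library computes the same sample standard deviation.
sd = stdev(data)

print(f"mean = {center}, standard deviation = {sd:.3f}")
```

Computing it by hand next to the library call is only there to show what the function is doing; in practice you would call `stdev` directly.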

In a non-symmetric distribution we use the Interquartile Range (IQR) because this tells the spread of the central 50% of the data values. Like the median, the IQR isn't influenced by outliers or skewness. Here's an example of finding the median and IQR using MS-Excel:
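As a rough Python equivalent, the sketch below finds the median and IQR of a made-up, right-skewed dataset. One caveat: different tools use slightly different conventions for locating quartiles, so the quartiles from `statistics.quantiles` (Python 3.8 or later) may differ a little from what Excel or a calculator reports for the same data.

```python
from statistics import median, quantiles

# Hypothetical right-skewed dataset (made up for illustration).
data = [1, 2, 2, 3, 3, 4, 5, 7, 12, 30]

center = median(data)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)

# The IQR is the width of the middle 50% of the data.
iqr = q3 - q1

print(f"median = {center}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```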

Unusual Features
As mentioned above, unusual features include extreme points (if any), also known as outliers. In an earlier post, we covered how to determine the boundaries for outliers. If a data value lies outside the boundaries, we call it an outlier. If a value isn't quite an outlier but close, it's worth mentioning in a description of a distribution. Just call it an "extreme point."
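The earlier post with the boundary rule isn't reproduced here, so the sketch below assumes the common convention of placing the boundaries 1.5 IQRs below the first quartile and 1.5 IQRs above the third quartile. The function name `outlier_fences` and the dataset are made up for illustration.

```python
from statistics import quantiles

def outlier_fences(data, k=1.5):
    """Return (lower, upper) boundaries using the k * IQR convention (an assumption here)."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical right-skewed dataset (made up for illustration).
data = [1, 2, 2, 3, 3, 4, 5, 7, 12, 30]

lower, upper = outlier_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print(f"boundaries: {lower} to {upper}")
print("outliers:", outliers)
```

In this dataset only 30 falls outside the boundaries. The value 12 sits well above the rest of the data but inside the upper boundary, so it is the kind of value you might mention as an "extreme point."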

Tuesday, August 16, 2011

The Shape of a Quantitative Distribution

When you graph quantitative data, you can often see some kind of shape emerge. Here are some typical block structures that illustrate some possible shapes a display might take. We call these sets of values "distributions."

The first thing you need to determine is if there is any symmetry to the graph. If you were to visualize a vertical line going down the center, does each side look like a mirror image of the other? No real-life distribution will be perfectly symmetrical, but if it's close, it's worth mentioning.

(Self-Test): Which three of the above graphs look symmetric?
(Answer): B, D, and E.

The next thing you might notice is that some graphs have peaks whereas others look pretty level. We call the peaks "modes." If there's one peak, we say the graph has a unimodal distribution. If there are two peaks, the graph has a bimodal distribution.

(Self-Test): Which three of the above graphs look unimodal? Which is bimodal?
(Answer): A, B, and C are unimodal; D is bimodal.

FYI, a graph that's mostly level-looking, like graph E, is called uniform.

Now take a look at distributions A and C. Do you see that each has a "tail" at one end? When a distribution is off-center (compared to graph B), we say it is skewed. The direction of the tail is the direction of the skewness. For example, graph A is skewed left because the tail is on the left side of the graph. Graph C is skewed right.
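If you want to see a tail for yourself, here is a small Python sketch that prints a sideways text histogram. The data and the helper name `text_histogram` are made up for illustration; the dataset is built to be skewed left, with most values high and a few trailing off toward the low end.

```python
from collections import Counter

def text_histogram(values):
    """Return one line per whole number from min to max, with one '#' per occurrence."""
    counts = Counter(values)
    return [f"{v:>2} | {'#' * counts[v]}" for v in range(min(values), max(values) + 1)]

# Hypothetical left-skewed data: most values are high, with a tail toward low values.
skewed_left = [1, 3, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8]

lines = text_histogram(skewed_left)
for line in lines:
    print(line)
```

Because the bars run sideways, low values are at the top of the printout. The short bars at the top are the tail, which would sit on the left side of an ordinary histogram.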

When we describe a distribution we try to describe its shape, center, and spread, plus anything unusual about it, such as outliers. This will be the subject of my next post.

(Self-Test): Which of the distributions pictured above might have outliers?
(Answer): Skewed graphs, like A and C, depending on the length of their tails. The longer the tail, the more likely there are outliers.

How's it going so far?
