How's it going so far?

Sunday, July 31, 2011

The Interquartile Range and Outliers

In the previous post, I introduced percentiles and quartiles and said that the Interquartile Range (IQR) is found by subtracting Q1 (the first quartile) from Q3 (the third quartile). The IQR is significant because it tells us the span of the middle 50% of the numbers in the data set. This is especially helpful to know if there are extreme values at the low and/or high end of the data set.

Take the very short data set 40, 80, 86, 88, 100. Notice that the 40 is extremely far away from the other numbers; it is considered an extreme value. It would make the average (mean) unreasonably small with respect to the majority of the other numbers. We call 40 in this set an outlier. Outliers can occur by coincidence, but just as often there is some reason behind them. Statisticians should at least try to see if there's an explanation when they run into an outlier. Then they can analyze the data set with that in mind. Sometimes it's good to do two analyses: one with the outlier, and one without it.

How do we tell if a number is extreme enough to be an outlier? Well, believe it or not, there is a mathematical way and it's pretty straightforward. To test for outliers, we need to know what the 1st and 3rd quartiles are, so we can compute the IQR. Here are the steps:
  1. Find the IQR.
  2. Multiply the IQR by 1.5.
  3. Add the resulting number to Q3 to get an upper boundary for outliers.
  4. Subtract the same resulting number (from #2) from Q1 to get a lower boundary for outliers.
  5. If a number in the data set lies beyond either boundary, it is considered an outlier.
In the example above (40, 80, 86, 88, 100), Q1 is 80 and Q3 is 88. Going through the steps...

  1. Q3 - Q1 = 88 - 80 = 8 (The IQR)
  2. IQR * 1.5 = 8 * 1.5 = 12
  3. Q3 + 12 = 88 + 12 = 100. This is the upper boundary for outliers. Since our maximum is right at 100, we have no high outliers.
  4. Q1 - 12 = 80 - 12 = 68. This is the lower boundary for outliers. Since our minimum number, 40, is less than 68, our data set has one low outlier: 40.
Note that there can be any number of outliers: one on each side, two high ones and no low ones, and so forth. If there are outliers, we note them as such, try to explain them if possible, and restate the minimum and maximum at the boundaries we found.
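If you'd like to check the five steps with a few lines of code, here is a minimal Python sketch. The function names `outlier_fences` and `find_outliers` are my own, made up for illustration, and Q1 = 80 and Q3 = 88 are simply taken as given from the example rather than computed.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower boundary, upper boundary) for outliers."""
    iqr = q3 - q1            # Step 1: find the IQR
    step = k * iqr           # Step 2: multiply the IQR by 1.5
    upper = q3 + step        # Step 3: upper boundary
    lower = q1 - step        # Step 4: lower boundary
    return lower, upper


def find_outliers(values, q1, q3):
    """Step 5: keep only the values that lie beyond either boundary."""
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


data = [40, 80, 86, 88, 100]
print(outlier_fences(80, 88))       # (68.0, 100.0)
print(find_outliers(data, 80, 88))  # [40]
```

Notice that 100 sits exactly on the upper boundary, not beyond it, so the strict comparisons (`<` and `>`) leave it out of the list of outliers.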

Moral of the Story: How do we try to explain outliers? The best way is to examine the data: where it came from, what it represents, and where and how it was collected. In other words, look at the context. An outlier could turn out to be a simple mis-keying error, or might indicate something more serious. Don't make something up! But do have a look to see if there's something obvious going on.

Saturday, July 30, 2011

Connection or Causation?

I was reading an article the other day about a possible connection between mental illness and nose jobs, and thought, "Aha! A perfect possible example of crazy statistics!" But I was not only disappointed on that count, but also pleasantly surprised that the article showed valid statistical practices. What I can focus on in this post are two things:
  1. What valid statistical practices were used?
  2. How to avoid common pitfalls when reading an article that involves statistics.
First, you might want to quickly read the article:
Go ahead; I'll wait.

There are enough bad statistical articles out there, which I'm sure I'll happen upon and grab for another post, but as I said, this article is not one of them. First, the author was careful to distinguish between people who have valid medical reasons to seek plastic surgery on their noses and those who appear to be negatively obsessed with a perfectly normal-looking (from a plastic surgeon's standpoint) nose.

This was a "controlled" study, as statisticians say. First, there was a rather large sample (266) of patients seeking nose jobs in Belgium, who filled out a diagnostic questionnaire that was intended to uncover a condition called Body Dysmorphic Disorder (BDD). The author included all of the vital information -- number of people, location of the study, duration of the study, and a link to the full journal article so that the study could be reproduced if desired. Rather than worrying that a duplicate study will refute the findings of the original study, most statisticians welcome "do-overs" of a study, because they can either strengthen the findings or point out something the researchers might have overlooked. After all, from an ethical standpoint, most researchers are after the truth, so they try to make the specifics as transparent as possible.

Separating those patients who had a "valid" complaint about their nose (for instance, a breathing problem) from those who showed signs of BDD was an excellent example of controlling the study. For example, if they hadn't done this and had gone on conducting the study, they couldn't have separated the "valid" patients from the BDD patients later. By controlling the study, they were able to determine that only 2% of the "valid" patients showed evidence of BDD while 43% of the "invalid" patients did. If everyone had been lumped together, the results might have been diluted. Researchers need to think of possible variables ahead of time that could muddy the waters later. The variable of "valid" versus "invalid" was one of these. We call these variables lurking variables, because they can work under the surface, making it look like one thing (seeking a nose job) is causing the other (BDD)... or vice versa.

Another interesting and admirable thing the article did was volunteer extra information to answer a question the reader might have had upon reading the article; that is, the author singled out nose jobs from other plastic surgery as being notable with respect to this connection. The author seems to have thought things through very well.

Now, on to you as a reader of such articles. My only caution in this case is that you don't read any type of cause and effect situation into the article. Having BDD doesn't necessarily cause people to seek a nose job, although there seems to be a very strong connection between the two. Further study and repetition of the study would be needed to prove cause and effect. To the author's credit, the article was careful not to lead you in the causation direction.

Simply put: Just because two variables appear to be connected doesn't prove that one variable causes another. For those readers who recall the "old days" when people smoked cigarettes and didn't know about the dangers of lung cancer, remember how it took years to get even the mildest cautionary note placed on cigarette packs? The first such caution cited a connection between cigarette smoking and lung cancer. After years of further controlled study, researchers were finally able to put stronger cautions on cigarette packs: that cigarette smoking causes lung cancer.

Moral of the Story: This was an example of an article in which the author was very careful not to imply cause and effect. Other articles aren't so clear. As you read articles that involve numbers and statistics, be aware of this and don't jump to causation conclusions!

Self-Test: Name a lurking variable in this statement: "The more firefighters sent to a fire, the more damage is done to the structure."

Answer: Size of the fire is the lurking variable. It influences both variables: the amount of damage done, and the number of firefighters sent to the scene.

(Credit: Stats: Modeling the World, Bock, Velleman, De Veaux. Thanks!)
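To see how a lurking variable can manufacture a connection, here is a small Python simulation. Everything in it is made up for illustration: the fire sizes, the formulas, and the noise levels are my own assumptions, not real fire data. In this toy model the number of firefighters never enters the damage formula, yet the two end up strongly correlated because both depend on fire size.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical model: fire size (arbitrary units) is the lurking variable.
fire_size = [rng.uniform(1, 10) for _ in range(2000)]
# Each quantity depends on fire size plus noise; neither depends on the other.
firefighters = [2 * s + rng.gauss(0, 1) for s in fire_size]
damage = [5 * s + rng.gauss(0, 3) for s in fire_size]

r_all = pearson_r(firefighters, damage)

# Now compare only fires of nearly the same size.
similar = [(f, d) for s, f, d in zip(fire_size, firefighters, damage)
           if 5.0 <= s < 5.5]
r_similar = pearson_r([f for f, _ in similar], [d for _, d in similar])

print(f"All fires:          r = {r_all:.2f}")
print(f"Similar-size fires: r = {r_similar:.2f}")
```

Across all fires the correlation comes out strong, but among fires of about the same size it is much weaker. That is the signature of a lurking variable at work.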

Friday, July 29, 2011

Percentiles and Quartiles

Suppose you just received your GMAT scores and found that your score was at the 85th percentile. A percentile is a number that tells you what percent of scores are below yours. So, in this example, being at the 85th percentile means that 85% of all the GMAT scores were below yours. Good job!

Suppose you take your baby in for a checkup with her pediatrician, who tells you that your baby's weight is at the 70th percentile. This means that 70% of all babies weigh less than yours. Not too hard, right?
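Here is a minimal Python sketch of that definition: count how many values fall below yours and turn the count into a percent. The function name `percent_below` is my own invention, and the list of scores is borrowed from the quartile example further down in this post. Real testing agencies may handle ties and rounding differently, so treat this as an illustration only.

```python
def percent_below(scores, value):
    """Percent of the scores that are strictly below `value`."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)


scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]
print(percent_below(scores, 63))  # 25.0 (3 of the 12 scores are below 63)
print(percent_below(scores, 72))  # 50.0 (6 of the 12 scores are below 72)
```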

"Percentile" contains the word percent, which is a number out of 100. This means that the list of numbers is batched into one hundred groups: each group contains 1/100 of the values.

There are other "-iles" that use numbers other than 100. For example, think of the word decile. This implies, as with decimals, the number 10. Regarding baby weights, the list of all weights is batched into ten groups of weights. Your baby (who was at the 70th percentile) would be at the 7th decile.

In statistics, we work a lot with quartiles. This implies the number 4 (as in quarter, quartet, etc.). Imagine the numbers grouped into 4 batches (in numerical order, of course). Each quartile is the number marking off each batch. The first quartile marks the end of the lowest 25% of the numbers in the set, the second quartile marks the end of the 2nd 25% of the numbers in the set, and so on. By the way, the 2nd quartile is also known as the median, which appeared in an earlier post. The median marks the point at which half the numbers are below and half are above.

An example using real data might help out. Think of the following scores, which I've arranged in increasing order:
52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93.

There are 12 scores, so divide them equally into 4 groups:
52 54 61 / 63 68 68 / 72 75 82 / 82 84 93.

The numbers dividing each of these groups -- 62, 70, 82 -- are the 1st, 2nd (median), and 3rd quartiles. (The 4th quartile is also known as the maximum.)

Or, you could first find the median of the whole list, as illustrated in my last post. That's the 2nd quartile, also known as Q2. To find Q1, the first quartile, find the median of the lower half (i.e., from the minimum to the median). To find Q3, find the median of the upper half of the scores (i.e., from median to maximum).
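The "median of each half" recipe can be sketched in Python as follows. The function names are my own, and one detail is an assumption on my part: when the list has an odd number of values, I include the median in both halves, reading "from the minimum to the median" literally. Other textbooks and software use slightly different conventions, so their quartiles may not match these exactly.

```python
def median(sorted_values):
    """Median of a list that is already in increasing order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    # No single middle number: average the two middle ones.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    if len(data) % 2 == 0:
        lower, upper = data[:half], data[half:]
    else:
        # Assumption: the median belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(data), median(upper)


scores = [52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93]
print(quartiles(scores))  # (62.0, 70.0, 82.0)
```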

One more thing: the fact that each quartile contains the same number of scores doesn't mean that the spans of the scores in each quartile are equal if you were to mark each on a number line. Look at the 12 scores above. The lowest fourth ranges from 52 to 61: a span of 9 units. The next fourth is only 5 units wide (63 to 68). The one after that is 10 units wide: from 72 to 82. The last group ranges from 82 to 93, 11 units wide. So if we were to portray the quartiles on a straight line they would appear at the slash (/) marks:

  - - - - - - - - - / - - - - - / - - - - - - - - - - - / - - - - - - - - - - -
  52                62          70                      82                   93
  min               Q1        median                    Q3                  max

If I surround Q1, the median, and Q3 with a box, I get a picture similar to what's called a boxplot in statistics:
  |- - - - - - - - | - - - - - | - - - - - - - - - - - | - - - - - - - - - - -|
  52               62          70                      82                    93
  min              Q1        median                    Q3                   max

      (Excuse the crudeness of the diagram -- I confess I have a lot to learn about HTML!)

By the way, the minimum, Q1, the median, Q3, and the maximum are also known as the 5-Number Summary and are considered a great way to describe any set of numerical (quantitative) data. Here's another measure that is important: the Interquartile Range (or IQR), which is found by subtracting Q1 from Q3. The IQR gives the span of the middle 50% of the data set. This comes in handy when we're trying to describe a set of numbers that has extreme values (outliers).
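As a quick check, here is a small Python sketch that starts from the 5-Number Summary of the 12 scores above (52, 62, 70, 82, 93, taken as given from the text) and works out the IQR and the width of each of the four regions. The variable names are my own.

```python
# 5-Number Summary of the 12 scores, taken from the example above.
summary = {"min": 52, "Q1": 62, "median": 70, "Q3": 82, "max": 93}

# The IQR is Q3 minus Q1.
iqr = summary["Q3"] - summary["Q1"]

# Width of each region between neighboring summary numbers.
points = list(summary.values())
widths = [high - low for low, high in zip(points, points[1:])]

print(iqr)     # 20
print(widths)  # [10, 8, 12, 11]
```

Each of the four regions holds a quarter of the scores, yet the widths range from 8 to 12 units.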

Moral of the Story: In a boxplot, each of the 4 regions bounded by the 5 numbers contains 25% of the numbers in the data set. This does not mean that each region has the same width.

Self-Test: The median represents what percentile?

Answer: Because the median marks the point at which half the numbers are below and half are above, the median represents the 50th percentile.

Wednesday, July 27, 2011

The Center: Mean, Median, and Mode

Let's begin with what might be familiar territory: how people describe a list of numbers (data!) using a single central measure. There are three of these numbers -- mean, median, and mode -- and which one is best heavily depends on the data you're describing. Let's go over how to determine each of these measures, and discuss the pros and cons of each.

Consider the following very short list of test grades one of my students received last semester: 74, 84, 88, 71, and 88.

  • To find the mean (also known as the average), just add up all of the numbers and divide by how many numbers are in the set. That is: (74+84+88+71+88) divided by 5 = 405 / 5 = 81. The symbol we'll use for the mean is

         The mean is good to use when there are no extreme values (numbers that lie far
         outside the span of the other numbers). There are none in this set of numbers, so
         we say there are no outliers and the mean is just fine to use.
  • To find the median (also known as the midpoint), arrange the grades in numerical order: 71, 74, 84, 88, 88. Note that we list duplicates as many times as they occur. Once the numbers are arranged, find the number that's in the middle of the list. In this short list, it's 84.
         Now, what if there is no middle number? For example, suppose we add a sixth
         score: 94. We now have the list, in order: 71, 74, 84, 88, 88, 94. When we look
         for the middle, there isn't a single score but two: 84 and 88. In this case, find the
         average of these two numbers: (84+88)/2 = 86.

         The median is good to use almost anytime, but is especially important to use when
         there are outliers. To see why the median is more accurate than the mean as a
         central measure when there are outliers, think about the salaries of a very small
         company, from the line workers to the CEO:

                 Line worker 1:       $  28,000
                 Line worker 2:       $  32,500
                 Line worker 3:       $  33,100
                 Supervisor:           $  45,000
                 Marketing person: $  62,300
                 Sales person:        $  70,000
                 CEO:                    $175,000

         Compare the mean ($63,700) to the median ($45,000). The mean isn't realistic
         because 5 out of 7 of the workers are making less! This is because the outlier,
         $175,000, inflates the calculation of the mean. On the other hand, the median is
         much more reflective of the central salary: 3 employees make more, 3 make less.

  • Finding the mode is easy, but it exists only if there are duplicates in the list. Simply find the number that occurs the most. In our original list of 5 test scores, 88 occurs twice, so the mode is 88. In our salary example directly above, there is no mode because no salary occurs more than once. If one number occurs twice and another number occurs four times, the latter is the mode because it occurs the most.
        The mode is probably the least useful as a measure of center. The only time it 
        makes sense is if there are many occurrences of the same number, compared to   
        the number of other values. Example: 57, 66, 75, 75, 75, 75, 75, 75, 82.
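If you want to check these measures yourself, here is a minimal sketch using Python's standard `statistics` module. The helper `modes` is my own: it returns an empty list when no value repeats, to match the "no mode" convention used in this post (the library's own mode functions behave differently in that case).

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Most frequent value(s); an empty list if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [v for v, c in counts.items() if c == top]


grades = [74, 84, 88, 71, 88]
salaries = [28000, 32500, 33100, 45000, 62300, 70000, 175000]

print(median(grades))                    # 84
print(median([71, 74, 84, 88, 88, 94]))  # 86.0 (average of 84 and 88)
print(modes(grades))                     # [88]

print(mean(salaries))                    # 63700
print(median(salaries))                  # 45000
print(modes(salaries))                   # [] (no salary repeats)
```

The salary lines show the effect of the outlier: the mean is pulled up toward the CEO's pay, while the median stays with the supervisor in the middle of the list.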

Moral of the Story: When you see the terms "mean" and "median" used in articles, do not assume that the writer is always using the right term. If you see the data, you can check this. Otherwise, the author might be confusing one measure for the other. Not everyone understands the difference, but now (hopefully) you do!

Self quiz: What are the measures of center (mean, median, and mode) for the following list of 8 student scores? 43, 77, 66, 73, 85, 75, 92, 81.

(Answer): Mean = 74; Median = 76; Mode = none. Which is the better measure? Technically, the median would be better because the low score of 43 is dragging the mean down a bit. However, it's pretty much a wash since the mean and median are so close to each other. 

Tuesday, July 26, 2011

What is Statistics?

In one sense, statistics is a mindset -- a way of looking at things that occur in the world. These "things" are usually called data or variables. Data can be numeric in ways that you can measure and label, like miles, dollars, feet, etc.; these are called quantitative variables. The other type of data is called categorical in that it can be divided into categories, like eye colors, levels of education, income ranges, etc.

Besides being a mindset, statistics is a set of methods for dealing with data. For example, we might have a list of stock prices for our favorite stock over the past month. We can find the average stock price, the maximum, the minimum, or the price that's smack in the middle (which is called the median). Don't worry about the meanings of all these; in time we will work with them all. Right now the important thing to know is that statistics consists of a Mindset and Methods.

Sometimes you have all the data in front of you and you want to analyze it, much like the stock prices above. This branch of statistics is called Descriptive Statistics. Other times, you only have some of the data and you want to draw reliable conclusions about all of the data. This branch of statistics is called Inferential Statistics. An example of inference is the poll that tracks how many people watch various TV shows each week. Polling companies contact a subset of TV viewers using valid statistical methods that allow them to state with some confidence what TV shows we all are watching. You'd be surprised at how few viewers are needed (compared to all the viewers in the US)!

I am going to keep my posts short and sweet. My purpose, at the least, is to teach you a little bit about statistics in each post. My hopes are that I will demystify statistics for you and convince you that statistics doesn't have to be difficult. I will make this as fun and interactive as I can with real-life examples and self-quizzes. I look forward to working with you and getting your feedback to guide future posts!

Self Test: For now, select the word at the top of the page that best describes your feeling about statistics.