tag:blogger.com,1999:blog-4642760359794495952017-10-04T04:19:24.715-07:00Statistics Without TearsWhat is Statistics? Why is it one of the most misunderstood, misused, and underappreciated mathematics fields? Read on to learn basic statistics and see it in action!Judy Cramernoreply@blogger.comBlogger17125tag:blogger.com,1999:blog-464276035979449595.post-40624088544053666632011-10-22T10:53:00.000-07:002011-10-22T10:53:43.052-07:00Judging the Fit of an LSRL<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">We have said in a previous post that the LSRL represents the best possible fitting line for the data association. But an LSRL, while best possible, might still not be very good. How do we tell?</span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Continuing with our Smoking Rate versus Lung Cancer Rate example that we have been discussing, we found that the correlation was 0.94 (excellent) and the LSRL equation was:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-YFhGTl6MNqc/TmUcCXLmtpI/AAAAAAAAAD8/zJuM9dDyJ-M/s1600/yHat.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="32" src="http://3.bp.blogspot.com/-YFhGTl6MNqc/TmUcCXLmtpI/AAAAAAAAAD8/zJuM9dDyJ-M/s320/yHat.jpg" width="320" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">To judge the quality of our LSRL, we need to check two things: the residuals and R-squared. Let's take them one at a time.</span><br /><br /><span style="color: blue; font-family: Arial;"><strong><u>Residuals</u></strong></span><br /><br /><span style="color: blue; font-family: Arial;">A residual represents, for each x-value, how different the actual y-value is from the predicted value, y-hat. In other words, a residual is the distance between an actual y-value for a given x and the predicted y-value from the regression equation. 
In the graph below, I have identified two residuals, in orange. There is one residual for <u>each</u> x-value in the dataset; I just chose these two to keep it simple. Math-wise, a residual equals the actual y-value minus y-hat, the predicted y-value from the LSRL equation.</span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-_Krzia9usvA/TmUfccV4hEI/AAAAAAAAAEA/fj5RuIwtSqA/s1600/ResidPic.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="425" src="http://2.bp.blogspot.com/-_Krzia9usvA/TmUfccV4hEI/AAAAAAAAAEA/fj5RuIwtSqA/s640/ResidPic.jpg" width="640" /></a></div><br /><span style="color: blue; font-family: Arial;">We find the residuals by plugging each x-value (smoking rates in our case) into the regression equation. That gives us a y-hat for each x. Then we subtract the y-hat value from the original y-value to get the residual. Here is a table of the residual calculations.</span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-CZcyb7ljB0M/TmUf_ABKSvI/AAAAAAAAAEE/2XHaZVls_rg/s1600/ResidTable.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="223" src="http://4.bp.blogspot.com/-CZcyb7ljB0M/TmUf_ABKSvI/AAAAAAAAAEE/2XHaZVls_rg/s320/ResidTable.jpg" width="320" /></a></div><br /><span style="color: blue; font-family: Arial;">Now that we have our residuals, we graph them against the original x-values in a scatterplot. If a regression is good quality, this scatterplot will have no pattern and look completely random. 
Here is our residual plot:</span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-6rul-joHsRI/TmUgYv8mHbI/AAAAAAAAAEI/lwLpbLyQ0Ow/s1600/ResidPlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="194" src="http://1.bp.blogspot.com/-6rul-joHsRI/TmUgYv8mHbI/AAAAAAAAAEI/lwLpbLyQ0Ow/s320/ResidPlot.jpg" width="320" /></a></div><br /><span style="color: blue; font-family: Arial;">While there aren't a lot of points in this plot, we can say that there is no discernible pattern, so we can proceed to our other criterion: R-squared.</span><br /><br /><span style="color: blue; font-family: Arial;"><strong><u>R-Squared</u></strong></span><br /><br /><span style="color: blue; font-family: Arial;">When we square r, our correlation, we get a statistic called the Coefficient of Determination, or R-Squared. Ours is 0.94 * 0.94 = 0.8836. We read R-squared as a percent: about 88.4%. </span><br /><br /><span style="color: blue; font-family: Arial;">In reality, variables other than the smoking rate also contribute to the lung cancer rate: heredity, exposure to pollution, and the like. R-squared tells us the percent that <u>our</u> explanatory variable, smoking rate, contributes to the association. We say that smoking rate accounts for 88.4% of the variation in lung cancer rate in the linear relationship. That means that other variables, known or unknown, account for the remaining 11.6%. Not bad at all.</span><br /><br /><span style="color: blue; font-family: Arial;">Because our residuals plot looks random and our R-squared value is high, we can say our regression is of high quality and feel confident using the LSRL equation to predict lung cancer rates from smoking rates, <em>within the boundaries of our data</em>. 
That means all is well if we're dealing with smoking rates between 19.7 and 23.3 people per 100,000.</span><br /><br /><span style="color: blue; font-family: Arial;">What happens if we get a pattern in our residuals plot or a low R-Squared value? First, let me point out that you can get one without the other, or you can get both. In either case, your data isn't as linear as it appears and it might not be appropriate to use linear regression unless you can restate your data in a way that makes it more linear. That will be the subject of a very-near future post.</span>Judy Cramernoreply@blogger.com1tag:blogger.com,1999:blog-464276035979449595.post-57126280554430201832011-10-15T11:32:00.000-07:002011-10-15T11:32:19.014-07:00Correlation and Linear Regression<span style="font-family: Arial, Helvetica, sans-serif;">When two quantitative variables have a linear relationship, we classify the association as strong, moderate, or weak. However, we can also <u>quantify</u> the strength and direction of a linear association between two quantitative variables. To do this, we find the <strong>correlation</strong>.</span><br /><br /><span style="font-family: Arial;">We'll continue with our example from the previous post: smoking rates versus lung cancer rates from 1999 through 2007. 
Below are the data:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-6Z70CNMwLrg/TmUFNgv26oI/AAAAAAAAADc/e3cPWcAoFkk/s1600/BivariateTable.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="224" src="http://2.bp.blogspot.com/-6Z70CNMwLrg/TmUFNgv26oI/AAAAAAAAADc/e3cPWcAoFkk/s320/BivariateTable.jpg" width="320" /></a></div><span style="font-family: Arial;">We made a scatterplot and established that the path looked linear and very strongly positive:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-Os_2j30R05c/TmUFmKQRflI/AAAAAAAAADg/-jeM4HeoEW8/s1600/Scatterplot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="186" src="http://1.bp.blogspot.com/-Os_2j30R05c/TmUFmKQRflI/AAAAAAAAADg/-jeM4HeoEW8/s320/Scatterplot.jpg" width="320" /></a></div><span style="font-family: Arial;">There is, of course, a formula for finding the correlation between two variables. However, thanks to graphing calculators and other tools like MS-Excel, we can find the correlation much more easily. First, a bit more about correlation...</span><br /><br /><span style="font-family: Arial;"> -- Correlation is a number between -1 and +1. 
It is called "r."</span><br /><span style="font-family: Arial;"> -- The sign of the correlation indicates whether the association is positive (upward from left </span><br /><span style="font-family: Arial;"> to right) or negative (downward from left to right).</span><br /><span style="font-family: Arial;"> -- A correlation of -1 or +1 indicates a perfect linear association; that is, the scatterplot points could </span><br /><span style="font-family: Arial;"> be joined to form a straight line.</span><br /><span style="font-family: Arial;"> -- A correlation of 0 means that there is no <u>linear</u> association between the variables; they could </span><br /><span style="font-family: Arial;"> still be related in some curved way.</span><br /><span style="font-family: Arial;"> -- Correlation has no units associated with it -- it's simply a number.</span><br /><span style="font-family: Arial;"> -- The order of the variables doesn't matter; their correlation is the same either way.</span><br /><span style="font-family: Arial;"> -- If you add the same number to every data value, or multiply every value by the same positive </span><br /><span style="font-family: Arial;"> number, in one or both sets, the value of the correlation is unchanged.</span><br /><br /><span style="font-family: Arial;">The MS-Excel output below shows one way to find the correlation. See the highlighted line.</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-yDFnGQNlQrM/TmUI1J5YbnI/AAAAAAAAADk/ihVwyUFru_c/s1600/CorrelExcel.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://3.bp.blogspot.com/-yDFnGQNlQrM/TmUI1J5YbnI/AAAAAAAAADk/ihVwyUFru_c/s320/CorrelExcel.jpg" width="245" /></a></div><span style="color: blue; font-family: Arial;">Our correlation, 0.94 (rounded), is nearly perfect. 
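</span><br /><br /><span style="color: blue; font-family: Arial;">If you would rather see the arithmetic than a spreadsheet, here is a minimal Python sketch of the usual correlation formula. The function name pearson_r and the five data pairs are made up purely for illustration (they are <u>not</u> the CDC smoking and lung cancer numbers), so treat this as one possible way to check the properties of r listed above, not as the calculation behind our 0.94.</span><br /><br />

```python
import math


def pearson_r(xs, ys):
    """Return the correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustration values, NOT the data from the table above.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                         # order of the variables reversed
r_rescaled = pearson_r([10 * v + 3 for v in x], y)  # x multiplied by 10, then shifted by 3
r_line = pearson_r(x, [2 * v + 1 for v in x])       # points exactly on an upward line

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4), round(r_line, 4))
```

<span style="color: blue; font-family: Arial;">The first three printed values should agree, which shows that neither the order of the variables nor a change of units affects r, and the last value should be 1 because those points lie exactly on an upward-sloping line. 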
</span><br /><br /><span style="color: blue; font-family: Arial;"><u><strong>Linear Regression</strong></u></span><br /><span style="font-family: Arial;"><br /><span style="color: blue;"> When an association is linear, we can find the equation of a line that best fits the numbers in the data sets. While there are many lines that look like they fit the data points well, there is only one "best" fitting line. This line is called the <strong>least squares</strong> <strong>regression line,</strong> or <strong>LSRL</strong>. <br /></span> <span style="color: blue;">Like any line, the LSRL is in the form of y = a + bx, where "a" is the y-intercept and "b" is the slope. This line usually won't run through every point in the scatterplot, but it will be as close as possible to doing so. In fact, if we were to measure the vertical distance between each of our points and the point on the regression line that has the same x-coordinate, square each distance, and add up the squares, the sum would be smaller than for any other line. That's what we mean by "best fit," and it's where the name "least squares" comes from.<br /><br />Rather than get hung up on the "least squares" idea right now, let's remember to come back to it later. Right now, let's find the LSRL equation for our smoking/lung cancer data. First, assuming your scatterplot is in MS-Excel, right-click on any point in your graph and select "Add Trendline" from the drop-down menu. 
You will see the following sub-menu.</span> </span><br /><span style="font-family: Arial;"></span><br /><span style="font-family: Arial;"></span><br /><span style="font-family: Arial;"></span><br /><span style="font-family: Arial;"><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Z2R0wOX0Dj0/TmUOJ2ZLtLI/AAAAAAAAADo/AjVxxrwe9AM/s1600/AddTrendline.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="189" src="http://2.bp.blogspot.com/-Z2R0wOX0Dj0/TmUOJ2ZLtLI/AAAAAAAAADo/AjVxxrwe9AM/s320/AddTrendline.jpg" width="320" /></a></div><br /><span style="color: blue;">You will see the following sub-menu. Be sure you select "Linear" and "Display equation":</span></span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-DwzFbbubG4Y/TmUOrvb2S1I/AAAAAAAAADs/VI8Ges_tBRo/s1600/TrendlineOptions.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://1.bp.blogspot.com/-DwzFbbubG4Y/TmUOrvb2S1I/AAAAAAAAADs/VI8Ges_tBRo/s320/TrendlineOptions.jpg" width="242" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">When you close the above window, your trendline and equation will appear:</span></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-RSqrbFPtTNE/TmUPkt-9kzI/AAAAAAAAADw/ot2sW6tKfJs/s1600/Trendline.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="185" src="http://1.bp.blogspot.com/-RSqrbFPtTNE/TmUPkt-9kzI/AAAAAAAAADw/ot2sW6tKfJs/s320/Trendline.jpg" width="320" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">The LSRL equation, in y = a 
+ bx format and with x and y replaced by their meanings, is:</span><br /><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-hr0vuzA3rwQ/TmUTKECgr2I/AAAAAAAAAD0/DUV78rPJhkk/s1600/yHat.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="32" src="http://1.bp.blogspot.com/-hr0vuzA3rwQ/TmUTKECgr2I/AAAAAAAAAD0/DUV78rPJhkk/s320/yHat.jpg" width="320" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">The inverted "V" symbol over the response variable, LungCancerRate, is called a "hat" and means that the value is a prediction. That is, the regression line equation is a way of predicting a lung cancer rate from a given smoking rate. In fact, this is exactly why we're interested in regression equations -- to help us predict new values based on this best fitting line equation. But predicting comes with cautions. Because the data that produced this equation spans from 1999 to 2007, it is not prudent, nor is it reliable, to predict before or beyond these years -- only in between them. We cannot predict the future based on the past!</span><br /><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial;">(<u>Self-Test</u>): Suppose that in your state, the smoking rate during one of these years was 23.75 per 100,000 people. 
What would you predict the lung cancer rate to be?</span></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial;">(<u>Answer</u>): You would plug 23.75, your x-value, into the LSRL equation and solve for the y-value...</span></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial;">y-hat = 22.042 + 3.0566*(23.75)</span></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial;"> = 22.042 + 72.594</span></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial;"> = 94.636 </span></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial;">The lung cancer rate is predicted to be 94.636 per 100,000 people.</span></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"><strong><u>Interpreting the Y-intercept and Slope</u></strong></span></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">It is often useful to interpret the slope and y-intercept in the context of the situation. For instance, the y-intercept is the value produced when x is zero. In our context, x stands for the smoking rate. 
So, we can say that when the smoking rate is zero -- that is, when nobody in the population smokes -- the predicted lung cancer incidence is 22.042 per 100,000 <em>during these years</em>. Keep in mind that a smoking rate of zero lies far outside our data, so this number is more a feature of the line than a trustworthy prediction.</span></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Now for the slope. Remember that the slope is a ratio capturing the change in the y-variable versus the change in x. In our example, that would be the change in lung cancer rates as smoking rates change. The slope in our LSRL equation is 3.0566, which we can think of as 3.0566 / 1. The change in lung cancer rates is the numerator; the change in smoking rate is the denominator. Interpreting this in the context of our situation, we can say that for each increase of 1 per 100,000 in the smoking rate, the predicted lung cancer rate increases by 3.0566 per 100,000 during those years</span>.</div>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-43659963499497776952011-09-18T08:17:00.000-07:002011-09-18T08:17:30.787-07:00Associations Between Two Variables<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Up till now, we've been analyzing one set of quantitative data values, discussing center, shape, spread, and their measures. Sometimes two different sets of quantitative data values have something to do with each other, or so we suspect. This post (and a few of those that follow) deals with how to analyze two sets of data that appear to be related to each other.</span><br /><br /><span style="font-family: Arial, Helvetica, sans-serif;"><span style="color: blue;">We all know now that smoking <u>causes</u> lung cancer. This fact took years and lots of statistical work to establish. 
The first step was to see if there was a <u>connection</u> between smoking and lung cancer. This was done by charting how many people smoked and how many people got lung cancer. This is where we'll start, only we'll do it with more current data for purposes of illustration. Below is a table showing the incidence rates of smoking and lung cancer per 100,000 people, as tracked by the Centers for Disease Control and Prevention, from 1999 - 2007</span>.</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-MVCKbhiv_CQ/TmT8eqiKhbI/AAAAAAAAADM/45TttUKw4Us/s1600/BivariateTable.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="224" src="http://3.bp.blogspot.com/-MVCKbhiv_CQ/TmT8eqiKhbI/AAAAAAAAADM/45TttUKw4Us/s320/BivariateTable.jpg" width="320" /></a></div><span style="color: blue; font-family: Arial;">Because smoking explains lung cancer incidence and not the other way around, we call the smoking rate variable the <strong>explanatory variable</strong>. The lung cancer rate variable is called the <strong>response variable</strong>. </span><br /><br /><span style="color: blue; font-family: Arial;">Looking over this table, we can see that smoking rates -- and lung cancer rates -- have both decreased during these 9 years. To illustrate how we deal with two variables at a time, let's continue with this data. The first thing to do is to plot the data on an xy-grid, one point per year. For 1999, for example, we plot the point (23.3, 93.5). 
Continuing with all the points, we get a <strong>scatterplot</strong>:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-7YqvLMJyQpg/TmT-ebCrl8I/AAAAAAAAADU/-YZnYAvUUSk/s1600/Scatterplot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="186" src="http://2.bp.blogspot.com/-7YqvLMJyQpg/TmT-ebCrl8I/AAAAAAAAADU/-YZnYAvUUSk/s320/Scatterplot.jpg" width="320" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Note that the explanatory variable is on the x-axis, while the response variable is on the y-axis.</span><br /><br /><span style="color: blue; font-family: Arial;">When you view a scatterplot, you are looking for a straight-line, or <strong>linear</strong>, trend. What I do is sketch the narrowest possible oval around the dots. The narrower the oval, the stronger the linear relationship between the variables. As you can see below, the relationship between smoking and lung cancer is quite strong because the oval is skinny:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-s_tCujdPWpg/TnYK9NRUSoI/AAAAAAAAAEM/uCcSSlJBZHE/s1600/ScatterOval.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="184" src="http://1.bp.blogspot.com/-s_tCujdPWpg/TnYK9NRUSoI/AAAAAAAAAEM/uCcSSlJBZHE/s320/ScatterOval.jpg" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><span style="color: blue; font-family: Arial;">If our oval had looked more like a circle, we would conclude that there is no relationship. A fatter oval that has a discernable upward or downward direction would indicate a weak association.</span><br /><br /><span style="color: blue; font-family: Arial;">This upward-reaching oval also tells us that the association is positive; that is, that as one variable increases, so does the other one. 
A downward-reaching oval would indicate a negative association, which means that as one variable increases, the other decreases.</span><br /><br /><span style="color: blue; font-family: Arial;">When describing an association between two quantitative variables, we address <strong>form</strong>, <strong>strength</strong>, and <strong>direction</strong>. In our example, the form is <u>linear</u>, the strength is <u>strong</u>, and the direction is <u>positive</u>. So we would say, "The association between smoking rates and lung cancer rates between 1999 and 2007 is strong, positive, and linear."</span><br /><br /><span style="color: blue; font-family: Arial;">Just so you know, there are other forms of association between two quantitative variables: quadratic (U-shaped), exponential, logarithmic, etc. But we limit ourselves for now to linear associations.</span><br /><br /><span style="color: blue; font-family: Arial;">Another *really* important thing to remember is that just because there is a linear association between two quantitative variables, it doesn't necessarily mean that one variable <u>causes</u> the other. We know that in the case of smoking and lung cancer, there <u>is</u> a cause-effect relationship, but this was established by years of additional, carefully designed studies. Seeing the linear association was only the <u>catalyst</u> for further study...it wasn't the culmination!</span><br /><br /><span style="color: blue; font-family: Arial;">In the next post, we'll continue with this same example to develop some of the finer points of analyzing associations between two quantitative variables. For now, here is a good term to know: Another way to say the end of that previous sentence is "...analyzing associations for <strong>bivariate</strong> data." 
Bivariate simply means "two variables."</span>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-38228747580709976922011-09-05T07:08:00.000-07:002011-09-05T07:08:59.404-07:00The Standard Normal Model: Standardizing Scores<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">For the purposes of this post, we will refer to the data values in a data set as "scores." </span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">In the last post, we used an example N(14, 2) to illustrate the 68-95-99.7 Rule, which describes the percents of scores lying within 1, 2, and 3 standard deviations of the mean. We can generalize the diagram we used to represent N(0, 1), where 0 is the mean and 1 is the standard deviation. This makes the model easier to apply, because the units we're most accustomed to seeing -- -1, 0, 1, 2, and so on -- appear as standard deviation units. Take a look:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-LwmpqIyUGs8/TlkqGdXptxI/AAAAAAAAACs/Nwajv2nlOqo/s1600/StdNormal.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://4.bp.blogspot.com/-LwmpqIyUGs8/TlkqGdXptxI/AAAAAAAAACs/Nwajv2nlOqo/s320/StdNormal.jpg" width="287" /></a></div><span style="color: blue; font-family: Arial;">We call N(0, 1) the <strong><span style="font-size: large;">Standard Normal model</span>. </strong>We now can use the number line to locate points that are any number of standard deviations from the mean...even fractional numbers.</span><br /><br /><span style="color: blue; font-family: Arial;">In any Normal model, we're going to want to see how many standard deviations a particular score in the data set is from its mean. 
We can do this for any score by converting a "raw" score (a score from our data) to a "standardized" score (a score from the Standard Normal model). How do we do this? There's a formula, and it's really an easy one:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-CgxQ8rjf4Qw/TlkwHJ04pgI/AAAAAAAAACw/npC40VY0vW4/s1600/z-score.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-CgxQ8rjf4Qw/TlkwHJ04pgI/AAAAAAAAACw/npC40VY0vW4/s1600/z-score.jpg" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">...where </span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> <span style="color: black; font-size: large;">X</span> stands for the score you're trying to convert,</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-EETGpnh7FRg/Tlk2SbnV95I/AAAAAAAAAC4/gPnS__nSHns/s1600/x-bar.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-EETGpnh7FRg/Tlk2SbnV95I/AAAAAAAAAC4/gPnS__nSHns/s1600/x-bar.jpg" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> stands for the mean, and </span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-6vsyCY6_sN8/Tlk25Bk9x9I/AAAAAAAAAC8/3sd1luKSDII/s1600/Sx.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-6vsyCY6_sN8/Tlk25Bk9x9I/AAAAAAAAAC8/3sd1luKSDII/s1600/Sx.jpg" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">stands for the standard deviation.</span><br /><br /><span style="color: blue; font-family: Arial;">Suppose we're back in the Normal model N(14, 2) and we want to see how many standard deviations 
a score of 15 is from the mean. We would subtract our mean, 14, from 15, then divide by the standard deviation, 2. That is: (15 - 14) / 2 = 0.5. This means that our score of 15 is 0.5 standard deviations above the mean. 0.5 is the score on the Standard Normal model that represents our score from N(14, 2). We call 0.5 our standardized score, also known as a z-score. Z-scores tell us how many standard deviations a given "raw" score is from the mean.</span><br /><br /><span style="color: blue; font-family: Arial;">(<u>Self-Test</u>): In the Normal model N(50, 4), standardize a score of 55.</span><br /><span style="color: blue; font-family: Arial;">(<u>Answer</u>): Find the z-score using the formula: </span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-CgxQ8rjf4Qw/TlkwHJ04pgI/AAAAAAAAACw/npC40VY0vW4/s1600/z-score.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-CgxQ8rjf4Qw/TlkwHJ04pgI/AAAAAAAAACw/npC40VY0vW4/s1600/z-score.jpg" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">z = (55 - 50) / 4 = 5 / 4 = 1.25. The score is 1.25 standard deviations above the mean.</span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">(<u>Self-Test</u>): In the Normal model N(50, 4), find the z-score for 42.</span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">(<u>Answer</u>): Z-scores can be negative, too. In this problem, 42 is less than the mean. So it will lie to the left of 50, and its z-score will be negative. z = (42 - 50) / 4 = -8 / 4 = -2. The score is 2 standard deviations below the mean.</span><br /><br /><span style="color: blue; font-family: Arial;">Why would we want to standardize our scores? There are actually two reasons.</span><br /><br /><span style="color: blue; font-family: Arial;"><u>1. It can help us see how unusual a score might be. 
</u></span><br /><br /><span style="color: blue; font-family: Arial;">How? Well, the percents we have been talking about can also be thought of as probabilities. For example, the probability that a score is greater than the mean is 50%, the same as the probability that a score is less than the mean. In the last post, we marked off regions and computed percents. In statistics, we consider any z-score of 3 or more, or -3 or less, as unusual, because (as we saw in the last post) only 0.15% of scores are in each of those regions. In other words, the probability of seeing a score in one of these regions is at most 0.15%, less than even a quarter-percent. That's unusual.</span><br /><br /><span style="color: blue; font-family: Arial;"><u>2. It can allow us to compare apples to oranges.</u></span><br /><br /><span style="color: blue; font-family: Arial;">How? Well, suppose you have just gotten back 2 tests you took: one in algebra and one in earth science. Suppose further that each set of scores follows its own Normal model. The algebra test's scores follow N(80, 5) and the earth science test's scores follow N(85, 8). Now imagine that you got a 90 on the algebra test and a 93 on the earth science test. You can easily see that percentage-wise, your score on the earth science test is higher than your algebra score. But relative to the distributions, on which test did you perform better?</span><br /><br /><span style="color: blue; font-family: Arial;">To find out, figure out the z-score for each test score. For your algebra test, your z-score is (90 - 80) / 5 = 10 / 5 = 2 (2 standard deviations above the mean). For your earth science test, your z-score is (93 - 85) / 8 = 8 / 8 = 1 (1 standard deviation above the mean). Relatively speaking, your performance was better on the algebra test; that is, your score was more exceptional. Think of the probabilities. 
The probability of a score that's 2 standard deviations or more above the mean is about 2.5%, whereas the probability of a score that's 1 standard deviation or more above the mean is about 16%. Get the idea?</span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">(<u>Self-Test</u>): Suppose Tom's algebra test score was 86 and his earth science test score was 86. In which test did Tom perform better, given that the test scores follow the Normal models we used above?</span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">(<u>Answer</u>): Tom's z-score on the algebra test was z = (86 - 80) / 5 = 6/5 = 1.20. His z-score on the earth science test was z = (86 - 85) / 8 = 1/8 = 0.125. Because Tom's z-score on his algebra test (1.20) is higher than his earth science test z-score (0.125), his algebra performance was better than his earth science performance.</span><br /><br /><span style="color: blue; font-family: Arial;">One more thing...let's use the z-score formula to go backwards: to convert a standardized score back to a raw score. Suppose Nancy earned a score on the algebra test that was 1.6 standard deviations below the mean. (In other words, her z-score was -1.6.) What actual percentage score would that represent for her, assuming N(80, 5)? In this case, we would start with the z-score formula and fill in what we know. Then, using algebra (!) 
we would solve for the "raw" score...</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-BbrZ2C3BcHw/Tlk1HFiSUGI/AAAAAAAAAC0/WYgrF-GeH9w/s1600/unstandardize.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-BbrZ2C3BcHw/Tlk1HFiSUGI/AAAAAAAAAC0/WYgrF-GeH9w/s1600/unstandardize.jpg" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Nancy's algebra test score was 72.</span><br /><span style="color: blue; font-family: Arial;"></span><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">In the next post, we'll expand on the probability side of the Standard Normal model.</span>Judy Cramernoreply@blogger.com4tag:blogger.com,1999:blog-464276035979449595.post-69415951963293554612011-08-31T15:07:00.000-07:002011-08-31T15:07:34.715-07:00More About the Normal Distribution: the 68-95-99.7 Rule<span style="font-family: Arial, Helvetica, sans-serif;"><span style="color: blue;">In the last post, we covered the Normal distribution as it relates to the standard deviation. We said that the Normal distribution is really a family of unimodal, symmetric distributions that differ only by their means and standard deviations. Now is a good time to introduce a new term:</span> <span style="color: blue;"><strong><span style="font-size: large;">Parameter</span></strong>. When dealing with perfect-world models like the Normal model, their major measures -- in this case, their mean and standard deviation -- are called parameters. 
The mean is denoted by the Greek letter "mu" (pronounced "mew"), <span style="font-family: "Calibri","sans-serif"; font-size: 16pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;">µ, <span style="font-family: Arial, Helvetica, sans-serif; font-size: small;">and the standard deviation is denoted by the Greek letter sigma, </span>σ<span style="font-family: Arial, Helvetica, sans-serif; font-size: small;">. We can refer to a particular Normal model by identifying <span style="font-family: "Calibri","sans-serif"; font-size: 16pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;">µ <span style="font-family: Arial; font-size: small;">and </span>σ <span style="font-family: Arial, Helvetica, sans-serif; font-size: small;">and using the letter N, for "Normal": <span style="font-family: "Calibri","sans-serif"; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><span style="font-size: large;">N(µ, σ). 
</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: small;">For example, if I want to describe a Normal model whose mean is 14 and whose standard deviation is 2, I use the notation N(14, 2).</span></span></span></span></span></span></span></span><br /><br /><span style="font-family: "Calibri","sans-serif"; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><span style="font-size: small;"><span style="font-family: "Calibri","sans-serif"; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><span style="font-family: Arial, Helvetica, sans-serif; font-size: small;"><span style="font-family: "Calibri","sans-serif"; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><span style="color: blue;"><span style="font-family: Arial, Helvetica, sans-serif;">Every Normal model has some pretty interesting properties, which we will now cover. Take the above model N(14,2).</span> <span style="font-family: Arial, Helvetica, sans-serif;">We'll draw it on a number line centered at 14, with units of 2 marked off in either direction. Each unit of 2 is one standard deviation in length. 
It would look like this...</span></span></span></span></span></span></span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-Re1ZKuB_ubc/TlkP1pTp-vI/AAAAAAAAACQ/wa9_oQkYlyA/s1600/Normal2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-Re1ZKuB_ubc/TlkP1pTp-vI/AAAAAAAAACQ/wa9_oQkYlyA/s1600/Normal2.jpg" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Now let's mark off the area that's between 12 and 16; that is, the area that's within one standard deviation of the mean. In a Normal model, this region will contain 68% of the data values:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-xHdAcmKOM9M/TlkRZR2ERBI/AAAAAAAAACU/V7bDfgKFdHQ/s1600/Normal3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-xHdAcmKOM9M/TlkRZR2ERBI/AAAAAAAAACU/V7bDfgKFdHQ/s1600/Normal3.jpg" /></a></div><span style="color: blue; font-family: Arial;">If you've ever heard of "grading on a curve," it's based on Normal models. 
Scores within one standard deviation of the mean would generally be considered in the "C" grade range.</span><br /><br /><span style="color: blue; font-family: Arial;">Now consider the region that lies within two standard deviations of the mean; that is, between 10 and 18 in this model. This area would encompass about 95% of all the data values in the data set:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-laEX3hF-guI/TlkTpNoNR6I/AAAAAAAAACc/zxxd5y1_IvY/s1600/Normal4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-laEX3hF-guI/TlkTpNoNR6I/AAAAAAAAACc/zxxd5y1_IvY/s1600/Normal4.jpg" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><span style="color: blue; font-family: Arial;">From a grading curve standpoint, 95% of the values would be Bs, Cs, or Ds. Finally, if you mark off the area within 3 standard deviations of the mean, this region will contain about 99.7% of the values in the data set. These extremities would be the As and Fs in our grading curve interpretation.</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-5neST4SnclU/TlkWFY00P2I/AAAAAAAAACg/9OOLOQAPLZM/s1600/normal5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-5neST4SnclU/TlkWFY00P2I/AAAAAAAAACg/9OOLOQAPLZM/s1600/normal5.jpg" /></a></div><span style="color: blue; font-family: Arial;">In statistics, this percentage breakdown is called the "Empirical Rule," or the "68-95-99.7 Rule."</span><br /><br /><span style="color: blue; font-family: Arial;">What about the regions below 8 and beyond 20? These areas account for the remaining 0.3% of the data values. 
That would make 0.15% on each side.</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-yrfzWSC9mew/TlkYc3cD0dI/AAAAAAAAACk/m2pxnFuv4aM/s1600/Normal6.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://4.bp.blogspot.com/-yrfzWSC9mew/TlkYc3cD0dI/AAAAAAAAACk/m2pxnFuv4aM/s320/Normal6.jpg" width="308" /></a></div><span style="color: blue; font-family: Arial;">(<u>Self-Test</u>): What percent of data values lie between 12 and 14 in N(14,2) above?</span><br /><span style="color: blue; font-family: Arial;">(<u>Answer</u>): If 68% represents the full area between 12 and 16, and given that the Normal model is perfectly symmetric, there must be half of 68%, or 34%, between 12 and 14. Likewise, there would be 34% between 14 and 16.</span><br /><br /><span style="font-family: Arial;"><span style="color: blue;"><u>(Self-Test)</u>: What percent of data values lie between 16 and 18 in N(14,2) above?</span></span><br /><span style="font-family: Arial;"><span style="color: blue;"><u>(Answer)</u>: Subtract 68% from 95% to get 27%. This represents the percent of data values between 10 and 12 and between 16 and 18 combined. 
Divide by 2 and you get 13.5% in each of these regions.</span></span><br /><br /><span style="color: blue; font-family: Arial;">If you do similar operations, the various areas break down like the following:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-ED_8mpJ-z2M/Tlkl3FiLeJI/AAAAAAAAACo/dDfhioiL2aw/s1600/normal7.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://1.bp.blogspot.com/-ED_8mpJ-z2M/Tlkl3FiLeJI/AAAAAAAAACo/dDfhioiL2aw/s320/normal7.jpg" width="299" /></a></div><span style="color: blue; font-family: Arial;">(<u>Self-Test</u>): In any Normal model, what percent of data values are more than 2 standard deviations above the mean?</span><br /><span style="color: blue; font-family: Arial;">(<u>Answer</u>): Using the above model, the question is asking for the percent of data values that are more than 18. You would add the 2.35% and the 0.15% to get the answer: 2.50%.</span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">This is all well and good if you're looking at areas that involve a whole number of standard deviations, but what about all the in-between numbers? For example, what if you wanted to know what percent of data values are within 1.5 standard deviations of the mean? This will be the subject of a future post.</span><br /><br /><br /><br /><br /><br />Judy Cramernoreply@blogger.com3tag:blogger.com,1999:blog-464276035979449595.post-72274333576499853222011-08-27T07:37:00.000-07:002011-08-27T07:37:53.670-07:00More About Standard Deviation: The Normal Distribution<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">I haven't said a lot about Standard Deviation up to now: not much more than the fact that it's a measure of <u>spread</u> for a symmetric dataset. 
However, we can also think of the standard deviation as a unit of measure of relative distance from the mean in a unimodal, symmetric distribution. </span><br /><br /><span style="color: blue; font-family: Arial;">Suppose you had a perfectly symmetric, unimodal distribution. It would look like the well-known bell curve. Of course, in the real world, nothing is perfect. But in statistics, we talk about ideal distributions, known as "models." Real-life datasets can only approximate the ideal model...but we can apply many of the traits of models to them. </span><br /><br /><span style="color: blue; font-family: Arial;">So let's talk about a perfectly symmetric, bell-shaped distribution for a bit. We call this model a Normal distribution, or Normal model. Because we're dealing with perfection, the mean and median are at the same point. In fact, there are an infinite number of Normal distributions with a particular mean. They only differ in width. Below are some examples of Normal models.</span><a href="http://2.bp.blogspot.com/-YxchPWzwzgw/TlgTPbvyRNI/AAAAAAAAACA/9z2WbD5Dvf8/s1600/Normal.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="187" src="http://2.bp.blogspot.com/-YxchPWzwzgw/TlgTPbvyRNI/AAAAAAAAACA/9z2WbD5Dvf8/s640/Normal.jpg" width="640" /></a><br /><span style="color: blue; font-family: Arial;">Notice that their widths differ. Another word for "width" is "spread"...which brings us back to Standard Deviation! Take a look at the curves above. In the center section, the shape looks like an upside down bowl, whereas the outer "legs" look like part of a right-side-up bowl. Now imagine the point at which the right-side-up parts meet the upside-down part. Look below for the two blue dots in the diagram. (P.S. 
They are called "points of inflection," in case you were wondering.)</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-L2R8uMAKuP8/TlgYHUenG5I/AAAAAAAAACE/78uHR02x_OY/s1600/Normal1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-L2R8uMAKuP8/TlgYHUenG5I/AAAAAAAAACE/78uHR02x_OY/s1600/Normal1.jpg" /></a></div><span style="color: blue; font-family: Arial;">As shown above, if a line is drawn down the center, the distance from that line to a blue point is the length of one standard deviation. Can you see which of the earlier examples have larger standard deviations? Smaller?</span><br /><br /><span style="font-family: Arial;"><span style="color: blue;">So, there are two measures that define how a particular Normal model will look: the mean and the standard deviation.</span> </span><br /><br /><span style="color: blue; font-family: Arial;">I'd be remiss if I didn't tell you that there is a formula for numerically finding the standard deviation. Luckily, there's a lot of technology out there that automatically computes this for you. (I showed you how to do this in MS-Excel in an earlier post.) </span><br /><br /><span style="color: blue; font-family: Arial;">Suppose you have a list of "n" data values, and when you look at a histogram of these values, you see that the distribution is unimodal and roughly symmetric. Call the values x1, x2, x3, and so on. We compute the mean (average, remember?) and denote it as x with a bar over it. 
Then the formula is:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-rAXlX4y6PGA/Tlj7zSWPLtI/AAAAAAAAACM/a6IqeKT7nj4/s1600/StdevFormula.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-rAXlX4y6PGA/Tlj7zSWPLtI/AAAAAAAAACM/a6IqeKT7nj4/s1600/StdevFormula.jpg" /></a></div><div class="separator" style="clear: both; text-align: center;"><span style="color: blue;"></span></div><span style="font-family: Arial;"><span style="color: blue;">What does the <span style="font-family: Arial, Helvetica, sans-serif;">∑</span><span style="font-family: Times New Roman;"> </span></span><span style="font-family: Arial, Helvetica, sans-serif;"><span style="color: blue;">mean? Let's take the formula apart. First, you are finding the difference between each data value and the mean of the whole dataset. You're squaring it to make sure you're dealing only with positive values. The <span style="font-family: Arial, Helvetica, sans-serif;">∑ means you should add up all those positive squared answers, one for each value in your dataset. Once you have the sum, you divide by</span><span style="font-family: Times New Roman;"> </span></span></span></span><br /><span style="font-family: Arial;"><span style="font-family: Arial, Helvetica, sans-serif;"></span></span><br /><span style="font-family: Arial;"><span style="font-family: Arial, Helvetica, sans-serif;"><div class="MsoNormal" style="margin: 0in 0in 10pt; tab-stops: 142.5pt;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">(n-1), which gives you an average of all the squared differences from the mean. This measure (before taking the square root) is called the <strong><span style="font-size: large;">variance</span></strong>. When you take the square root, you have the value of the <strong><span style="font-size: large;">standard deviation</span></strong>. 
So, you see, the standard deviation is the <u>square root of the average squared difference from the mean.</u></span></div><div class="MsoNormal" style="margin: 0in 0in 10pt; tab-stops: 142.5pt;"><span style="color: blue;">Well, that's a lot to digest, so I'll continue with the properties of the Normal model in my next post.</span></div></span><span style="font-family: Times New Roman;"> </span><br /><div class="MsoNormal" style="margin: 0in 0in 10pt; tab-stops: 142.5pt;"><br /></div></span> <br /><span style="font-family: Arial;"><span style="font-family: Times New Roman;"> </span></span>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-47456899747890737502011-08-20T08:05:00.000-07:002011-08-27T11:32:09.386-07:00Analyzing Quantitative Distributions<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">This post deals only with quantitative data. </span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">When you have a quantitative dataset, it is always a good idea to look at a graphical display of it: usually a histogram or boxplot, although there are others. What you are looking to describe is the <strong><span style="font-size: large;">shape</span></strong> of the data (the subject of a recent post), the approximate <strong><span style="font-size: large;">center</span></strong> of the dataset, the <span style="font-size: large;"><strong>spread</strong> </span><span style="font-size: small;">of the values</span>, and any <strong><span style="font-size: large;">unusual features</span></strong> (such as extremely low or high values -- outliers). We'll take them one at a time.</span><br /><br /><span style="color: blue; font-family: Arial; font-size: large;"><strong><u>Shape</u></strong></span><br /><span style="color: blue; font-family: Arial;">The shape of the dataset helps us determine how to report on the other features. 
We went into detail about shape in a very recent post ("The Shape of a Quantitative Distribution"). If the display is basically symmetric, you will use the mean to describe the center and a measure called the standard deviation to describe the spread. If the display is non-symmetric, you will use the median to describe the center and the interquartile range to describe the spread. Here's a handy chart to summarize this.</span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-DkuOGVQ-7S4/TkrSyD5u1UI/AAAAAAAAAB0/vW0ibiDi3qA/s1600/SOCS+table.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="106" src="http://2.bp.blogspot.com/-DkuOGVQ-7S4/TkrSyD5u1UI/AAAAAAAAAB0/vW0ibiDi3qA/s640/SOCS+table.jpg" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><span style="color: blue; font-family: Arial; font-size: large;"><strong><u>Center</u></strong></span><br /><span style="color: blue; font-family: Arial;">In a symmetric distribution the mean and median are at approximately the same place; however, statisticians use the mean. In a non-symmetric distribution, the median is used because, by its very definition, it is not calculated using any extreme points or points that skew the calculation. </span><br /><br /><span style="color: blue; font-family: Arial; font-size: large;"><strong><u>Spread</u></strong></span><br /><span style="color: blue; font-family: Arial;">Notice that the <strong>range</strong> (which is the difference between the largest and smallest data value in the set) is not generally used to describe the spread. In a symmetric distribution we use the standard deviation, which has a complicated formula but a simple description: the <strong><span style="font-size: large;">Standard Deviation</span> </strong>is the square root of the average squared difference between each value in the dataset and the mean of the dataset. 
You can find the standard deviation using a graphing calculator or a tool like MS-Excel. The symbol for Standard Deviation is </span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-qa3wv3nSnRs/Tlk36q59-kI/AAAAAAAAADA/5SOdkD86DSs/s1600/Sx.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-qa3wv3nSnRs/Tlk36q59-kI/AAAAAAAAADA/5SOdkD86DSs/s1600/Sx.jpg" /></a></div><br /><span style="color: blue; font-family: Arial;">Standard deviations are relative. By that I mean that you can't tell just by looking at its value whether it's large or small -- it depends on the values in the dataset. Standard deviations are good for comparing spreads if you have two distributions. And, we'll soon find out that the standard deviation has an extremely important use in statistics. Here's an example for finding the mean and standard deviation using MS-Excel:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-n2mdAnff9Rw/TkrdMXLFKLI/AAAAAAAAAB4/ZIjnkdVS-_c/s1600/symmetric.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="459" src="http://1.bp.blogspot.com/-n2mdAnff9Rw/TkrdMXLFKLI/AAAAAAAAAB4/ZIjnkdVS-_c/s640/symmetric.jpg" width="640" /></a></div><br /><span style="color: blue; font-family: Arial;">In a non-symmetric distribution we use the <strong><span style="font-size: large;">Interquartile Range (IQR)</span></strong> because this tells the spread of the central 50% of the data values. Like the median, the IQR isn't influenced by outliers or skewness. 
Here's an example of finding the median and IQR using MS-Excel:</span><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-nUiZtENL_70/Tkrdw5pkwdI/AAAAAAAAAB8/5HdbRFaDzhY/s1600/nonsymmetric.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="444" src="http://3.bp.blogspot.com/-nUiZtENL_70/Tkrdw5pkwdI/AAAAAAAAAB8/5HdbRFaDzhY/s640/nonsymmetric.jpg" width="640" /></a></div><br /><span style="color: blue; font-family: Arial; font-size: large;"><strong><u>Unusual Features</u></strong></span><br /><span style="color: blue; font-family: Arial;">As mentioned above, unusual features include extreme points (if any), also known as <strong>outliers</strong>. In an earlier post, we covered how to determine the boundaries for outliers. If a data value lies outside the boundaries, we call it an outlier. If a value isn't quite an outlier but close, it's worth mentioning in a description of a distribution. Just call it an "extreme point."</span><br /><br />Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-58122473447811209982011-08-16T13:06:00.000-07:002011-08-16T13:06:32.739-07:00The Shape of a Quantitative Distribution<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">When you graph quantitative data, you can often see some kind of shape emerge. </span><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Here are some typical block structures that illustrate some possible shapes a display might take. 
We call the set of values <strong>"distributions."</strong></span><br /><div style="text-align: center;"><br /><a href="http://3.bp.blogspot.com/-4H90a38ujyE/TkWd69zKTzI/AAAAAAAAABY/a6QPuPCJA9Q/s1600/shapes.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="250" src="http://3.bp.blogspot.com/-4H90a38ujyE/TkWd69zKTzI/AAAAAAAAABY/a6QPuPCJA9Q/s320/shapes.jpg" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">The first thing you need to determine is if there is any <strong><span style="font-size: large;">symmetry</span></strong> to the graph. If you were to visualize a vertical line going down the center, does each side look like a mirror image of the other? No real-life distribution will be perfectly symmetrical, but if it's close, it's worth mentioning.</span></div><div style="text-align: left;"><br /></div><span style="color: blue; font-family: Arial;">(<u>Self-Test</u>): Which three of the above graphs look symmetric?</span><br /><span style="color: blue; font-family: Arial;">(<u>Answer</u>): B, D, and E.</span><br /><br /><span style="color: blue; font-family: Arial;">The next thing you might notice is that some graphs have peaks whereas others look pretty level. We call the peaks "<strong><span style="background-color: white;">modes</span></strong>." If there's one peak, we say the graph has a <strong><span style="font-size: large;">unimodal distribution</span></strong>. If there are two peaks, the graph has a <strong><span style="font-size: large;">bimodal distribution</span></strong>. </span><br /><br /><span style="color: blue; font-family: Arial;">(<u>Self-Test</u>): Which three of the above graphs look unimodal? 
Which is bimodal?</span><br /><span style="color: blue; font-family: Arial;">(<u>Answer</u>): A, B, and C are unimodal; D is bimodal.</span><br /><br /><span style="color: blue; font-family: Arial;">FYI, a graph that's mostly level-looking, like graph E, is called <strong><span style="font-size: large;">uniform</span></strong>.</span><br /><br /><span style="color: blue; font-family: Arial;">Now take a look at distributions A and C. Do you see that each has a "tail" at one end? When a distribution is off-center (compared to graph B), we say it is skewed. The direction of the tail is the direction of the skewness. For example, graph A is <strong><span style="font-size: large;">skewed left</span></strong> because the tail is on the left side of the graph. Graph C is <strong><span style="font-size: large;">skewed right</span></strong> because its tail is on the right side. </span><br /><br /><span style="color: blue; font-family: Arial;">When we describe a distribution we try to describe its <strong><span style="font-size: large;">shape</span></strong>, <strong><span style="font-size: large;">center</span></strong>, and <strong><span style="font-size: large;">spread</span></strong>, plus anything <strong><span style="background-color: #cfe2f3;">unusual</span></strong> about it, such as outliers. This will be the subject of my next post.</span><br /><br /><span style="color: blue; font-family: Arial;">(<u>Self-Test</u>): Which of the distributions pictured above <em>might</em> have outliers?</span><br /><span style="color: blue; font-family: Arial;">(<u>Answer</u>): Skewed graphs, like A and C, depending on the length of their tails. 
The longer the tail, the more likely there are outliers.</span>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-86499185016854512702011-08-14T10:30:00.000-07:002011-08-14T10:30:01.208-07:00Picturing Quantitative Data<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Just in case you didn't read the post "Picturing Categorical Data," here's a tiny bit of rehash: The first thing to note is that different types of data are pictured in different ways. You might recall from an earlier posting that there are two types of data: </span><br /><ul><li><span style="font-family: Arial;"><span style="color: blue;"><strong>Quantitative data</strong> are numbers on which it makes sense to do arithmetic. For example, with test scores, dollars, or miles, it is meaningful to, say, find their average. Quantitative data usually have units.</span></span></li>
<span style="color: blue;"> </span>
<li><span style="font-family: Arial;"><span style="color: blue;"><strong>Categorical data</strong> are the levels (categories) of variables that you do not combine with arithmetic. You usually count them. Examples: eye colors (blue, brown, green, hazel), education levels (high school, undergrad, graduate, etc.). Some categorical variables are numbers, but not ones that make sense to add; examples are zip codes, numbers on athletic jerseys, and phone numbers.</span></span></li>
<span style="color: blue;"> </span></ul><span style="color: blue; font-family: Arial;">This posting deals with quantitative data only. We dealt with categorical data in an earlier post. </span><br /><span style="color: blue;"><br /></span><span style="color: blue; font-family: Arial;">There are many ways to picture quantitative data, but let me introduce the two that are most used: the <strong><span style="font-size: large;">boxplot</span></strong> and the <strong><span style="font-size: large;">histogram</span></strong>. In both cases, let's use the following set of scores, which are arranged in descending order.</span><br /><span style="color: blue;"><br /></span><br /><div align="center"><table border="1" cellpadding="0" cellspacing="0" class="MsoNormalTable" style="border-collapse: collapse; border: currentColor; margin: auto auto auto -5.3pt; mso-border-alt: solid black .5pt; mso-border-insideh: .5pt solid black; mso-border-insidev: .5pt solid black; mso-padding-alt: 0in 5.4pt 0in 5.4pt;"><tbody><span style="color: blue;"> </span>
<tr style="mso-yfti-firstrow: yes; mso-yfti-irow: 0;"><span style="color: blue;"> </span><td style="background-color: transparent; border: 1pt solid black; mso-border-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">98<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: black black black rgb(0, 0, 0); border-style: solid solid solid none; border-width: 1pt 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">86<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: black black black rgb(0, 0, 0); border-style: solid solid solid none; border-width: 1pt 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">81<o:p></o:p></span></span></div></td><span style="color: blue;"> </span></tr>
<span style="color: blue;"> </span>
<tr style="mso-yfti-irow: 1;"><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black; border-style: none solid solid; border-width: 0px 1pt 1pt; mso-border-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">95<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">86<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">75<o:p></o:p></span></span></div></td><span style="color: blue;"> </span></tr>
<span style="color: blue;"> </span>
<tr style="mso-yfti-irow: 2;"><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black; border-style: none solid solid; border-width: 0px 1pt 1pt; mso-border-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">92<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">86<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">75<o:p></o:p></span></span></div></td><span style="color: blue;"> </span></tr>
<span style="color: blue;"> </span>
<tr style="mso-yfti-irow: 3;"><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black; border-style: none solid solid; border-width: 0px 1pt 1pt; mso-border-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">91<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">85<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">75<o:p></o:p></span></span></div></td><span style="color: blue;"> </span></tr>
<span style="color: blue;"> </span>
<tr style="mso-yfti-irow: 4;"><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black; border-style: none solid solid; border-width: 0px 1pt 1pt; mso-border-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">91<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">82<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">74<o:p></o:p></span></span></div></td><span style="color: blue;"> </span></tr>
<span style="color: blue;"> </span>
<tr style="mso-yfti-irow: 5;"><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black; border-style: none solid solid; border-width: 0px 1pt 1pt; mso-border-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">90<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">82<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">73<o:p></o:p></span></span></div></td><span style="color: blue;"> </span></tr>
<span style="color: blue;"> </span>
<tr style="mso-yfti-irow: 6; mso-yfti-lastrow: yes;"><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black; border-style: none solid solid; border-width: 0px 1pt 1pt; mso-border-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">89<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">81<o:p></o:p></span></span></div></td><span style="color: blue;"> </span><td style="background-color: transparent; border-color: rgb(0, 0, 0) black black rgb(0, 0, 0); border-style: none solid solid none; border-width: 0px 1pt 1pt 0px; mso-border-alt: solid black .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt; width: 96pt;" valign="bottom" width="128"><div align="center" class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt; text-align: center;"><span style="font-family: "Tahoma","sans-serif";"><span style="color: blue;">72<o:p></o:p></span></span></div></td><span style="color: blue;"> </span></tr>
<span style="color: blue;"> </span></tbody></table></div><span style="color: blue; font-family: "Copperplate Gothic Light","sans-serif"; font-size: 12pt; line-height: 115%; mso-ansi-language: EN-US; mso-bidi-font-family: "Copperplate Gothic Light"; mso-bidi-language: AR-SA; mso-fareast-font-family: "Times New Roman"; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-fareast;"><v:shapetype coordsize="21600,21600" filled="f" id="_x0000_t75" o:preferrelative="t" o:spt="75" path="m@4@5l@4@11@9@11@9@5xe" stroked="f"> </v:shapetype></span><br /><span style="font-family: "Copperplate Gothic Light","sans-serif"; line-height: 115%; mso-ansi-language: EN-US; mso-bidi-font-family: "Copperplate Gothic Light"; mso-bidi-language: AR-SA; mso-fareast-font-family: "Times New Roman"; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-fareast;"><v:shapetype coordsize="21600,21600" filled="f" o:preferrelative="t" o:spt="75" path="m@4@5l@4@11@9@11@9@5xe" stroked="f"><v:stroke joinstyle="miter"><span style="color: blue; font-family: Arial, Helvetica, sans-serif; font-size: large;"><strong>Boxplots</strong></span></v:stroke></v:shapetype></span><br /><br /><span style="font-family: "Copperplate Gothic Light","sans-serif"; font-size: 12pt; line-height: 115%; mso-ansi-language: EN-US; mso-bidi-font-family: "Copperplate Gothic Light"; mso-bidi-language: AR-SA; mso-fareast-font-family: "Times New Roman"; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-fareast;"><v:shapetype coordsize="21600,21600" filled="f" o:preferrelative="t" o:spt="75" path="m@4@5l@4@11@9@11@9@5xe" stroked="f"><v:stroke joinstyle="miter"><span style="color: blue;"><span style="font-family: Arial, Helvetica, sans-serif;">(<u>Self-Test</u>): In the earlier post, entitled "Percentiles and Quartiles," we discussed the 5 number summary, consisting of which 5 measures?</span> <v:formulas> <v:f eqn="if lineDrawn pixelLineWidth 0"> <v:f eqn="sum @0 1 0"> <v:f eqn="sum 0 0 @1"> <v:f eqn="prod @2 1 2"> 
<v:f eqn="prod @3 21600 pixelWidth"> <v:f eqn="prod @3 21600 pixelHeight"> <v:f eqn="sum @0 0 1"> <v:f eqn="prod @6 1 2"> <v:f eqn="prod @7 21600 pixelWidth"> <v:f eqn="sum @8 21600 0"> <v:f eqn="prod @7 21600 pixelHeight"> <v:f eqn="sum @10 21600 0"> </v:f></v:f></v:f></v:f></v:f></v:f></v:f></v:f></v:f></v:f></v:f></v:f></v:formulas> <v:path gradientshapeok="t" o:connecttype="rect" o:extrusionok="f"> <o:lock aspectratio="t" v:ext="edit"> </o:lock></v:path></span></v:stroke></v:shapetype></span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">(<u>Answer</u>): the minimum, the 1st quartile (Q1), the median, the 3rd quartile (Q3), and the maximum.</span><br /><br /><span style="color: blue; font-family: Arial;">A boxplot uses these 5 numbers, arranged on a number line according to their values. As we discussed, each of the four regions marked off by these 5 numbers contains 1/4 of the values in the data set, even if the regions don't have the same width.</span><br /><br /><span style="color: blue; font-family: Arial;">There are 21 numbers shown above. The minimum is 72 and the maximum is 98. Since the numbers are arranged in numerical order, we can count to the 11th number to find the median: 85.</span><br /><br /><span style="color: blue; font-family: Arial;">We now have 10 numbers on each side of the median. 
The 1st and 3rd quartiles are the medians of the lower 10 and upper 10 numbers, respectively. Q1 is 75, and Q3 is midway between 90 and 91. Remember: just look between the 5th and 6th numbers, counting from the bottom for Q1 and from the top for Q3.</span><br /><br /><span style="color: blue; font-family: Arial;">So, our 5-number summary is: 72 (min), 75 (Q1), 85 (median), 90.5 (Q3), and 98 (max). To make a boxplot, draw a number line that stretches from, say, 70 to 100 and place the 5 numbers in their proper places. Then draw a box around the middle 3 numbers to show the span of the middle 50% of the number set:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-ers6Ca47RCU/TjW16l4R_0I/AAAAAAAAAAc/Jzqso0W-E-Q/s1600/boxplot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="59" src="http://1.bp.blogspot.com/-ers6Ca47RCU/TjW16l4R_0I/AAAAAAAAAAc/Jzqso0W-E-Q/s320/boxplot.jpg" width="320" /></a></div><span style="color: blue; font-family: Arial;">A boxplot is helpful not only to see the spread of the data set, but also to see the symmetry of the data. For example, the stretch between the minimum and the median is about the same size as the stretch between the median and the maximum, which tells us that the mean lines up pretty well with the median. (Recall that it doesn't always end up this way!) The outer regions (which we call the "whiskers") are uneven, telling us that the data looks more heavily weighted toward the low side of the number line. 
If the boxplot looked like the following, the data would be roughly symmetric:</span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-pI8y69_AMCo/TjW4fy44YcI/AAAAAAAAAAg/H1zv_rL1o4M/s1600/symmetricboxplot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="color: blue;"><img border="0" height="52" src="http://1.bp.blogspot.com/-pI8y69_AMCo/TjW4fy44YcI/AAAAAAAAAAg/H1zv_rL1o4M/s320/symmetricboxplot.jpg" width="320" /></span></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><span style="font-family: Arial, Helvetica, sans-serif;"></span><div class="separator" style="clear: both; text-align: left;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Symmetry in a data set is important to recognize, because we can assume other important traits. More about this in a future post.</span></div><div style="text-align: left;"><br /><span style="color: blue;"><span style="font-family: Arial;">Now, if a data set has outliers, they are shown with a symbol; for instance, a star</span><span style="font-family: Arial;">:</span></span></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-zhG1zUxjJZs/TkgDvcPwHsI/AAAAAAAAABs/WH5o4SR54T8/s1600/outlierboxplot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-zhG1zUxjJZs/TkgDvcPwHsI/AAAAAAAAABs/WH5o4SR54T8/s1600/outlierboxplot.jpg" /></a></div><br /><span style="color: blue; font-family: Arial;">In cases like this, a new maximum that lies within the boundary for outliers is determined. As I said in a previous post, you should look for possible reasons for outliers. 
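<br /><br />The 5-number summary worked out above can be double-checked with a few lines of code. What follows is only a minimal sketch in Python (the original post does everything by hand, and the function name <code>five_number_summary</code> is just an illustrative choice). It assumes the same quartile rule described in this post: Q1 and Q3 are the medians of the lower and upper halves, leaving the overall median out when the count is odd. Other software may use a slightly different quartile convention and give slightly different Q1 and Q3 values.

```python
from statistics import median

# The 21 values from the table above, typed in row by row.
values = [
    98, 86, 81,
    95, 86, 75,
    92, 86, 75,
    91, 85, 75,
    91, 82, 74,
    90, 82, 73,
    89, 81, 72,
]


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # the values below the median position
    upper = ordered[half + n % 2:]    # the values above it (median skipped when n is odd)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


summary = five_number_summary(values)
print(summary)
```

With the 21 values in the table, this should reproduce the summary found by hand: 72, 75, 85, 90.5, and 98.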
</span><br /><br /><span style="color: blue; font-family: Arial; font-size: large;"><strong>Histograms</strong></span><br /><br /><span style="color: blue; font-family: Arial;">A histogram tracks the number of observations (like the values in the above table) that lie within consecutive, same-width intervals on a number line. While most people do not make histograms by hand, we'll go through how it's done so you can understand what a histogram looks like. In statistics, histograms are used heavily, usually to see the shape of the data distribution.</span><br /><br /><span style="color: blue; font-family: Arial;">Start by examining the data values. They range from 72 to 98, and the <strong>range</strong> (as defined in statistics) is 98 - 72 or 26. We want to divide 26 into a number of equal-sized intervals, then find out how many of the numbers in the data set lie in each interval. We will represent that number as the height of that interval's bar.</span><br /><br /><span style="color: blue; font-family: Arial;">How many intervals do we need? Well, to me, that's a "squishy" question. You want enough to be able to see a shape, but not so many that you hardly have any values in each interval. A rule of thumb that I use is between 6 and 8. Because our range is 26, I'm going to use 7 intervals: 7*4 is 28, which covers the range and is closer to 26 than 6*5 = 30 or 8*4 = 32. So we have the following number line. 
Each interval will be 4 units wide (again, because 7*4 is 28), so I've numbered the line accordingly.</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-BJXQoqDMrkU/Tkf9LJnakPI/AAAAAAAAABc/cJwLOWbYwlQ/s1600/histo1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="color: blue;"><img border="0" height="144" src="http://2.bp.blogspot.com/-BJXQoqDMrkU/Tkf9LJnakPI/AAAAAAAAABc/cJwLOWbYwlQ/s320/histo1.jpg" width="320" /></span></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">(<u>Self-Test</u>): What's with the // on the number line above?</span><br /><span style="color: blue; font-family: Arial;">(<u>Answer</u>): Number lines should always be to scale, but in this case a number line going from 0 to 100 probably wouldn't fit on the page. So, I use the // to indicate a break in the numbering.</span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Now I need to make a small decision: Shall I count the labelled numbers as the start or the end of the interval? It doesn't matter, just as long as you're consistent. I'm going to count the numbers as the start of each interval. So, my first interval will go from 72 to 75; my second will be from 76 to 79, and so on.</span><br /><br /><span style="color: blue; font-family: Arial;">For each interval, you need to count how many values in the data set fall within that interval. For example, there are 6 numbers that are between 72 and 75, inclusive. So the height of my bar for the first interval will be 6 units high. 
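<br /><br />The same counting can be done for every interval at once. Here is a minimal sketch in Python, not part of the original hand method; the variable names are illustrative. It assumes the choices made above: 7 intervals, each 4 units wide, with the first one starting at 72.

```python
# The 21 values from the table above, typed in row by row.
values = [
    98, 86, 81,
    95, 86, 75,
    92, 86, 75,
    91, 85, 75,
    91, 82, 74,
    90, 82, 73,
    89, 81, 72,
]

start, width, n_bins = 72, 4, 7   # first interval is 72-75, second is 76-79, ...

# Count how many values fall in each interval.
counts = [0] * n_bins
for v in values:
    counts[(v - start) // width] += 1

# Print a sideways text histogram: one row per interval, one '#' per value.
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low}-{low + width - 1}: {'#' * count} ({count})")
```

Each row of output corresponds to one bar; an interval with no values, such as 76 to 79, shows up as an empty row.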
</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-n9pMSTS2Qtk/Tkf_gB0EgKI/AAAAAAAAABg/E6twriksyrM/s1600/histo2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="color: blue;"><img border="0" height="217" src="http://1.bp.blogspot.com/-n9pMSTS2Qtk/Tkf_gB0EgKI/AAAAAAAAABg/E6twriksyrM/s320/histo2.jpg" width="320" /></span></a></div><span style="color: blue; font-family: Arial;">Our second interval, which goes from 76 to 79, has no values. So we leave a space to show that zero values are in that interval. The next interval goes from 80 to 83, and there are four values that lie in that interval. So, the bar will be 4 units high.</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-F27hFycqFWs/TkgAcuXYt7I/AAAAAAAAABk/oUIDSBPXjgs/s1600/histo3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="color: blue;"><img border="0" height="209" src="http://3.bp.blogspot.com/-F27hFycqFWs/TkgAcuXYt7I/AAAAAAAAABk/oUIDSBPXjgs/s320/histo3.jpg" width="320" /></span></a></div><span style="color: blue; font-family: Arial;">Continuing on for the rest of the intervals, here is our finished product:</span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Q9nHmAzxM0k/TkgBwR64cbI/AAAAAAAAABo/afXpO5SvhSc/s1600/histo4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="204" src="http://2.bp.blogspot.com/-Q9nHmAzxM0k/TkgBwR64cbI/AAAAAAAAABo/afXpO5SvhSc/s320/histo4.jpg" width="320" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">This histogram has an interesting shape. If the dataset represents grades, we can assume that there are 6 out of 21 "C" students, while twice that many are in the "B" to "A-" range. I'll be saying more about shapes of distributions in a later post. 
For now, suffice it to say that the shape of a distribution is an important thing to assess when you're looking at data.</span>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-79361401203157279072011-08-12T13:37:00.000-07:002011-08-12T13:37:57.526-07:00Keeping it Simple -- the Area Principle<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">If you made it through the [rather long] post entitled "Picturing Categorical Data," this next post brings out a fine point about making graphical displays. If you look through articles and newspapers, or perhaps at slides that people make in your office for presentations, you might see a tendency for folks to want to make them fancy. While this is admirable -- trying to show potentially dry information in a more splashy way -- it can end up being misleading. Read on.</span><br /><br /><span style="font-family: Arial;"><span style="color: blue;">Take a look at this pie chart, which graphs quantities that are 10%, 20%, 30%, and 40% of the whole.</span> </span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-epemwbSkZ4s/TkWL7ehit0I/AAAAAAAAABI/7rmbUF3FPvU/s1600/2Dpie.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="http://1.bp.blogspot.com/-epemwbSkZ4s/TkWL7ehit0I/AAAAAAAAABI/7rmbUF3FPvU/s320/2Dpie.jpg" width="320" /></a></div><br /><span style="color: blue; font-family: Arial;">Now compare it to its flashy 3-D counterpart:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-SHpw-D4fWQ4/TkWMGBY5xgI/AAAAAAAAABM/r1s7_2y5JZI/s1600/3Dpie.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="http://2.bp.blogspot.com/-SHpw-D4fWQ4/TkWMGBY5xgI/AAAAAAAAABM/r1s7_2y5JZI/s320/3Dpie.jpg" width="320" /></a></div><span style="color: blue; font-family: Arial, Helvetica, 
sans-serif;">"What's the difference?" you might ask. However, I would argue that the 3-D version makes the green slice (30%) look larger than the purple (40%) slice. Can you see it? This doesn't always happen with 3-D displays, but 3-D displays are prone to this. It's something that you want to look out for, just in case.</span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">When a smaller pie slice or bar (in a bar chart) looks larger than a slice or bar that represents a larger quantity, we say that the display violates the <strong><span style="font-size: large;">Area Principle</span></strong>. In the first display, each piece was proportionately sized relative to the others. </span><br /><br /><span style="color: blue; font-family: Arial;">Why make this point? When trying to communicate something graphically, the main point is to get the information across with as little potential confusion as possible -- not to impress people with fancy pictures. Statistics can be mind-boggling to many, so why not try to make things as straightforward as possible? It's the old K.I.S.S. principle. Statistics-challenged people will thank you! (As I'm sure you're thanking me for the short post!)</span><br /><br />Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-51166513643970143362011-08-10T11:03:00.000-07:002011-08-10T11:06:42.302-07:00Has the Stock Market Hit Bottom?<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Oh, if only there were a nice statistical way to answer that question! 
In the August 10, 2011 issue of USA Today, the article at </span><a href="http://www.usatoday.com/money/markets/2011-08-09-has-market-hit-bottom_n.htm"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">http://www.usatoday.com/money/markets/2011-08-09-has-market-hit-bottom_n.htm</span></a><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> offered a lot of insights from experts, but the authors were wise enough to know that there’s just no telling whether the stock market will rise or fall, or whether better days are to come yet. </span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> </span><br /><div class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-family: Arial, Helvetica, sans-serif;"><span style="color: blue;">For anyone who’s not aware, the Dow Jones Industrial Average (DJIA) plummeted 535 points on Monday, following the bad news that Standard & Poor’s had downgraded the US credit rating from AAA to AA+, and probably a multitude of other events that stripped investors of their confidence. The following day, August 9, the DJIA gained back 435 points, bringing it closer to pre-plummet levels. At this writing (Wednesday, August 10, 2011 at 1:35 p.m.), the Dow is down about 357 points. <span style="mso-spacerun: yes;"> </span></span></span></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> </span><br /><div class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">The insights offered by experts as to why the market is plummeting range from emotional sell-offs in light of the US credit rating downgrade, to the state of the overall world economy, to the lack of warm fuzzies in Congress, but no one is making the mistake of trying to predict the future. 
Nothing, not even statistics, can do that reliably.</span></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> </span><br /><div class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Theoretically, even if somehow we were able to identify *all* of the factors (“variables” we call them) that make for fluctuations in the stock market, and then we found some realistic way to quantify each of the factors, we could come up with a [probably quite complicated] equation involving all of those variables that we could use to predict future behavior, but it would be just that – a prediction. You simply cannot use statistics to foretell the future. The best one can do is make a fact-based, educated guess.</span></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> </span><br /><div class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Similarly, you cannot use the past to predict the future. Most of us fall into this trap pretty easily. For example, a lot of people moved their investments to Treasury-related securities, because Treasury issues have shown a steadily upward trend without the volatility of riskier investments, like stocks. In fact, in the past 10 years, inflation-protected securities (TIPS) have out-performed the Dow, the Nasdaq, and the S&P 500 significantly…at about half the risk (as rated by Vanguard). <span style="mso-spacerun: yes;"> </span>So are you rushing to your favorite investment site to move your money? If you are, then you’re using past performance to predict the future! Sometimes it works, but just as often it doesn’t. </span></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> </span><br /><div class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Has the stock market hit bottom? 
Read the article, and you be the judge. Just trying to set your expectations about what statistics can and cannot do!</span></div>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-80753366748365903032011-08-07T14:51:00.000-07:002011-08-07T14:51:37.746-07:00Picturing Categorical Data<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">They say a picture is worth a thousand words, and that's certainly true with statistical data! The first thing to note is that different types of data are pictured in different ways. You might recall from an earlier posting that there are two types of data: </span><br /><ul><li><span style="font-family: Arial;"><span style="color: blue;"><span style="font-size: large;">Categorical data</span> are sub-levels of variables that you do not combine with arithmetic. You usually count them. Examples: eye colors (blue, brown, green, hazel, gray), education levels (high school, undergrad, graduate, etc.). Some categorical variables are numbers, but not ones that make sense to add; examples are zip codes, numbers on athletic jerseys, and phone numbers.</span></span></li><span style="color: blue;"> </span><li><span style="font-family: Arial;"><span style="color: blue;"><span style="font-size: large;">Quantitative data</span> are numbers for which it makes sense to do arithmetic on them. For example, with test scores, dollars, miles, etc., it is meaningful to, say, find their average. Quantitative data usually have labels.</span></span></li><span style="color: blue;"> </span></ul><span style="color: blue; font-family: Arial;">In this post, we'll picture categorical data only; I'll cover quantitative data in another post.</span><span style="color: blue;"><br /></span><br /><span style="color: blue; font-family: Arial;">There are two ways to "graphically" picture categorical data: bar charts and pie charts. You see these all the time in articles. 
Let's use the following base data that shows eye colors in a group of 115 people:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-y7MSnJphv-g/Tj7vZ1dIyXI/AAAAAAAAAAk/-M3AQMJwqxo/s1600/eyecolortable.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-y7MSnJphv-g/Tj7vZ1dIyXI/AAAAAAAAAAk/-M3AQMJwqxo/s1600/eyecolortable.jpg" /></a></div><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"><strong><u>Bar Charts</u></strong></span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">The most straightforward type of display is a bar chart. Bar charts can portray the actual numbers...</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-y79gT0K3_uU/Tj7v3hafHnI/AAAAAAAAAAo/ZoDKKFdIaJk/s1600/eyecolorbar.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="187" src="http://1.bp.blogspot.com/-y79gT0K3_uU/Tj7v3hafHnI/AAAAAAAAAAo/ZoDKKFdIaJk/s320/eyecolorbar.jpg" width="320" /></a></div><br /><span style="color: blue; font-family: Arial;">...or they can portray percents of the whole (115)...</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-Wx1KRPxJtG0/Tj7wFTiBJpI/AAAAAAAAAAs/9DkRvm5D6FE/s1600/eyecolorsidebar.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="http://1.bp.blogspot.com/-Wx1KRPxJtG0/Tj7wFTiBJpI/AAAAAAAAAAs/9DkRvm5D6FE/s320/eyecolorsidebar.jpg" width="320" /></a></div><br /><span style="color: blue; font-family: Arial;">Notice that the bars can be vertical or horizontal. In either case, the bars are arranged in either increasing or decreasing size. 
</span><br /><br /><span style="color: blue; font-family: Arial;">If you have a number of very small bars, you can put them together as a larger combined bar and label it "Other." Such a bar doesn't need to be placed in size order; it usually appears as the last bar. Suppose, for example, you want to put green, hazel, and gray together as an "Other" bar. Then your graph would look like this:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-wcA-V8OdsBY/Tj7xaTF5EDI/AAAAAAAAAAw/ewsNlVwwFEA/s1600/eyecolorother.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="http://3.bp.blogspot.com/-wcA-V8OdsBY/Tj7xaTF5EDI/AAAAAAAAAAw/ewsNlVwwFEA/s320/eyecolorother.jpg" width="320" /></a></div><br /><span style="color: blue; font-family: Arial;">Sometimes you will see a single bar with sections representing each category in proportion. This is called a segmented (or stacked) bar chart. It can appear with the actual counts (height of the single bar is equal to the sum of the counts), or as percentages (height of the single bar represents 100%). Here's how a segmented bar chart with percents would look. Notice that it's good to arrange the segments in decreasing order from bottom to top:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-6o9TKc-LZgo/Tj71veGWygI/AAAAAAAAAA0/bSkdhYIvJSM/s1600/eyecolorseg.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://4.bp.blogspot.com/-6o9TKc-LZgo/Tj71veGWygI/AAAAAAAAAA0/bSkdhYIvJSM/s320/eyecolorseg.jpg" width="129" /></a></div><span style="color: blue; font-family: Arial;">Segmented bar charts are especially helpful in comparing the same categories in two or more different groups. Suppose we had a second group of people whose segmented bar chart of eye colors looked slightly different. 
We could put the bars side by side in a single display, with the eye colors in the same order.</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-UVshoDRj_Ws/Tj73b695poI/AAAAAAAAAA4/mw28MGrBCAM/s1600/eyecolorseg2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://1.bp.blogspot.com/-UVshoDRj_Ws/Tj73b695poI/AAAAAAAAAA4/mw28MGrBCAM/s320/eyecolorseg2.jpg" width="217" /></a></div><br /><span style="color: blue; font-family: Arial;">It is easy to see that Group 2 has a smaller share of brown-eyed people but a larger share of blue-eyed people, about the same share of green- and gray-eyed people, and a smaller share with hazel eyes.</span><br /><br /><span style="color: blue; font-family: Arial;">One caution with two or more segmented bars: Unless each group is exactly the same size, you should use percents rather than counts. Otherwise it would be nearly impossible to compare the bars.</span><br /><br /><span style="color: blue; font-family: Arial;"><strong><u>Pie Charts</u></strong></span><br /><span style="color: blue; font-family: Arial;">A pie chart shows each category as a proportional-sized pie slice. Pie charts always use percents. 
So, a pie chart for our eye color data would look like this:</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-HqozfUZe3n8/Tj75HqYPQSI/AAAAAAAAAA8/cPZ6uM5WNgI/s1600/eyecolorpie.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="193" src="http://1.bp.blogspot.com/-HqozfUZe3n8/Tj75HqYPQSI/AAAAAAAAAA8/cPZ6uM5WNgI/s320/eyecolorpie.jpg" width="320" /></a></div><br /><span style="color: blue; font-family: Arial;">Notice that the pieces are arranged in order of size as you go around the pie.</span><br /><br /><span style="color: blue; font-family: Arial;"><u>(Self-Test)</u>: Suppose there's another group -- this time consisting of 140 people -- whose eye colors are as follows. Make a "Group 3" segmented bar chart to show the differences.</span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-ME8rD_F_NVg/Tj8Iu4lIuqI/AAAAAAAAABE/OYWd7j9uiPM/s1600/eyecolortable2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-ME8rD_F_NVg/Tj8Iu4lIuqI/AAAAAAAAABE/OYWd7j9uiPM/s1600/eyecolortable2.jpg" /></a></div><br /><span style="color: blue; font-family: Arial;">(<u>Answer</u>): See below.</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-In0eYSx0uRc/Tj8HUXHvkRI/AAAAAAAAABA/81iRWaproZs/s1600/eyecolorseg3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://1.bp.blogspot.com/-In0eYSx0uRc/Tj8HUXHvkRI/AAAAAAAAABA/81iRWaproZs/s320/eyecolorseg3.jpg" width="255" /></a></div>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-6521498894197112062011-07-31T14:10:00.000-07:002011-07-31T14:10:30.557-07:00The Interquartile Range and Outliers<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">In the previous post, I 
introduced percentiles and quartiles and said that the Interquartile Range (IQR) is found by subtracting Q1 (the first quartile) from Q3 (the third quartile). The IQR is significant because it tells us where the middle 50% of the numbers in the data set lie. This is especially helpful to know if there are extreme values at the low and/or high end of the data set.</span><br /><br /><span style="color: blue; font-family: Arial;">Take the very short data set 40, 80, 86, 88, 100. Notice that the 40 is extremely far away from the other numbers; it is considered an extreme value. It would make the average (mean) unreasonably small with respect to the majority of the other numbers. We call 40 in this set an <strong><span style="font-size: large;">outlier</span></strong>. Outliers can occur by coincidence but just as often there is some reason behind them. Statisticians should at least try to see if there's an explanation when they run into an outlier. Then they can analyze the data set with that in mind. Sometimes it's good to do two analyses: one with the outlier, and one without it.</span><br /><br /><span style="color: blue; font-family: Arial;">How do we tell if a number is extreme enough to be an outlier? Well, believe it or not, there is a mathematical way and it's pretty straightforward. To test for outliers, we need to know what the 1st and 3rd quartiles are, so we can compute the IQR. 
Here are the steps:</span><br /><ol><li><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Find the IQR.</span></li><li><span style="color: blue; font-family: Arial;">Multiply the IQR by 1.5.</span></li><li><span style="color: blue; font-family: Arial;">Add the resulting number to Q3 to get an upper boundary for outliers.</span></li><li><span style="color: blue; font-family: Arial;">Subtract the same resulting number (from #2) from Q1 to get a lower boundary for outliers.</span></li><li><span style="color: blue; font-family: Arial;">If a number in the data set lies beyond either boundary, it is considered an outlier.</span></li></ol><span style="color: blue; font-family: Arial;">In the example above (40, 80, 86, 88, 100), Q1 is 80 and Q3 is 88. Going through the steps...</span><br /><br /><ol><li><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Q3 - Q1 = 88 - 80 = 8 (The IQR)</span></li><li><span style="color: blue; font-family: Arial;">IQR * 1.5 = 8 * 1.5 = 12</span></li><li><span style="color: blue; font-family: Arial;">Q3 + 12 = 88 + 12 = 100. This is the upper boundary for outliers. Since our maximum is right at 100, not beyond it, we have no high outliers.</span></li><li><span style="color: blue; font-family: Arial;">Q1 - 12 = 80 - 12 = 68. This is the lower boundary for outliers. Since our minimum number, 40, is less than 68, our data set has one low outlier: 40.</span></li></ol><span style="color: blue; font-family: Arial;">Note that there can be any number of outliers: one on each side, two high ones and no low ones, and so forth. If there are outliers, we note them as such, try to explain them if possible, and restate the minimum and maximum at the boundaries we found.</span><br /><br /><span style="font-family: Arial;"><span style="color: purple;"><u>Moral of the Story</u>: How do we try to explain outliers? 
The best way is to examine the data: where it came from, what it represents, and where and how it was collected. In other words, look at the <strong><span style="font-size: large;">context</span></strong>. An outlier could turn out to be a simple mis-keying error, or might indicate something more serious. Don't make something up! But do have a look to see if there's something obvious going on.</span></span>Judy Cramernoreply@blogger.com5tag:blogger.com,1999:blog-464276035979449595.post-82668414784463239002011-07-30T08:04:00.000-07:002011-07-31T13:22:09.482-07:00Connection or Causation?<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">I was reading an article the other day about a possible connection between mental illness and nose jobs, and thought, "Aha! A perfect possible example of crazy statistics!" But I was not only disappointed on that count, but also pleasantly surprised that the article showed valid statistical practices. What I can focus on in this post are two things:</span><br /><ol><li><span style="color: blue; font-family: Arial;">What valid statistical practices were used?</span></li><li><span style="color: blue; font-family: Arial;">How to avoid common pitfalls when reading an article that involves statistics.</span></li></ol><span style="color: blue; font-family: Arial;">First, you might want to quickly read the article:</span><br /><span style="font-family: Arial;"><a href="http://well.blogs.nytimes.com/2011/07/27/some-nose-job-patients-may-have-mental-illness/"><span style="color: blue;">http://well.blogs.nytimes.com/2011/07/27/some-nose-job-patients-may-have-mental-illness/</span></a></span><br /><span style="color: blue; font-family: Arial;">Go ahead; I'll wait.</span><br /><br /><span style="color: blue; font-family: Arial;">There are enough bad statistical articles out there, which I'm sure I'll happen upon and grab for another post, but as I said, this article is not one of them. 
First, the author was careful to qualify the difference between people who have valid medical reasons to seek plastic surgery on their noses and those who appear to be negatively obsessed with a perfectly normal-looking (from a plastic surgeon's standpoint) nose. The author was careful to separate the two.</span><br /><br /><span style="color: blue; font-family: Arial;">This was a "controlled" study, as statisticians say. First, there was a rather large sample (266) of patients seeking nose jobs in Belgium, who filled out a diagnostic questionnaire that was intended to uncover the condition (called Body Dysmorphic Disorder (BDD)). The author included all of the vital information -- number of people, location of the study, duration of the study, and a link to the full journal article so that the study could be reproduced if desired. Rather than worrying that a duplicate study will refute the findings of the original study, most statisticians welcome "do-overs" of the study, because they can either strengthen their findings or point out something they might have overlooked. After all, from an ethical standpoint, most researchers are after the truth, so they try to make the specifics as transparent as possible.</span><br /><br /><span style="color: blue; font-family: Arial;">Separating those patients who had a "valid" complaint about their nose (for instance, a breathing problem) from those who showed signs of BDD was an excellent example of controlling the study. For example, if they hadn't done this and went on conducting the study, they couldn't have separated the "valid" patients from the BDD patients later. By controlling the study, they were able to determine that only 2% of the "valid" patients showed evidence of BDD while 43% of the "invalid" patients did. If everyone had been lumped together, the results might have been diluted. Researchers need to think of possible variables <em>ahead of time </em>that could muddy the waters later. 
The variable of "valid" versus "invalid" was one of these. We call these variables <span style="font-size: large;"><strong>lurking variables</strong></span>, because they can work under the surface, making it look like one thing (seeking a nose job) is <u>causing</u> the other (BDD)... or vice versa.</span><br /><br /><span style="color: blue; font-family: Arial;">Another interesting and admirable thing the article did was anticipate a question the reader might have had upon reading it; that is, it singled out nose jobs from other plastic surgery as being notable with respect to this connection. The researchers seem to have thought things through very well.</span><br /><br /><span style="color: blue; font-family: Arial;">Now, on to you as a reader of such articles. My only caution in this case is that you don't read any type of cause and effect situation into the article. Having BDD doesn't necessarily <u>cause</u> people to seek a nose job, although there seems to be a very strong connection between the two. Further study and repetition of the study would be needed to establish cause and effect. To the author's credit, the article was careful not to lead you in the causation direction.</span><br /><br /><span style="color: blue; font-family: Arial;">Simply put: Just because two variables appear to be connected doesn't prove that one variable <u>causes</u> another. For those readers who recall the "old days" when people smoked cigarettes and didn't know about the dangers of lung cancer, remember how it took years to get even the mildest cautionary note placed on cigarette packs? The first such caution cited a <u>connection</u> between cigarette smoking and lung cancer. 
After years of further controlled study, researchers were finally able to put stronger cautions on cigarette packs: that cigarette smoking <u>causes</u> lung cancer.</span><br /><br /><span style="font-family: Arial;"><span style="color: purple;"><u>Moral of the Story</u>: This was an example of an article in which the author was very careful not to imply cause and effect. Other articles aren't so clear. As you read articles that involve numbers and statistics, be aware of this and don't jump to causation conclusions!</span></span><br /><br /><span style="font-family: Arial;"><span style="color: blue;"><u>Self-Test</u>: Name a lurking variable in this statement: "The more firefighters sent to a fire, the more damage is done to the structure."</span></span><br /><br /><span style="color: blue; font-family: Arial;"><u>Answer</u>: Size of the fire is the lurking variable. It influences both variables: the amount of damage done, and the number of firefighters sent to the scene.</span><br /><br /><span style="color: blue; font-family: Arial; font-size: x-small;">(Credit: <u>Stats: Modeling the World</u>, Bock, Velleman, De Veaux. Thanks!)</span>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-39974970825690650862011-07-29T06:10:00.000-07:002011-07-29T06:12:59.583-07:00Percentiles and Quartiles<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Suppose you just received your GMAT scores and found that your score was at the 85th percentile. A <strong><span style="font-size: large;">percentile</span></strong> is a number that tells you what percent of scores are <u>below</u> yours. So, in this example, being at the 85th percentile means that 85% of all the GMAT scores were below yours. Good job!</span><br /><br /><span style="color: blue; font-family: Arial;">Suppose you take your baby in for a checkup with her pediatrician, who tells you that your baby's weight is at the 70th percentile. 
This means that 70% of all babies weigh less than yours. Not too hard, right?</span><br /><br /><span style="color: blue; font-family: Arial;">"Percentile" contains the word percent, which is a number out of 100. This means that the list of numbers is batched into one hundred groups: each group contains 1/100 of the values. </span><br /><br /><span style="color: blue;"><span style="font-family: Arial;">There are other "-iles" that use numbers other than 100. For example, think of the word <strong>decile</strong>. This implies, as with decimals, the number 10. Regarding baby weights, the list of all weights is batched into ten groups of weights. </span><span style="font-family: Arial;">Your baby (who was at the 70th percentile) would be at the 7th decile.</span></span><br /><br /><span style="color: blue; font-family: Arial;">In statistics, we work a lot with <strong><span style="font-size: large;">quartiles</span></strong>. This implies the number 4 (as in quarter, quartet, etc.). Imagine the numbers grouped into 4 batches (in numerical order, of course). Each quartile is the number marking off each batch. The first quartile marks the end of the lowest 25% of the numbers in the set, the second quartile marks the end of the 2nd 25% of the numbers in the set, and so on. By the way, the 2nd quartile is also known as the <span style="background-color: #cfe2f3;">median</span>, which appeared in an earlier post. The median marks the point at which half the numbers are below and half are above.</span><br /><br /><span style="color: blue; font-family: Arial;">An example using real data might help out. Think of the following scores, which I've arranged in increasing order: </span><br /><span style="color: blue; font-family: Arial;"> 52, 54, 61, 63, 68, 68, 72, 75, 82, 82, 84, 93. 
</span><br /><br /><span style="color: blue; font-family: Arial;">There are 12 scores, so divide them equally into 4 groups:</span><span style="color: blue; font-family: Arial;"> </span><br /><span style="color: blue; font-family: Arial;"> 52 54 61 / 63 68 68 / 72 75 82 / 82 84 93. </span><br /><br /><span style="color: blue; font-family: Arial;">The numbers dividing each of these groups -- 62, 70, 82 -- are the 1st, 2nd (median), and 3rd quartiles. Each one is the average of the two scores on either side of a slash; for example, Q1 = (61 + 63) / 2 = 62. (The 4th quartile is also known as the maximum.)</span><br /><br /><span style="color: blue; font-family: Arial;">Or, you could first find the median of the whole list, as illustrated in my last post. That's the 2nd quartile, also known as Q2. To find Q1, the first quartile, find the median of the lower half (i.e., from the minimum to the median). To find Q3, find the median of the upper half of the scores (i.e., from median to maximum).</span><br /><br /><span style="color: blue; font-family: Arial;">Just one more thing: The fact that each quartile contains the same <u>number</u> of scores doesn't mean that the <em>span </em>of the scores in each quartile is equal if you were to mark each on a number line. Look at the 12 scores above. The lowest fourth ranges from 52 to 61: a span of 9 units. The next fourth is only 5 units wide (63 to 68). The one after that is 10 units wide: from 72 to 82. The last group ranges from 82 to 93, 11 units wide. 
So if we were to portray the quartiles on a straight line they would appear at the slash (/) marks:</span><br /><br /><span style="color: blue; font-family: Arial;"> - - - - - - - - - / - - - - - / - - - - - - - - - - - / - - - - - - - - - - - </span><br /><span style="color: blue; font-size: x-small;">52 62 70 82 93</span><br /><span style="color: blue; font-size: x-small;"> min Q1 median Q3 max</span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">If I surround Q1, the median, and Q3 with a box, I get a picture similar to what's called a boxplot in statistics:</span><br /><span style="color: blue; font-family: Arial;"> ___________________</span><br /><span style="font-family: Arial;"><span style="color: blue;"><span style="font-family: Arial;"> |- - - - - - - - |</span><span style="font-family: Arial;"> - - - - - | - - - - - - - - - - - | - - - - - - - - - - -| </span></span><br /><span style="color: blue;"> |______|_____________| </span><br /><span style="color: blue; font-size: x-small;"> 52 62 70 82 93</span><br /><span style="color: blue; font-size: x-small;"> min Q1 median Q3 max</span></span><span style="color: blue;"><span style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;"> </span></span><br /><span style="color: blue;"><span style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;"> (Excuse the crudeness of the diagram -- I confess I have a lot to learn about HTML!)</span> </span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">By the way, the minimum, Q1, the median, Q3, and the maximum are also known as the </span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><span style="color: blue;"><strong>5-Number Summary</strong> and are considered a great way to describe any set of numerical (quantitative) data.</span></span> <span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Here's another measure that is important: the 
<strong>Interquartile Range</strong> (or <strong>IQR</strong>), which is found by subtracting Q1 from Q3 (Q3 - Q1). The IQR gives the span of the middle 50% of the data set. This comes in handy when we're trying to describe a set of numbers that have extreme values (<span style="background-color: #cfe2f3;">outliers</span>).</span><br /><br /><span style="color: purple; font-family: Arial, Helvetica, sans-serif;"><u>Moral of the Story</u>: In a boxplot, each of the 4 regions bounded by the 5 numbers contains 25% of the numbers in the data set. This does not mean that each region has the same width.</span><br /><br /><span style="font-family: Arial;"><span style="color: blue;"><u>Self-Test</u>: The median represents what percentile?</span></span><br /><br /><span style="color: blue; font-family: Arial;"><u>Answer</u>: Because the median marks the point at which half the numbers are below and half are above, the median represents the 50th percentile.</span>Judy Cramernoreply@blogger.com1tag:blogger.com,1999:blog-464276035979449595.post-47092793086660527362011-07-27T12:04:00.000-07:002011-08-27T11:33:31.691-07:00The Center: Mean, Median, and Mode<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Let's begin with what might be familiar territory: how people describe a list of numbers (<span style="background-color: #cfe2f3;">data</span>!) using a single central measure. There are three of these numbers -- <strong><span style="font-size: large;">mean</span>, <span style="font-size: large;">median</span>,</strong> and <strong><span style="font-size: large;">mode</span></strong> -- and which one is best heavily depends on the data you're describing. 
Let's go over how to determine each of these measures, and discuss the pros and cons of each.</span><br /><br /><span style="color: blue;"><span style="font-family: Arial;">Consider the following very short list of test grades one of my students received last semester: </span><span style="font-family: Arial;">74, 84, 88, 71, and 88.</span></span><br /><br /><ul><li><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">To find the <strong><span style="font-size: large;">mean</span></strong> (also known as the <span style="background-color: #cfe2f3;">average</span>), just add up all of the numbers and divide by how many numbers are in the set. That is: (74+84+88+71+88) divided by 5 = 405 / 5 = 81. The symbol we'll use for the mean is</span></li>
</ul><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-7Z_YSZ-dRpE/Tlk4WvmeUeI/AAAAAAAAADE/3R3VHBf7Vng/s1600/x-bar.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-7Z_YSZ-dRpE/Tlk4WvmeUeI/AAAAAAAAADE/3R3VHBf7Vng/s1600/x-bar.jpg" /></a></div><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> The mean is good to use when there are no extreme values (numbers that lie far </span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> outside the span of the other numbers). There are none in this set of numbers, so </span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> we say there are no <strong><span style="background-color: #cfe2f3;">outliers</span> </strong>and so the mean is just fine to use. </span><br /><ul><li><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">To find the <strong><span style="font-size: large;">median</span></strong> (also known as the <span style="background-color: #cfe2f3;">midpoint</span>), arrange the grades in numerical order: 71, 74, 84, 88, 88. Note that we list duplicates as many times as they occur. Once the numbers are arranged, find the number that's in the middle of the list. In this short list, it's 84. </span></li>
</ul><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> Now, what if there is no middle number? For example, suppose we add a sixth </span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> score: 94. We now have the list, in order: 71, 74, 84, 88, 88, 94. When we look </span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> for the middle, there isn't a single score but two: 84 and 88. In this case, find the </span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> <span style="background-color: #cfe2f3;">average</span> of these two numbers: (84+88)/2 = 86.</span><br /><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> The median is good to use almost anytime, but is especially important to use when </span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> there are outliers. To see why the median is more accurate than the mean as a </span><br /><span style="color: blue;"><span style="font-family: Arial, Helvetica, sans-serif;"> central </span><span style="font-family: Arial, Helvetica, sans-serif;">measure when there are outliers, think about the salaries of a <u>very</u> small </span></span><br /><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"> company, from the line workers to the CEO:</span><br /><br /><span style="color: blue; font-family: Arial;"> Line worker 1: $ 28,000</span><br /><span style="color: blue; font-family: Arial;"> Line worker 2: $ 32,500</span><br /><span style="color: blue; font-family: Arial;"> Line worker 3: $ 33,100</span><br /><span style="color: blue; font-family: Arial;"> Supervisor: $ 45,000</span><br /><span style="font-family: Arial;"><span style="color: blue;"> <span style="font-family: Arial;">Marketing person: $ 62,300</span></span></span><br /><span style="font-family: Arial;"><span style="color: blue;"> Sales person: $ 70,000</span><br /><span style="color: 
blue;"> CEO: $175,000</span><br /><br /><span style="color: blue;"> Compare the mean ($63,700) to the median ($45,000). The mean isn't realistic </span><br /><span style="color: blue;"> because 5 out of 7 of the workers are making less! This is because the outlier, </span><br /><span style="color: blue;"> $175,000, inflates the calculation of the mean. On the other hand, the median is </span><br /><span style="color: blue;"> much more reflective of the central salary: 3 employees make more, 3 make less.</span><br /></span><br /><span style="font-family: Arial;"></span><br /><span style="font-family: Arial;"></span><br /><span style="font-family: Arial;"></span><br /><span style="font-family: Arial;"><ul><li><span style="color: blue;">Finding the <strong><span style="font-size: large;">mode</span></strong> is easy, but it exists only if there are duplicates in the list. Simply find <span style="background-color: white;">the number that occurs the most. In our original list of 5 test scores, 88 occurs twice, so the mode is 88. In our salary example directly above, there is no mode because no salary occurs more than once. If one number occurs twice and another number occurs four times, the latter is the mode because it occurs the most.</span></span></li>
</ul><span style="background-color: white; color: blue;"> The mode is probably the least useful as a measure of center. The only time it </span><br /><span style="background-color: white; color: blue;"> makes sense is if there are many occurrences of the same number, compared to </span><br /><span style="background-color: white; color: blue;"> the number of other values. Example: 57, 66, 75, 75, 75, 75, 75, 75, 82.</span><br /><br /><span style="color: purple;"><u>Moral of the Story</u>: When you see the terms "mean" and "median" used in articles, do not assume that the writer is always using the right term. If you see the data, you can check this. Otherwise, the author might be confusing one measure for the other. Not everyone understands the difference, but now (hopefully) you do!</span></span><br /><br /><span style="font-family: Arial;"><span style="background-color: white;"><span style="color: blue;"><u>Self quiz</u>: What are the measures of center (mean, median, and mode) for the following list of 8 student scores? 43, 77, 66, 73, 85, 75, 92, 81. </span></span></span><br /><br /><span style="background-color: white; color: blue; font-family: Arial;">(<em><u>Answer</u></em>): Mean = 74; Median = 76; Mode = none. Which is the better measure? Technically, the median would be better because the low score of 43 is dragging the mean down a bit. However, it's pretty much a wash since the mean and median are so close to each other. </span>Judy Cramernoreply@blogger.com0tag:blogger.com,1999:blog-464276035979449595.post-73804272515467546962011-07-26T13:52:00.000-07:002011-07-27T12:03:13.023-07:00What is Statistics?<span style="color: blue; font-family: Arial, Helvetica, sans-serif;">In one sense, statistics is a <u>mindset</u> -- a way of looking at things that occur in the world. These "things" are usually called <strong>data</strong> or <strong>variables</strong>. 
Data can be numeric in ways that you can measure and label, like miles, dollars, feet, etc.; these are called <strong>quantitative</strong> variables. The other type of data is called <strong>categorical</strong> in that it can be divided into categories, like eye colors, levels of education, income ranges, etc.</span><br /><span style="font-family: Arial;"><br /><span style="color: blue;"> Besides being a mindset, statistics is a set of <u>methods</u> for dealing with data. For example, we might have a list of stock prices for our favorite stock over the past month. We can find the average stock price, the maximum, the minimum, or the price that's smack in the middle (which is called the median). Don't worry about the meanings of all these terms...in time we will work with them all. Right now the important thing to know is that statistics consists of a Mindset and Methods.<br /><br />Sometimes you have all the data in front of you and you want to analyze it, much like the stock prices above. This branch of statistics is called <strong>Descriptive Statistics</strong>. Other times, you only have some of the data and you want to draw reliable conclusions about all of the data. This branch of statistics is called <strong>Inferential Statistics</strong>. An example of inference is the poll that tracks how many people watch various TV shows each week. Polling companies contact a subset of TV viewers using valid statistical methods that allow them to state with some confidence what TV shows we all are watching. You'd be surprised at how few viewers are needed (compared to all the viewers in the US)!<br /><br />I am going to keep my posts short and sweet. My purpose, at the least, is to teach you a little bit about statistics in each post. My hopes are that I will demystify statistics for you and convince you that statistics doesn't have to be difficult. I will make this as fun and interactive as I can with real-life examples and self-quizzes. 
I look forward to working with you and getting your feedback to guide future posts! </span><br /><br /><span style="color: blue;"><u>Self Test</u>: For now, select the word at the top of the page that best describes your feeling about statistics.</span></span>Judy Cramernoreply@blogger.com0
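<br /><br /><span style="color: blue; font-family: Arial;">To make the descriptive side of statistics concrete, here is a minimal Python sketch of the summary measures mentioned in the post above (average, maximum, minimum, and median). The prices are made-up numbers for an imaginary stock, not real data, and the names "prices" and "summary" are just illustrative choices.</span>

```python
import statistics

# Hypothetical closing prices for one stock over ten trading days.
# These are invented numbers, used only to illustrate the calculations.
prices = [41.2, 40.8, 42.5, 43.1, 42.9, 44.0, 43.6, 45.2, 44.8, 46.1]

# Descriptive statistics: we have ALL the data in hand and simply summarize it.
summary = {
    "mean": statistics.mean(prices),      # add everything up, divide by the count
    "median": statistics.median(prices),  # middle value (average of the two middle ones here)
    "minimum": min(prices),
    "maximum": max(prices),
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

<span style="color: blue; font-family: Arial;">Because all ten prices are in front of us, this is descriptive statistics: nothing is being inferred about prices we haven't seen.</span>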