6.SP.5c Part 1 Transcript
This is Common Core State Standards Support Video in Mathematics. The standard is 6.SP.5c. The standard reads: Summarize numerical data sets in relation to context, such as by: Part c states: Giving quantitative measures of center (median and/or mean) and variability (interquartile range and/or mean absolute deviation), as well as describing any overall pattern and any striking deviations from the overall pattern with reference to the context in which the data were gathered. The ideas, the concepts of median and mean, are probably the most familiar to us. So we need to review the two items that deal with variability, mean absolute deviation and interquartile range. These two concepts are probably the more difficult, so let's go over those. First, mean absolute deviation.
It sounds complicated, but it really is a pretty simple concept. It's simply the average distance of all of the data points from the mean or the average of the data. So let's look at this sample. You have 4 data points. Maybe these are test grades; maybe they're the high temperatures of the last 4 days. Again it can fit a lot of different contexts. Now when we take these 4 data values and we do the average, it comes out to be 81. So now the mean absolute deviation is just how far my data points are away from the average of 81. The first step is to take the difference of each of those data points from the mean, which in this case is 81.
Now we have a little bit of a problem here. We have some negative values, and we have some positive values. We are only concerned with how far away each data point is from the mean, regardless of direction. So for example, here the 78 is 3 away from 81, and so is 84. The difference is that one is in one direction, and one is in the opposite direction. So we have 81, and one of them is 3 away in that direction. The other one is 3 away in this direction. But they're still 3 away. So the way to adjust for that is to just deal with the absolute value. In other words, all we need to do is take our values and just make them all positive. Now we take these 4 values that constitute the distances from the mean and then just do what we typically would do for an average. We add all of these up, and then divide, in this case, by 4. So we get a mean absolute deviation of 2.
Let's try a second set of data. Now this data was set up deliberately so that we have the exact same mean. We have a mean of 81, but this time the differences from the mean are quite large. Our differences are -19, -8, 12, and 15. Again we're not concerned with the direction, so we take the absolute value of each, the positive differences, and then when we average those 4 values, we get 13.5. That's a much larger value than we had earlier with the mean absolute deviation of 2. Notice here at the bottom of the page we've done a dot plot to kind of see where our data falls. Notice the distinction. These 4 data points here constitute our first set of data, and notice that they're all pretty close to our mean of 81. So that smaller value of 2 for the mean absolute deviation makes sense.
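The two computations above can be sketched in a few lines of Python. This is only an illustration, not part of the video: the function name is made up, the second set is rebuilt from the stated differences around the mean of 81, and the first set is partly assumed (only 78 and 84 are named; 80 and 82 are guesses that keep the mean at 81 and the result at 2).

```python
def mean_absolute_deviation(values):
    """Average distance of the data points from their mean."""
    mean = sum(values) / len(values)
    # Distance ignores direction, so take the absolute value of each difference.
    distances = [abs(v - mean) for v in values]
    return sum(distances) / len(distances)

# Assumed first set: only 78 and 84 come from the transcript.
first_set = [78, 80, 82, 84]
# Second set rebuilt from the stated differences -19, -8, 12, 15 around 81.
second_set = [81 - 19, 81 - 8, 81 + 12, 81 + 15]

print(mean_absolute_deviation(first_set))   # 2.0
print(mean_absolute_deviation(second_set))  # 13.5
```

Both sets have the same mean, so the only thing that separates them is how far the points sit from it.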
Now look at the other data points and notice that they are much more widely dispersed. They're pretty scattered, which corresponds to the higher mean absolute deviation value of 13.5. So when you're dealing with the mean absolute deviation, the closer your value gets to 0, the more densely packed your data is going to be. The larger the mean absolute deviation becomes, the more widely scattered your data points are going to be.
Now let's look at the interquartile range. The mean absolute deviation dealt with the mean. The interquartile range is going to focus on medians. So the computation for interquartile range is always done with the median, not with the mean. When you look at the word quartile, that brings to mind quarters, and that's exactly what we're dealing with here. We're dealing with your set of data being divided up into four equal parts. To get the interquartile range, we're actually dealing with the middle half of the data: this 25 percent and this 25 percent. Your interquartile range is simply how far it is from this median to this median.
Now a bit of caution here. When we talk about the interquartile range, one perspective of a quartile is that it refers to the subset of all the data values in each of those four parts. So that's one perspective. Another perspective of a quartile is that it refers to the cut-off values between the subsets. When we're talking about quartile 1 and we're dealing with a computation, then the reference is to that value. For quartile 3, the reference is to that value. But if we're looking at it from a data perspective, it deals with the data that falls in the third quartile.
It might be a little bit easier to understand if we go ahead and look at a set of data and determine what the interquartile range would be. Now since we are dealing with medians, the first thing that we have to do is to ensure that the data is listed in order from least to greatest. In this case, that hasn't been done yet, so that's the first thing that we need to do. Now notice here, there's a little formula. It's a nice little easy way to remember where your median is going to be. Your middle term is simply going to be however many there are in your data set, add 1, and divide by 2. So in this case, our number of data values is 14. Add that to 1. That would be 15 ÷ 2; that's 7 1/2. So in this case, I know that my median wouldn't be a value of 7 1/2; it would fall halfway between the seventh and the eighth terms. So let's determine our initial median, the median for our whole data set. So we put them in order, and we know that the median is going to fall halfway between the seventh and the eighth terms, which would be here. Our seventh term is 16; our eighth term is 17. Halfway between, the average of those two, would be 16.5. So that's our median for the whole data set.
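Here is a minimal sketch of that "add 1 and divide by 2" rule in Python. The function names are made up for illustration, and because the full 14-value data set is not in the transcript, the example only uses the seventh and eighth terms that are quoted (16 and 17).

```python
def median_position(n):
    """Position of the middle term in a sorted list of n values: (n + 1) / 2."""
    return (n + 1) / 2

def median_of_sorted(sorted_values):
    """Median of an already sorted list, found through the (n + 1) / 2 position."""
    pos = median_position(len(sorted_values))
    lower = sorted_values[int(pos) - 1]   # positions are 1-based, indexes 0-based
    if pos == int(pos):
        return lower                      # odd count: the middle term itself
    upper = sorted_values[int(pos)]
    return (lower + upper) / 2            # even count: halfway between two terms

print(median_position(14))         # 7.5, so halfway between terms 7 and 8
print(median_position(9))          # 5.0, so the fifth term
print(median_of_sorted([16, 17]))  # 16.5
```

A position that ends in .5 is the signal that the median falls between two terms rather than on one of them.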
Now we have to determine the medians of the lower half and the upper half. Again here's our lower half, and so I need to determine what my median is for that data set. Then over here, the upper half, we need to determine what that median is. So in this case, we have seven values in the lower half and likewise in the upper half. So 7 + 1 = 8, and 8 ÷ 2 is 4. So I know that the fourth term will be the median for each of the halves. So that's the fourth term there. Let's see, 1, 2, 3, that's the fourth term there. Now notice that in this data set the values for your third and first quartiles actually were values that are part of the data set. Sometimes that'll happen. Sometimes it won't. Notice that the median for the whole data set was 16.5, which is not one of your data values. So now we have to subtract. We take the value for the third quartile, the median of the upper half of the data, which is 24, and we subtract the value for the first quartile, you know, the median for the lower half of the data, which is 9. So 24 - 9 = 15. So our interquartile range is 15 for this. Now notice that this is not a box plot. Okay. This is just a little visual explanation, a visual representation of where our data falls in each of the quartiles.
Let's do a second set of data. Now here we have an odd number of data values. This time we have nine data values. Again the first thing that we have to do, since we are dealing with the median, is ensure that the data is in order from least to greatest. So we put them in order from least to greatest. Now, 9 + 1 is 10, and 10 ÷ 2 is 5. So I know that our middle term is going to be the fifth term, which in this case is going to be a 15. Then we have to find the median of the lower half of the data and the upper half of the data. We have four values in each half, so we know that 4 + 1 = 5, divided by 2, that's 2 1/2. I know that my median is going to fall halfway between the second and the third data points, which for the lower half would be 8, and for the upper half that would be a 17. Then to get our interquartile range, we simply take the difference between those two values, the 17 and the 8, and so we get an interquartile range of 9. So that is basically again the range of the middle half of your data. Notice here that those two values, the 8 and the 17, were not data points. Again there's no set rule; it just depends on where the data falls as to whether the values for the first and the third quartiles are data points or not. Again this is not a box plot; it's just a visual representation of our data in this case. Now a little bit of caution here. The way that we've been doing the medians here is something called the M and M method, for Moore and McCabe. The primary reason for that is because a lot of the calculators are programmed to do it this way.
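The steps walked through for both data sets can be sketched as follows. This is an illustration under stated assumptions: the function name is made up, and since the actual data values are not in the transcript, both lists are invented so that their quartiles land on the numbers quoted (9, 16.5, 24 for the 14 values; 8, 15, 17 for the nine values).

```python
from statistics import median

def quartiles_excluding_median(values):
    """Return (Q1, median, Q3), leaving the middle term out of both halves
    when the count is odd, as in the walkthrough above."""
    data = sorted(values)            # medians need ordered data
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, skip the middle term itself.
    upper = data[half + 1:] if len(data) % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)

# Invented values, chosen only so the quartiles match the ones quoted above.
fourteen_values = [3, 5, 7, 9, 11, 14, 16, 17, 19, 21, 24, 27, 30, 33]
nine_values = [12, 5, 20, 16, 9, 15, 7, 18, 16]

for data in (fourteen_values, nine_values):
    q1, med, q3 = quartiles_excluding_median(data)
    print("Q1 =", q1, "median =", med, "Q3 =", q3, "IQR =", q3 - q1)
```

For these invented lists the interquartile ranges come out to 15 and 9, matching the two worked examples.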
Now there are other ways of getting your medians for your interquartile range. Look at the second box, and this is the exact same data set. There's a second method called Tukey's method, developed by John Tukey. The difference is that when you have an odd number of data points, for this method, the median, the middle term for the odd number of data points, is actually counted for both the upper and lower halves of your data. So you're actually counting that data point twice, and if you look at the data, that is going to make a difference in a lot of cases as to where your medians are going to fall for the upper half of your data and the lower half of your data. Again it's a different approach.
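To see how counting the middle term twice can move the quartiles, here is a small sketch. The function name and the nine data values are made up for illustration (the values in the video's second box are not in the transcript); the list is built so that leaving the median out gives quartiles of 8 and 17.

```python
from statistics import median

def quartiles_including_median(values):
    """Return (Q1, Q3), counting the middle term in BOTH halves when the
    number of data points is odd, as described for Tukey's method."""
    data = sorted(values)
    half = len(data) // 2
    if len(data) % 2 == 1:
        lower, upper = data[:half + 1], data[half:]   # middle term in both
    else:
        lower, upper = data[:half], data[half:]       # even count: plain halves
    return median(lower), median(upper)

nine_values = [5, 7, 9, 12, 15, 16, 16, 18, 20]  # invented, median is 15

q1, q3 = quartiles_including_median(nine_values)
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)  # Q1 = 9 Q3 = 16 IQR = 7
```

With the middle term left out of both halves, the same list gives quartiles of 8 and 17 and an interquartile range of 9, so the two approaches disagree by 2 here.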
Not a whole lot of difference, but it is different. These three methods here, Mendenhall and Sincich, Minitab, and Freund and Perles, are other methods of determining your medians for your interquartile range. Those are based on statistical software packages; the Freund and Perles method is the one used with Microsoft Excel. So don't be surprised if your students are using their computers or something off of the Internet to do the computations. They can end up with different values simply because of the differences in the software. Again use some caution here. There are different approaches. There are different methods to actually determine the values for your medians that will then be used to determine your interquartile range.
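As one concrete case of software giving different answers, Python's standard library function `statistics.quantiles` accepts a `method` argument with two settings. The sketch below runs both on the same made-up nine values. It is not meant to reproduce any of the named methods exactly; it only shows that one tool can report two different interquartile ranges for the same data.

```python
from statistics import quantiles

data = [5, 7, 9, 12, 15, 16, 16, 18, 20]  # invented values, nine data points

results = {}
for method in ("exclusive", "inclusive"):
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = quantiles(data, n=4, method=method)
    results[method] = (q1, q2, q3)
    print(method, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```

On this list the "exclusive" setting gives quartiles of 8 and 17, while "inclusive" gives 9 and 16, so a student's answer can depend on a default they never chose.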
6.SP.5c Part 2 Transcript
Okay let's go through some of the finer details on how to do a box plot. Now how to do a box plot is actually part of an earlier standard. That's back in standard 6.SP.4, which states: Display numerical data in plots on a number line, including dot plots, histograms, and box plots. Now notice here that I've actually got two different plots. I've got a dot plot and the beginnings of a box plot. Normally this doesn't happen, but we're doing this just to become familiar with the different plots. Again you would typically just do one or the other. Here we're overlapping one on top of the other. Now what happens with your box plot is that the rectangles for your middle quartiles, your second and third quartiles, are determined by your values for your first and third quartiles. So notice here that this segment corresponds to the 17. This segment over here corresponds to the 8. Then of course, this here is the median for the whole data set. So that's what I used to draw those two rectangles.
Now there's also another piece to this. The box plot is sometimes referred to as a box and whiskers plot. So now I know that there are some segments that are drawn on each side. The question is, well, how far do I take those line segments? So here's how it is determined where to draw those whiskers. It is based on outliers. Now we have what we call mild or suspect outliers, and that's based on 1 1/2 times your interquartile range. Then we also have what is referred to as extreme or highly suspect outliers, and those values are based on 3 times the interquartile range. So let's take this data set and determine our outliers. First for our mild outliers, we're going to take 1 1/2 times our interquartile range, which in this case was 9, the 17 - 8. So if we multiply 1.5 times 9, we get 13.5.
Now those boundaries are referred to as fences, so to get the fence at the lower end, we would start at 8 and go to the left 13.5. So what's going to happen is I'm actually going to be at -5.5. So I'm actually going to be out here, and then for the upper fence, the fence on the right, we're going to start at 17 and go to the right 13.5, and that would be 30.5, which would be way out here somewhere. So notice that none of my data points are on the outside of those 2 fences; everything's on the inside. So basically, the note at the bottom says whiskers are placed based on data points to the right and left of those fences.
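The fence arithmetic above can be sketched like this. The function name and the `multiplier` parameter are made up for illustration; the quartile values 8 and 17 are the ones from the data set being discussed.

```python
def fences(q1, q3, multiplier=1.5):
    """Lower and upper fences: step out from each quartile by
    multiplier times the interquartile range."""
    iqr = q3 - q1
    step = multiplier * iqr
    return q1 - step, q3 + step

lower_fence, upper_fence = fences(8, 17)
print(lower_fence, upper_fence)  # -5.5 30.5
```

Passing `multiplier=3` instead would give the fences for extreme outliers.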
So in other words, another way of thinking of it would be that your whiskers are going to go to your data points that are not outliers. Your whiskers are going to be drawn out to the last value that is not an outlier in each direction. So over here on the right-hand side, I only go this far, because again, that's my last data point that is not an outlier. Then the same thing to the left. I go this far, because that's my last data point that is not an outlier going in the opposite direction.
Okay, let's take a closer look at outliers. There's no data set here; there are just some made-up values to kind of concentrate on the idea of outliers.
So let's say, for example, I've got my second and third quartiles here, and it goes from 18 to 22. My interquartile range is 4. For my fences for my mild outliers, that's going to be 1 1/2 x 4, which would be 6. So I'm going to start at 22 and go to the right 6, which would put me at 28. Then I'm also going to start at 18 and go to the left 6, which would put me at 12. So those are the fences for my mild outliers. My extreme outliers are based on 3 times the interquartile range, which in this case would be 3 x 4 = 12. So for my extreme outlier fences, I'm going to start at 22 and go to the right 12, which would put me at 34. Then I would start at 18 and go to the left 12, which would put me at 6. Now I just made up some data points, and we have the dots on our graph. Here our whiskers are going to go to the last values that are not outliers. So here to the left, here's my value, my last value that is not an outlier. So that's how far I take the whisker to that side, and then over here, that is my last value to the right that's not an outlier. So that's how far I take that line segment.
Notice here that I've got one value for this data set that would be considered a mild outlier. Then notice for the extreme outlier fences, I have one value way out here that would be considered an extreme outlier. In fact, I have a second one. I have another extreme outlier over here. So again, that's just a little bit more detail as far as how to compute and how to figure out your outliers and where your fences should go.
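Here is a rough sketch of sorting points into mild and extreme outliers and finding where the whiskers stop. Everything in it is illustrative: the function names are made up, the dots are invented (the video's dots are not in the transcript), and a point sitting exactly on a fence is treated as not an outlier, which is an assumption. The quartile values 18 and 22 are the ones used above.

```python
def classify(value, q1, q3):
    """Label a value using fences at 1.5 and 3 times the interquartile range."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild"
    return "not an outlier"   # assumption: a value on a fence stays inside

def whisker_ends(values, q1, q3):
    """Whiskers reach the last value that is not an outlier in each direction."""
    inside = [v for v in values if classify(v, q1, q3) == "not an outlier"]
    return min(inside), max(inside)

points = [2, 14, 17, 19, 20, 21, 25, 30, 36]  # invented dots

for p in points:
    print(p, classify(p, 18, 22))
print(whisker_ends(points, 18, 22))  # (14, 25)
```

With these invented dots, 30 is a mild outlier, 2 and 36 are extreme outliers, and the whiskers stop at 14 and 25.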
6.SP.5c Part 3 Transcript
Okay, now let's consider some contexts in which we look at the data. Now it isn't stressed in this particular standard, but it's always a good idea to do dot plots, because that really gives you a much better handle on the nature of your data. So for example here, this is a nice set of data, because there are no surprises. The mean and the median are going to be pretty representative, and in fact, they're going to be pretty close to each other. Plus your interquartile range and your mean absolute deviation are going to be fairly small. Also because there's not much of a range, there is not a whole lot of variability here as far as your data. They're pretty closely packed together, so let's look at a context here.
Let's suppose you're a teacher, and you've given a test to your students. You have a class of 28, so there are 28 test grades. All right, so I take these, and I crunch out the numbers, and I determine that the average, the mean, comes out to be 77.9. Then I make sure that my data is in order from least to greatest. I determined that the median was 78 for my whole data set. Then for the upper half, the median is 80; for the lower half, the median is 76. So for those two values, I then take the difference between the two, which is 4, so my interquartile range is 4.
Now with a bit of computation, I have to take all of my data points and figure out how far away each one of those is from the mean of 77.9. When I do all that, I get a mean absolute deviation of 2.6. Then I do my box plot, and I have overlaid that with a dot plot to see the patterns as to where my values are. I notice that it's pretty condensed; it's pretty packed. The mean and the median are very close together. My interquartile range and my mean absolute deviation values are pretty small. So it makes sense that the data looks like what it does. I even went as far as to determine the mild outlier fences. In fact, there are no outliers. There just happens to be one data point, the 70, that is sort of on the borderline. It's exactly where the fence is.
As a teacher, I would look at this and think, well, the students did fairly well. Nobody really did poorly, but the highest grade was 85, so there really wasn't much on the upper end, but they are pretty closely packed together. So as a teacher, there are some pluses and minuses here. Now let's say you gave that same class a second test. Still 28 students. But notice some of the differences in this data. I deliberately set it up so that the mean was exactly the same. The mean is 77.9 again. Our median is 84. Our values for the third and the first quartiles are 91 and 68, so our interquartile range is 23. Then again, we did all the number crunching. We took all of our individual data points and figured out how far away they are from 77.9. We get a mean absolute deviation of 13.8.
Now notice that even though the mean is the same, the data looks a lot different this time. The data is a lot more spread out. It's more dispersed. Notice that there is, you know, some difference between the median and the mean, 84 versus 77.9. But the telltale values here are the interquartile range and the mean absolute deviation. Those are much larger than in the previous data set, and that parallels what the data looks like. Again it is a lot more widely dispersed. In this case, there's a lot more at the upper end. So as a teacher, not too bad, although you know there were several students that did poorly.
Let's take a third set of test grades. Again 28 is the number of data points. But this time, our mean is different. Now it's dropped to 70.4. Our median is 90, you know, the halfway point between the 28 values. Our third and first quartile values are 97 and 41 for those two medians. That is a difference of 56. So that's our interquartile range. Then we have the average distance of all of our data points from the average of 70.4, and that's 28.3. Widely dispersed again. But this time, notice that there are not very many values that are close to the mean. There's a big difference between the mean and the median, and this time the data seems to be clustered at the two ends, the upper end and the lower end. So I guess these results tell us that the students either knew the topic very, very well or they hardly knew it at all. Again a little bit different data set here, and the telltale values are again the interquartile range and the mean absolute deviation. Those are very high values relative to the values of your data points. So the box plot and the dot plot also really help here to get a better handle on what the data looks like.
Let's take a different context. Let's say you have 20 lions, and all of the lions are weighed. Here's our data set in order from least to greatest. We figured out that the average was 388, our median is 391, and our median for the whole data set fell here. So it falls halfway between 327 and 455, which is 391. Then for the medians for the upper and lower halves of the data, we get 481 and 293. That difference is 188. So our interquartile range is 188. Then when we figure out the average of the distances of all our data points from the 388, the mean absolute deviation is 90.9. This is sort of like that last data set for our third test grade. The data is widely dispersed, and there are very few values close to the average.
There is very little difference between the mean and the median, and the data is clustered way over at the upper and lower ends. In statistics, this is often referred to as a bimodal distribution. So what could cause this? Well, from science, we know that your male adult African lions are actually quite a bit bigger than your female adult lions. So a very logical explanation for this is that we've actually got probably 10 female lions and 10 male adult lions, and that kind of explains why the data is clustered the way that it is.
Let's take that data set for our lions and adjust it just a little bit. Let's say that we knock out the biggest of the male lions. Now look what that does to our data set. Okay, now we have 19 values. Notice that there's very little difference in the mean absolute deviation. There's very little difference in the interquartile range, and there really isn't a whole lot of difference in the average weight, 382.3 compared to 388. But look at the median, 391 versus 327. It shifted way, way over to the left. But notice why. Before, it just so happened that the median fell in, you know, that expanse of numbers between the male and female lion weights. But now it has shifted to the left. So now I'm at 327, which we would assume would probably be the heaviest of the female lions.
So again, a big shift simply because of eliminating one of the male lions. So this is just an example of why not to make assumptions about your mean and median and so forth, because again, just one slight difference in your data values can make a lot of difference, especially with something like your median. You know, all it took was eliminating one of our data points, and it made a whole lot of difference in where the median fell.
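The jump in the median can be reproduced with a small sketch. The weights below are invented: only the two middle values, 327 and 455, come from the transcript, and the rest are placeholders for a cluster of lighter lions and a cluster of heavier ones, so the mean will not match the 388 quoted above.

```python
from statistics import mean, median

# Invented weights: ten lighter lions and ten heavier ones, with a gap between.
lighter = [262, 271, 280, 289, 293, 298, 305, 312, 320, 327]
heavier = [455, 462, 470, 476, 481, 488, 495, 503, 512, 530]
weights = lighter + heavier

without_biggest = sorted(weights)[:-1]   # knock out the heaviest lion

print(len(weights), median(weights), round(mean(weights), 1))
print(len(without_biggest), median(without_biggest), round(mean(without_biggest), 1))
```

With 20 values the median sits in the gap, halfway between 327 and 455, at 391. With 19 values it lands on the tenth term, 327, while the mean moves only a few pounds.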
Let's take another scenario, candy sales. Let's say you have 15 students that went out and sold a bunch of boxes of candy. Okay and here's the values. Notice that almost all the kids sold anywhere from 1 to 5 boxes. But you have 2 kids that were way out here. One sold 68 and the other one sold 74 boxes. This data set is an example of what can happen when a set of values contains extreme outliers. Notice that the primary impact was on the mean and the mean absolute deviation. When we take the mean, it comes out to be 12. So it makes it look like our kids on average sold 12 boxes of candy each, when in fact most of them really sold closer to 2 or 3. So again, the extreme outliers made a big difference with the mean.
Notice that the median is probably a lot closer to the actual values than the mean is. You also have a small interquartile range of 5 - 2, which is 3. But look at your mean absolute deviation. It comes out to 15.7. So what that's saying is that all of my data points are on average 15.7 away from the mean, which really isn't representative. I mean, mathematically that's true. But when you look at your data, wait a minute, most of my data points are not that far from the mean. But again, it's the distance from the mean, not from the median. So in cases where you have extreme outliers, it really can make a big difference on some of your values, especially on your mean and your mean absolute deviation.
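To see how two extreme values pull the mean and the mean absolute deviation, here is a sketch with a made-up candy data set. Only the 68 and the 74 come from the transcript; the other 13 values are invented so that the summary numbers match the ones quoted (15 students, mean 12, quartiles 2 and 5, mean absolute deviation 15.7).

```python
from statistics import median

# Invented sales for 13 students, plus the two big sellers from the transcript.
boxes = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 68, 74]

mean = sum(boxes) / len(boxes)
mad = sum(abs(b - mean) for b in boxes) / len(boxes)

# Same measure of center with the two extreme outliers left out.
typical = [b for b in boxes if b < 60]
typical_mean = sum(typical) / len(typical)

print("mean:", mean)                      # 12.0
print("median:", median(boxes))           # 3
print("mean absolute deviation:", round(mad, 1))         # 15.7
print("mean without 68 and 74:", round(typical_mean, 1)) # 2.9
```

The median of 3 describes what most students sold, while the mean of 12 describes almost nobody.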
So to wrap up, here are some dot plots to kind of show what could happen in different contexts. Again there's a whole lot of different contexts that you can make up as far as your data. For many of the real-life scenarios that you can come up with, the data would probably be represented by one of these scenarios. You might have a scenario like this where there's not going to be a whole lot of surprises. What's going to happen there is that it's pretty condensed. There's not going to be much of a difference between your mean and your median. You're going to have fairly small values for your interquartile range and for your mean absolute deviation.
The second set, as you can tell, is a lot more widely dispersed. Your mean and your median will probably be fairly close together, but you're going to have some large values for your mean absolute deviation and for your interquartile range, which would give you the tip-off that there is a lot of variability in your data points.
The third one is representative of what happened with the lions. It is possible to have this kind of data and, in this case, your mean would definitely be misleading. So again, that's why your visual representations of your data, something like this dot plot, would really help to see what's really going on. Then another type of scenario or possibility would be this last one here. This would be one where you might have an outlier or two or three that really might have an impact on your data.