Summarizing data can help us understand them, especially when the number of data is large. This chapter presents several ways to summarize quantitative data by a typical value (a measure of location, such as the mean, median, or mode) and a measure of how well the typical value represents the list (a measure of spread, such as the range, inter-quartile range, or standard deviation). Markov's and Chebychev's inequalities show that these summary measures can contain a surprisingly large amount of information about the data.
The farthest one can reduce a set of data, and still retain any information at all, is to summarize the data with a single value. Measures of location do just that: They try to capture with a single number what is typical of the data. What single number is most representative of an entire list of numbers? We cannot say without defining "representative" more precisely. We will study three common measures of location: the mean, the median, and the mode. The mean, median and mode are all "most representative," but for different, related notions of representativeness.
For qualitative and categorical data, the mode makes sense, but the mean and median do not. It is hard to see the connection between the mean, median, and mode from their definitions.
However, the mean, the median, and the mode are "as close as possible" to all the data: For each of these three measures of location, the sum of the distances between each datum and the measure of location is as small as it can be. The differences among the three measures of location are in how "distance" is defined.
For the mean, the distance between two numbers is defined to be the square of their difference. That is, the sum of the squares of the differences between the data and the mean is smaller than the sum of squares of the differences between the data and any other number. (Equivalently, the rms or root mean square of the differences from the mean is smaller than the rms of the list of differences from any other number—the rms is defined and discussed below.)
For the median, the distance between two numbers is defined to be the absolute value of their difference. That is, the sum of the absolute values of the differences between a median and the data is no larger than the sum of the absolute values of the differences between any other number and the data.
For the mode, the distance between two numbers is defined to be zero if the numbers are equal, and one if they are not equal. That is, the number of data that differ from a mode is no larger than the number of data that differ from any other value. Equivalently, a mode is a number from which the fewest possible data differ: a "most common" value.
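The three notions of distance can be checked numerically. Here is an illustrative sketch (the data are made up, not from the text): for a short list, the mean, the median, and a mode each minimize the corresponding sum of distances over a grid of candidate values.

```python
# Illustrative sketch: for a small list, check numerically that the mean
# minimizes the sum of squared differences, the median minimizes the sum
# of absolute differences, and a mode minimizes the number of mismatches.
data = [2, 3, 3, 5, 9]

def sum_squares(c):
    return sum((x - c) ** 2 for x in data)

def sum_abs(c):
    return sum(abs(x - c) for x in data)

def mismatches(c):
    return sum(1 for x in data if x != c)

mean = sum(data) / len(data)           # 4.4
median = sorted(data)[len(data) // 2]  # 3, the middle of the sorted list
mode = 3                               # the most frequent value

candidates = [c / 10 for c in range(0, 101)]  # 0.0, 0.1, ..., 10.0
assert all(sum_squares(mean) <= sum_squares(c) for c in candidates)
assert all(sum_abs(median) <= sum_abs(c) for c in candidates)
assert all(mismatches(mode) <= mismatches(c) for c in candidates)
```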
All three of these measures of location are examples of statistics (with a lowercase "s"): numbers computed from data.
The mean, median, and mode can be related (approximately) to the histogram: loosely speaking, the mode is the highest bump, the median is where half the area is to the right and half is to the left, and the mean is where the histogram would balance, were it a solid object cut out of a uniform block of metal. (All these heuristics are approximate, and depend on the class intervals.)
For illustration, let's compute the mean, median, and mode from the hypothetical data in the interactive example on this page. These data change when you reload the page, so you can see many examples.
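Because the data in the interactive example change on each reload, here is a fixed hypothetical list and the same three computations, sketched with Python's standard statistics module:

```python
from statistics import mean, median, mode

# A fixed hypothetical list of 11 values (made up for illustration).
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]

print(mean(data))    # 44/11 = 4
print(median(data))  # the 6th value of the sorted list: 4
print(mode(data))    # 5 occurs three times, more than any other value
```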
In general, the mean and the median need not be close together. If the data have a symmetric distribution, the mean and median are exactly equal, but if the distribution of the data is skewed, the difference between mean and the median can be large. This is because data in the tails of the distribution have a lot of leverage on the mean, just as a light person can balance a much heavier one on a teeter-totter if she sits much farther from the fulcrum than the heavier person does. The median is smaller than the mean if the data are skewed to the right, and larger than the mean if the data are skewed to the left. Because the mean is (essentially) the balance point of the histogram, a small number of data can affect it a great deal, if they are very large (positive or negative). Corrupting just one datum can make the mean arbitrarily large or small.
The median is affected much less by small subsets of the data. To make the median arbitrarily large or small, one must corrupt half the data. Corrupting just one datum changes the median by a limited amount, and not at all if one of the observations above the median is made larger, or one of the observations below the median is made smaller. Statistics that are not affected too much by small subsets of the data are resistant. The median is resistant; the mean is not.
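A quick sketch of resistance, with made-up numbers: corrupting one datum can move the mean arbitrarily far, while the median does not budge.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 5]
corrupted = [1, 2, 3, 4, 5_000_000]  # one datum corrupted upward

print(mean(data), mean(corrupted))      # 3 versus 1000002
print(median(data), median(corrupted))  # 3 versus 3: the median resists
```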
Which measure of location is the most appropriate depends on what the summary will be used for. If we primarily care about the total, the mean tends to be the most relevant, because the mean is equal to the total divided by the number of data. For example, the mean income of the individuals in a family indicates how much the family can spend on each family member's necessities of life.
Suppose we want to know how much money a family can afford to spend on housing. That depends on the total family income, which is the mean income of the family members, times the number of family members. For a family of five, consisting of two parents who work and three children with no income, the mean income, times five, is the total amount of money the family makes each year. The median income of these five family members is zero, because more than half of them make nothing: the median is not helpful information here. On the other hand, the median can be much more informative than the mean in other situations.
Now suppose we want to decide whether a country is affluent. At issue, in some sense, is whether most of the citizens have a high income. The mean family income could be quite high even if most families earn essentially nothing—if income is highly concentrated in a few very wealthy families. Then the median family income would be a more meaningful measure: At least half the families make no more than the median, and at least half make at least as much as the median.
Similarly, suppose you are applying for a job as an architect at several large firms, and you want to get an idea of how much money you might expect to be earning in five years if you join a particular firm. Consider the salaries of architects in each firm five years after they are hired. Just one very high salary could make the mean salary high, so the mean might not reflect what is typical. On the other hand, half the architects make the median salary or less, and half make the median salary or more, so the median would give you a better idea of a typical salary.
Choosing a measure of location favorable to one's point of view is a common way to mislead people with statistics. For example, suppose you are the CEO of a company that makes gizmos and gadgets. It might be in your interest to claim to your customers that you have lowered your prices, and to claim to your shareholders that you have raised your prices. Suppose that last year, you sold 100,000 gizmos at $10 each, and 1,000 gadgets at $1000 each. This year, you sold 100,000 gizmos at $8 each, and 1,000 gadgets at $1200 each.
The median price of the 101,000 items sold last year is $10, because more than half of the items sold were gizmos. The median price of the 101,000 items sold this year is $8. The mean price on the price list (without regard for the number of items sold) was $505 last year and $604 this year. The mean price of the 101,000 items sold last year is
(100,000 × $10 + 1,000 × $1,000)/101,000 = $19.80
while this year it is
(100,000 × $8 + 1,000 × $1,200)/101,000 = $19.80.
The mean price per item sold is the same in both years: the total revenue was the same, and the number of items sold was the same. The moral is that one can make data appear to tell conflicting stories by choosing a measure of location disingenuously.
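The arithmetic in the gizmo-and-gadget example can be reproduced in a few lines (a sketch of the text's own calculation):

```python
def mean_price_sold(gizmo_price, gadget_price, n_gizmos=100_000, n_gadgets=1_000):
    """Mean price per item actually sold: total revenue over total items."""
    total_revenue = n_gizmos * gizmo_price + n_gadgets * gadget_price
    return total_revenue / (n_gizmos + n_gadgets)

last_year = mean_price_sold(10, 1000)  # about $19.80
this_year = mean_price_sold(8, 1200)   # also about $19.80

# Median price of items sold: more than half the items are gizmos,
# so the median is the gizmo price ($10 last year, $8 this year).
assert last_year == this_year
```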
The following exercises check your ability to compute and to use the mean, median, and mode.
(Reminder: Examples and exercises may vary when the page is reloaded; the video shows only one version.)
Measures of location
Measures of location summarize a list of numbers by a "typical" value.
The three most common measures of location are the mean, the median, and the mode.
The mean is the sum of the values, divided by the number of values. It has the smallest possible sum of squared differences from members of the list.
The median is the middle value in the sorted list. It is the smallest number that is at least as big as at least half the values in the list. It has the smallest possible sum of absolute differences from members of the list.
The mode is the most frequent value in the list (or one of the most frequent values, if there are more than one). It differs from the fewest possible members of the list.
Measures of location summarize what is typical of elements of a list, but not every element is typical. Are all the elements close to each other? Are most of the elements close to each other? What is the biggest difference between elements? On the average, how far are the elements from each other? Measures of spread or variability tell us.
Consider three mechanical golfers (this example is from Hooke, 1983). In golf, the object is to get a low score—to take fewer strokes to complete the course. Suppose the golfers' scores are distributed as shown in the accompanying table.
The golfers' average scores are equal—nominally, they are equally skilled. However, consider what happens when they play each other. Golfer 1 beats golfer 2 when golfer 2 scores 73, which happens 75% of the time. Golfer 2 beats golfer 3 when golfer 3 scores 74, and when golfer 3 scores 70 and golfer 2 scores 69. The first occurs half the time, and, assuming that the players' scores are independent (a notion we'll get to in a later chapter), the second occurs 50% × 25% = 12.5% of the time, so golfer 2 beats golfer 3 62.5% of the time. Finally, golfer 3 beats golfer 1 when golfer 3 scores 70, namely, 50% of the time (they play evenly). Their average scores are equal, but 1 beats 2 more often than not, 2 beats 3 more often than not, and 3 plays 1 even. This shows that there is more going on than the average scores indicate: variability matters too.
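The head-to-head percentages can be checked by direct enumeration. The score distributions below are reconstructed from the reasoning in the paragraph above (golfer 1 always scores 72; golfer 2 scores 73 with chance 75% and 69 with chance 25%; golfer 3 scores 74 or 70 with chance 50% each); the scores are assumed independent.

```python
from itertools import product

# (score, probability) pairs, reconstructed from the text's reasoning.
golfer1 = [(72, 1.00)]
golfer2 = [(73, 0.75), (69, 0.25)]
golfer3 = [(74, 0.50), (70, 0.50)]

def expected_score(golfer):
    return sum(score * p for score, p in golfer)

def win_prob(a, b):
    """Chance that a beats b (lower score wins), assuming independence."""
    return sum(pa * pb for (sa, pa), (sb, pb) in product(a, b) if sa < sb)

# All three golfers average 72 strokes...
assert expected_score(golfer1) == expected_score(golfer2) == expected_score(golfer3) == 72
# ...yet the head-to-head chances are lopsided.
print(win_prob(golfer1, golfer2))  # 0.75
print(win_prob(golfer2, golfer3))  # 0.625
print(win_prob(golfer3, golfer1))  # 0.5
```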
Here is another example of the importance of variability. The average number of children under 18 per family in the US was 0.89 according to the 1990 census, so the average family size is about 2.9 people (is this logic sound? what is a family?). If you were in the construction business, that might suggest to you that a two-bedroom home is the right size to build for the average American family (two parents sharing a room, and another room for the 0.89 children). However, family sizes vary over quite a large range; indeed, the same report shows that the average number of children for families that have children is 1.86, so families that have children would tend to need a three bedroom home, rather than a two bedroom home, if the children are to have their own rooms.
Much information is lost in reducing a list of numbers to a single summary number, such as the mean or median. Measures of location alone are not very informative. For example, the three histograms in the figure on this page all correspond to sets of data with means and medians equal to zero.
We need more than just the mean or median to tell these distributions apart. In the first histogram, the data cluster both in the middle and at the ends. In the second, the data are more concentrated near the middle—there is much less spread than in the first. The third is extremely concentrated: The data are much closer to each other than in the other two examples. Measures of spread or variability summarize with a single number whether the observations tend to cluster near the center of the distribution, or how spread out they are. If the spread is small, most of the data are nearly equal; if the spread is large, there are large differences among the data.
The three most common measures of spread or variability are the range, the interquartile range (IQR), and the standard deviation (SD).
The range of a list is the largest value minus the smallest value. It is the width of the smallest interval that contains all the data, so it measures spread. It is not resistant, because changing just one datum can make it arbitrarily large.
The IQR is the upper quartile (75th percentile), minus the lower quartile (25th percentile). It is the width of the interval that contains the middle 50% of the data—and thus is a measure of spread. It is insensitive to the most extreme values of the data (assuming that there are more than four data). The IQR is resistant: changing just one datum has a limited effect on it. Note that neither the range nor the IQR is a range of numbers, despite their names—each is a single number.
The rms (root mean square) of a list measures the average size of its entries. It is defined as follows:
rms = square-root( (sum of the squares of the entries)/(number of entries) )
= [ (sum of squares of the entries)/(number of entries) ]^{½}.
(Recall that a number raised to the one-half power is the square-root of the number; this is the notation we shall use from now on. If you need to review exponentiation, see Assignment 0.)
In computing the rms, we divide by the number of entries before taking the square-root. What difference does it make to square the entries? Squaring them makes every term in the sum positive, so positive and negative entries do not cancel. If we ignored the square and the square root, we would just have the mean of the list, which could be zero, even if all the numbers were large in magnitude, because positive and negative entries could cancel. Squaring the entries before averaging them prevents cancellations.
The rms is not the only measure of the average size of the elements of a list; for example, the average absolute value of the terms is another measure of the typical size of elements in a list. The rms is used more often. The interactive example on this page illustrates calculating the rms of a list; the example will change when you reload the page.
The example makes it clear that the mean of the squares of the elements of a list is not generally equal to the square of the mean of the elements of the list: the square of the mean is 0, but the mean of the squares is not.
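A minimal sketch of the rms calculation, with a made-up list whose mean is zero, so the cancellation is easy to see:

```python
def rms(values):
    """Root mean square: average the squares, then take the square root."""
    return (sum(v * v for v in values) / len(values)) ** 0.5

data = [3, -5, 1, 1]              # the entries cancel: the mean is 0
print(sum(data) / len(data))      # 0.0, even though the entries are not small
print(rms(data))                  # ((9 + 25 + 1 + 1)/4) ** 0.5 = 9 ** 0.5 = 3.0
```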
The rms of a list is zero if and only if all the entries in the list are zero.
The standard deviation (SD) of a list is the "typical size" of the difference between elements of the list and the mean of the list, measured by the rms. The SD measures how spread out the data are around their mean. To find the SD, we first find the mean of the list, then make a list of deviations from the mean:
deviation of value = value − mean of list,
and finally, find the rms of the list of deviations from the mean (the square-root of the average of the squares of the deviations). In the example just given, the mean is zero, so the SD is equal to the rms.
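The recipe (find the mean, form the list of deviations, take their rms) translates directly into code; the data here are made up for illustration:

```python
def sd(values):
    """Standard deviation: the rms of the deviations from the mean."""
    m = sum(values) / len(values)
    deviations = [v - m for v in values]
    return (sum(d * d for d in deviations) / len(deviations)) ** 0.5

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5; squared deviations sum to 32
print(sd(data))                   # (32/8) ** 0.5 = 2.0
```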
The next example is slightly more complicated. The data in the example will change when you reload the page.
The units of the SD are the same as the original units of measurement. For example, if the list consists of measurements of heights in inches, the SD has units of inches. Recall that the rms of a list is zero if and only if all the elements in the list are zero. Thus the SD of a list is zero if and only if all the deviations from the mean are zero, that is, if and only if all the elements are equal to each other (and hence equal to their mean). Similarly, the range of a list is zero if and only if all the elements are equal. In contrast, the IQR of a list can be zero even if not all the elements are the same—only the middle 50% of the observations need to be equal for the IQR to be zero.
For reference, the SDs of the data plotted in the three histograms above are 1.66, 1.15, and 0, respectively. Does that reflect how spread out they appear to be?
Some calculators have a button labeled s, which computes something related to the SD as we have defined it. In the usual definition of s, the sum of squares of residuals from the mean is divided by (number of data −1) rather than by (number of data) before taking the square-root. This is called the sample standard deviation. When the number of data is large, there is not much difference between the standard deviation and the sample standard deviation, but when the number of data is small, the difference can be big.
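Python's standard library exposes both versions, which makes the distinction easy to check (the data are made up):

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]

# pstdev divides by n: the SD as defined in this chapter.
# stdev divides by n - 1: the "sample standard deviation" s.
print(pstdev(data))  # (32/8) ** 0.5 = 2.0
print(stdev(data))   # (32/7) ** 0.5, about 2.14
```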
The following exercises check that you can calculate measures of spread, and that you understand what they mean.
Measures of Spread
Measures of spread summarize how much members of a list of numbers differ from each other.
The three most common measures of spread are the range, the inter-quartile range, and the standard deviation.
The range is the largest element of the list, minus the smallest element of the list: the maximum difference between elements of the list. It is sensitive only to the most extreme values in the list. The range of a list is zero if and only if all the elements of the list are equal.
The inter-quartile range (IQR) is the upper quartile of the list (75th percentile) minus the lower quartile of the list (25th percentile). It measures the width of the interval that contains the middle 50% of the data. It is not sensitive to the extreme values of the list. The IQR of a list is zero if (at least) the middle 50% of the values are equal.
The standard deviation (SD) is the average distance from the data to their mean (the rms of the deviations of the data from their mean). It depends on the values of all the data. The SD of a list is zero if and only if all the elements in the list are equal (to each other, and hence to their mean).
The figure on this page shows the histogram tool again, this time with a new Univariate Stats button. Clicking the button will open a new window that lists the number of data, their mean, standard deviation (SD), minimum (Min), lower quartile (LQ), Median, upper quartile (UQ), and maximum (Max).
Some variables have simple relationships to other variables, for example, measurements of elevation above sea level in feet, and measurements of elevation above sea level in meters: Each elevation in meters above sea level is 0.3048 times the corresponding elevation in feet above sea level. When the relationship between variables is simple, so is the relationship between their measures of location and spread. An affine transformation or change of variables is particularly simple. Affine transformations have the equation of a line:
(transformed value of x) = a × (original value of x) + b,
where a and b are constants. (Some books call this a linear transformation, because it has the equation of a straight line.) For example, height in inches is related to height in feet by an affine transformation, with a = 12 and b = 0:
(height in inches) = 12 × (height in feet) + 0.
Similarly, temperature in degrees Fahrenheit is related to temperature in degrees Centigrade by an affine transformation with a = 9/5 and b = 32:
(temp in ^{°}F) = 9/5 × (temp in ^{°}C) + 32.
Currencies are related to each other by affine transformations as well, with a = (exchange rate) and b = 0.
The measures of location and spread introduced in this chapter behave quite regularly when a list is transformed by an affine transformation.
How Measures of Location and Spread behave under Affine Transformations
If a list is transformed so that
(transformed value) = a × (original value) + b,
then

mean of transformed list = a × (mean of original list) + b,

median of transformed list = a × (median of original list) + b,

mode of transformed list = a × (mode of original list) + b,

SD of transformed list = |a| × (SD of original list),

range of transformed list = |a| × (range of original list), and

IQR of transformed list = |a| × (IQR of original list).

The median of the transformed list can differ slightly from a × (median of original list) + b when a is negative; similarly, the IQR of the transformed list can differ slightly from |a| × (IQR of original list) if a is negative, because of the definition of percentiles applied to a list with its signs reversed. Some of these relations are derived in a footnote.
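These rules can be verified numerically, for instance with the Centigrade-to-Fahrenheit conversion (a = 9/5, b = 32) applied to a made-up list of temperatures:

```python
from statistics import mean, median, pstdev

a, b = 9 / 5, 32
celsius = [0, 10, 20, 25, 30, 100]
fahrenheit = [a * c + b for c in celsius]   # roughly [32, 50, 68, 77, 86, 212]

# Measures of location transform like the data themselves...
assert abs(mean(fahrenheit) - (a * mean(celsius) + b)) < 1e-9
assert abs(median(fahrenheit) - (a * median(celsius) + b)) < 1e-9
# ...while measures of spread scale by |a| and ignore the shift b.
assert abs(pstdev(fahrenheit) - abs(a) * pstdev(celsius)) < 1e-9
assert abs((max(fahrenheit) - min(fahrenheit)) - abs(a) * (max(celsius) - min(celsius))) < 1e-9
```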
Using these relations can simplify calculating measures of location or spread when the units of measurement are changed. The following exercise checks your ability to use these rules.
Measures of location and spread can tell us a great deal about lists of numbers. For example, for any list, at least half the numbers in the list are no larger than the median, and at least half the numbers in the list are at least as large as the median (this is one way of defining the median). The mean and SD also can tell us about the fractions of values in a list in various ranges.
Suppose that a list of numbers contains no negative number, and that 10% of the values in the list are greater than or equal to 50. What is the smallest the mean of the list could be? The mean would be smallest if all the values in the list were as small as they could be, subject to the constraints that the values were not negative, and 10% equal or exceed 50. If 90% of the values were equal to zero, and the rest were equal to 50, that would give the smallest mean:
0 × 0.9 + 50 × 0.1 = 5.
That is, if a list contains no negative number, and 10% of the numbers in the list are 50 or larger, then the mean of the list must be at least 5. More generally, if any particular fraction of values in a list exceeds a given threshold, and none of the values in the list is negative, then the mean of the list cannot be arbitrarily small. Markov's inequality turns this idea upside down to limit the fraction of numbers in a list that can exceed any given threshold, provided the list contains no negative number. The limit depends on the mean of the list and on the threshold.
Markov's Inequality (for lists)
If the mean of a list of numbers is M, and the list contains no negative number, then
[fraction of numbers in the list that are greater than or equal to x] ≤ M/x.
A heuristic derivation of Markov's inequality is in a note.
There are 200 students in a class. The average amount of money in their pockets is $15. How many could have $75 or more in their pockets?
Solution. No student can have a negative amount of money in his or her pocket, so Markov's inequality applies. Markov's inequality guarantees that
[fraction of students with at least $75 in their pockets] ≤ $15/$75 = 0.2 = 20%.
Thus at most 20% of the students (40 students) could have $75 or more in their pockets.
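A sketch of the pocket-money bound, together with an extreme list showing that the bound can actually be attained:

```python
n_students, mean_money, threshold = 200, 15, 75

bound = mean_money / threshold          # Markov: at most 15/75 = 20%
print(bound, bound * n_students)        # 0.2, i.e. at most 40 students

# The bound is attained by an extreme list: 40 students carry exactly $75
# and the other 160 carry nothing, so the mean is still $15.
extreme = [75] * 40 + [0] * 160
assert sum(extreme) / len(extreme) == mean_money
```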
If we know the mean of a list and its SD, we know something about how many of the numbers in the list must be in various ranges. Suppose that 25% of the numbers in a list differ from the mean by 30 or more. How small could the SD of the list be? To make the SD smallest, all the numbers should be as close as possible to the mean, subject to the constraint that at least 25% of them differ from the mean by 30 or more. This is achieved by making 75% of the numbers equal to the mean, 12.5% equal to the mean minus 30, and 12.5% equal to the mean plus 30. Thus the SD of the list must be at least
( 0.125 × 30^{2} + 0.75 × 0^{2} + 0.125 × 30^{2} )^{½} = 15.
More generally, if a particular fraction of the values differ from the mean of the list by at least a given threshold, then the SD of the list cannot be too small. Chebychev's inequality turns this around to find a bound on the fraction of numbers in the list that differ from the mean by more than any given threshold. The bound depends on the SD of the list and the threshold.
Chebychev's inequality (for lists)
If the mean of a list of numbers is M and the standard deviation of the list is SD, then for every positive number k,
[the fraction of numbers in the list that are k×SD or further from M] ≤ 1/k^{2}.
A heuristic derivation of Chebychev's inequality is in a note. Chebychev's inequality says that not too many of the numbers in a list can be far from the mean, where far is measured in standard deviations. Conversely, if a large fraction of the values are far from the mean, the SD of the list must be large.
Here are some specific bounds implied by Chebychev's inequality:

k (SDs from the mean)    fraction of the list that is k×SD or further from M
2                        at most 1/4 = 25%
3                        at most 1/9 ≈ 11.1%
4                        at most 1/16 ≈ 6.25%
5                        at most 1/25 = 4%
The next example illustrates applying Chebychev's inequality to find bounds on the fraction of weights in a given range from the mean and SD of a list of weights.
The mean weight of students in a certain class of students is 140 lbs., and the SD of their weights is 30 lbs. What fraction weighs between 90 lbs. and 190 lbs.?
Solution. We cannot get an exact answer, but we can get a lower bound using Chebychev's inequality. The range from 90 lbs. to 190 lbs. is the mean, plus or minus 50 lbs. 50 lbs. is 1 2/3 times the SD of the weights, so according to Chebychev's inequality, the fraction of students who weigh less than 90 lbs. or more than 190 lbs. is at most
1/(1 2/3)^{2} = 1/(1.6667)^{2} = 0.36 = 36%.
Thus the fraction who weigh between 90 lbs. and 190 lbs. is at least 100% − 36% = 64%.
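The same calculation in code (the interval endpoints are the mean plus or minus 50 lbs):

```python
mean_w, sd_w = 140, 30          # lbs
half_width = 50                 # 90 lbs to 190 lbs is the mean, give or take 50 lbs

k = half_width / sd_w           # 50/30 = 5/3 SDs
outside = 1 / k ** 2            # Chebychev: at most 36% outside [90, 190]
inside = 1 - outside            # so at least 64% inside

print(round(outside, 4), round(inside, 4))  # 0.36 0.64
```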
In some problems, it is possible to apply both Markov's inequality and Chebychev's inequality. When that happens, use whichever inequality gives the more precise answer—that is, the inequality that limits the fraction most stringently. The following example illustrates this idea.
On the average, it takes 45 minutes to cross the San Francisco Bay Bridge during rush hour. The SD of the time it takes to cross the bridge is 15 minutes. What's the largest fraction of trips for which it could take more than 2 hours to cross the bridge?
Solution. Travel time is positive, so we can use Markov's inequality. By Markov's inequality,
[fraction of trips for which it takes more than 2 hours] ≤ (45 minutes)/(2 hours) = (45 minutes)/(120 minutes) = 0.375 = 37.5%.
On the other hand, we can also apply Chebychev's inequality, as follows.
2 hours = 120 minutes = 45 minutes + 75 minutes = mean time + 75 minutes = mean time + 5SD.
That is, two hours is 5SD above the mean. On the other hand, 5SD below the mean is
45 minutes − 5×(15 minutes) = −30 minutes.
This is not a possible travel time (it always takes a positive amount of time to cross the bridge). Thus the fraction of trips for which it takes more than 2 hours or less than −30 minutes to cross the bridge must equal the fraction of trips for which it takes more than 2 hours to cross the bridge. By Chebychev's inequality,
[fraction of trips for which it takes less than −30 minutes or more than 2 hours] ≤ 1/5^{2} = 1/25 = 4%.
Because the fraction of trips for which it takes more than 2 hours or less than −30 minutes to cross the bridge is the same as the fraction for which it takes more than 2 hours, we have
[fraction of trips for which it takes more than 2 hours] ≤ 4%.
This is a more restrictive bound than the one Markov's inequality gives in this problem (Markov's inequality gave 37.5%) so we should use it instead. (Larger lower bounds are better; smaller upper bounds are better.)
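Putting both bounds side by side for the bridge example:

```python
mean_t, sd_t = 45, 15           # minutes
threshold = 120                 # two hours, in minutes

markov_bound = mean_t / threshold         # 45/120 = 37.5%
k = (threshold - mean_t) / sd_t           # two hours is 5 SDs above the mean
chebychev_bound = 1 / k ** 2              # 1/25 = 4%

# Use whichever upper bound is more stringent.
best = min(markov_bound, chebychev_bound)
print(markov_bound, chebychev_bound, best)  # 0.375 0.04 0.04
```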
The following exercises check your ability to apply Markov's inequality and Chebychev's inequality.
This chapter introduced several ways to summarize lists of numbers, quantitative data. Some summaries, measures of location, seek to be as close as possible to every element of the list—to typify the elements. The mean, median, and mode are examples: They represent typical values of the list. The mean, median, and mode each are "as close as possible" to all the elements in the list, for different definitions of the proximity of two numbers: for the mean, the distance between two numbers is the square of their difference; for the median, the distance between two numbers is the absolute value of their difference; and for the mode, the distance between two numbers is 1 if the numbers differ, 0 if they are equal. The mean is the sum of the elements, divided by the number of elements. The median is the smallest element that is at least as large as at least half the elements. The mode is the most common value in the list. The mode makes sense for qualitative and categorical data as well as quantitative data, but the mean and median make sense only for quantitative data. The mean, median, and mode differ in their sensitivity to changes to the data, or resistance. A statistic that can be changed arbitrarily by altering a single datum is not resistant. The median is resistant. The mean is not resistant. The resistance of the mode depends on the distribution of values in the list.
The rms (root mean square) measures the average size of the elements of a list, without regard to their signs. The rms is not resistant. Other summaries, measures of spread, reflect how the values of the list differ from each other. Examples include the range, the SD (standard deviation), and the IQR (inter-quartile range). The range of a list of numbers is the largest number minus the smallest number. The range is zero if and only if all the numbers in the list are equal. The range is not resistant. The SD measures the average size of the differences between the mean and the elements of the list: It is the rms of the list of deviations from the mean. The SD of a list is zero if and only if all the numbers in the list are equal. The SD is not resistant. The IQR is the upper quartile minus the lower quartile. It is the width of an interval that contains the middle half of the data—25% below the median and 25% above the median. The IQR can be zero even if not all the numbers are equal, but the middle 50% must be equal. The IQR is resistant. If the units of measurement change by an affine transformation, measures of location and spread in the new units of measurement have simple relationships to their values in the old units.
Measures of location and spread contain a surprising amount of information about lists of numbers: Markov's inequality limits the fraction of elements of the list that exceed any given threshold, in terms of the mean of the list and the threshold, provided the list contains no negative number. Chebychev's inequality limits the fraction of elements whose difference from the mean of the list exceeds any given threshold, in terms of the SD of the list and the threshold.