In the previous chapter we studied ways of collecting data about subsets of a population to estimate parameters of the population. That chapter asserted that the error in estimating a parameter from a statistic computed from a probability sample can be quantified, while the error in estimating a parameter from other kinds of samples generally cannot be determined. This chapter studies sampling quantitatively, focusing on the error in estimating a parameter using a statistic computed from a simple random sample: a sample of size n drawn from a population of N units in such a way that each of the _{N}C_{n} subsets of size n is equally likely to result. We assume that the frame is identical to the population. After introducing measures that summarize the accuracy of estimates, we use results from earlier chapters to calculate various measures of the error of estimating the population mean or population percentage of a box of numbered tickets using the sample mean or sample percentage of a simple random sample. The error summaries are expressed in terms of the mean and SD of the numbers on the tickets in the box, along with the sample size n and the population size N. Results similar to those in this chapter exist for other probability sampling designs (like those described in the previous chapter), but this text discusses only simple random sampling and random sampling with replacement. For more details and references, see Kish (1965).
An estimator is an assignment of a number (the estimate of the parameter) to each possible random sample of size n from the population. For example, the sample mean assigns to each sample of size n the average of the n values in the sample. The rule that assigns values to samples is called the estimator, and the value that is assigned to any particular random sample is the estimate. An estimator is a special case of a statistic, a number computed from a sample.
Because the value of the estimator depends on the sample, the estimator is a random variable, and the estimate typically will not equal the value of the population parameter. We need to understand how the value of the estimator varies with different possible samples, to be able to say how close or far from the parameter the estimator is likely to be. We shall use our characterization of random variables in terms of their expected value and chance variability or sampling error to study how estimators behave.
We can write the value the estimator takes for a particular random sample as the sum of three terms: the parameter we seek to estimate, systematic bias, and chance variability:
estimator = parameter + bias + chance variability.
The chance variability (also called sampling error in this context) reflects the luck of the draw—which particular units happened to be in the sample. The bias is a systematic difference between the value the estimator takes, and the value of the parameter—a tendency for the estimator to be too high or too low on the average. The bias is defined to be
bias = E(estimator) − parameter.
Equivalently,
E(estimator) = parameter + bias.
The bias is the difference between the expected value of the estimator and the true value of the parameter. If the bias of an estimator of a parameter is zero, the estimator is said to be unbiased: Its expected value equals the value of the parameter it estimates. Otherwise, the estimator is said to be biased.
Because the expected value of the difference between the estimator and the parameter is the bias, the expressions above imply that the expected value of the chance variability is zero:
E(chance variability) = 0.
The average of the chance variability across all _{N}C_{n} samples of size n is zero: In the long run in repeated sampling, the chance variability tends to average out.
The typical "size" of the chance variability is the standard error (SE) of the estimator—the SE of the estimator is the square-root of the expected value of the square of the chance variability:
SE(estimator) = ( E( (estimator − E(estimator))^{2} ) )^{½}
= ( E( (chance variability of estimator)^{2} ) )^{½}.
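The decomposition into parameter, bias, and chance variability can be checked by brute force for a small box. The sketch below is only an illustration: the box contents and the sample size are hypothetical, and the estimator is taken to be the sample mean. Because every one of the _{N}C_{n} samples is equally likely under simple random sampling, expected values are plain averages over all the subsets.

```python
from itertools import combinations
from math import sqrt

box = [1, 3, 4, 8, 9]   # hypothetical population of N = 5 numbered tickets
n = 2                   # hypothetical sample size

population_mean = sum(box) / len(box)   # the parameter

# Value of the estimator (the sample mean) for every possible sample.
estimates = [sum(sample) / n for sample in combinations(box, n)]

# All samples are equally likely, so E(estimator) is the plain average.
expected_value = sum(estimates) / len(estimates)
bias = expected_value - population_mean

# Chance variability of each sample: estimator minus its expected value.
chance_variability = [e - expected_value for e in estimates]
mean_chance = sum(chance_variability) / len(chance_variability)

# SE: square root of the expected squared chance variability.
se = sqrt(sum(c ** 2 for c in chance_variability) / len(chance_variability))

print("bias:", bias)
print("E(chance variability):", mean_chance)
print("SE(estimator):", se)
```

For this box the bias and the average chance variability both come out to zero, while the SE does not: individual samples still miss the parameter.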
In most problems, there is no estimator that is guaranteed to give the right answer, because the value of the estimator typically depends on the sample. The error is the difference between the estimate (the value of the estimator for a particular sample), and the true value of the parameter. That difference is the bias plus the chance variability. The bias is the long-run average difference between the parameter and the estimate if we repeatedly drew random samples of size n, calculated the value of the estimator for the sample, and subtracted the parameter from the estimate. The standard error measures the long-run average spread of the estimated values in the same hypothetical scenario.
Both bias and standard error contribute to the average size of the error of an estimator. If the bias is large, on average the estimator overshoots or undershoots the truth by a large amount. If the SE is large, the estimator typically is far from the truth, even if its average is close to the truth. A common measure of the overall error of an estimator is its mean squared error (MSE):
MSE(estimator) = E( (estimator − parameter)^{2} )
The mean-squared error is the expected value of the square of the error, the difference between the estimator and the true value of the parameter. The mean-squared error in fact can be written in terms of the bias and SE:
MSE = bias^{2} + SE^{2}.
The MSE of an unbiased estimator is the square of its standard error. The units of MSE are the squares of the units of the estimator. The square-root of the MSE, called the root mean-squared error (RMSE), is another measure of the average error of an estimator in repeated sampling. The RMSE is easier to interpret than the MSE because its units are the same as the units of the estimator.
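The identity MSE = bias^{2} + SE^{2} can be verified exactly for a biased estimator. The sketch below is an illustration under assumptions that are not part of the text: a hypothetical five-ticket box, and the sample maximum used as an estimator of the population maximum (it can never overshoot, so it is biased downward). The helper name error_summary is likewise hypothetical.

```python
from itertools import combinations
from math import sqrt

def error_summary(box, n, estimator, parameter):
    """Exact bias, SE, and MSE of an estimator under simple random sampling.

    Enumerates every subset of size n; all subsets are equally likely.
    """
    values = [estimator(sample) for sample in combinations(box, n)]
    expected = sum(values) / len(values)
    bias = expected - parameter
    se = sqrt(sum((v - expected) ** 2 for v in values) / len(values))
    mse = sum((v - parameter) ** 2 for v in values) / len(values)
    return bias, se, mse

box = [1, 3, 4, 8, 9]   # hypothetical population
bias, se, mse = error_summary(box, 2, max, max(box))

print("bias:", bias)
print("SE:", se)
print("MSE:", mse)
print("bias^2 + SE^2:", bias ** 2 + se ** 2)
print("RMSE:", sqrt(mse))
```

The last two printed quantities before the RMSE agree up to rounding, as the identity requires.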
The MSE and RMSE measure the average error of an estimator. That is, we expect the value of an estimator to differ from the value of the parameter by roughly the RMSE. For any particular sample, however, the estimate could differ from the parameter by more than or by less than the RMSE. Typically, we cannot tell how much they differ, because we only know the value of the estimator, and not the true value of the parameter.
Estimating from a sample is like shooting a rifle. The parameter is the bullseye, and each shot is the value of the estimator from one random sample. A systematic tendency for all the shots to miss the bullseye in the same direction is bias: Bias is the difference between the average location of the shots and the bullseye. The scatter in the shots is measured by the standard error: the root-mean-square of the distances between each shot and the average location of all the shots. The average squared distance between the bullseye and where the shots land is the mean squared error. For the mean squared error to be small, both the bias and the standard error must be small. If the standard error is zero, but the bias is not, the estimator is like a very accurate rifle that has its sights mis-calibrated: All the shots hit the same spot, but that spot is not the bullseye. If the bias is zero but the standard error is not, the estimator is like an inaccurate rifle that is sighted in correctly: The shots are scattered around the bullseye, but typically miss the bullseye. If both the bias and the standard error are zero, so is the mean squared error, and the estimator is like a very accurate rifle that is sighted in correctly: All the shots hit the bullseye.
There are always many estimators one could consider using to estimate a given parameter. We need some reasonable criteria for picking a sensible estimator. A branch of statistics called Decision Theory addresses the problem of finding an estimator that is optimal, given a criterion for comparing estimators. For example, many statisticians consider MSE (or RMSE) to be a reasonable measure of the accuracy of an estimator. In choosing among a collection of estimators of a particular parameter they might choose the estimator that has the smallest MSE. MSE is a common measure of accuracy, but certainly not the only one.
Other statisticians believe that it is more important for an estimator to be unbiased than to have the smallest possible MSE. Those statisticians might limit their choices to unbiased estimators. Within the collection of unbiased estimators, they might seek the one with the smallest SE (and hence the smallest MSE among unbiased estimators, because the MSE of an unbiased estimator is equal to the square of its SE).
For many parameters, it is possible to find an unbiased estimator if the sample is drawn by simple random sampling. However, it is common that the estimator with the smallest MSE is biased—an unbiased estimator cannot always have the smallest possible MSE. In the next section we look at two very common parameters—the population mean and the population percentage—and at the most common estimators of those parameters—the sample mean and the sample percentage. These estimators are unbiased when the data are a simple random sample.
We saw earlier that the expected value of the sample mean of n random draws with or without replacement from a box is equal to the population mean, the average of the numbers on the tickets in the box. Thus the sample mean is an unbiased estimator of the population mean. Because a percentage is the mean of a list that consists of only zeros and ones, the sample percentage φ is an unbiased estimator of the population percentage.
Because the sample mean and sample percentage of simple random samples are unbiased estimators of the population mean and population percentage, respectively, they would seem to be reasonable estimators of those parameters. In fact, they are the most widely used estimators of the population mean and the population percentage. Because the sample mean and sample percentage are unbiased, their MSEs are the squares of their SEs, which we studied earlier.
The following tool helps visualize the bias, SE, and MSE of the sample mean and sample percentage under simple random sampling.
Whenever you load this chapter, the box in the tool will be filled with 20 random integers between 0 and 50, the sample size will be set to 5, and the number of samples to take will be set to 1. The average of the 20 numbers in the box is given as Ave(box) on the left of the figure; SD(box) is the SD of the numbers in the box. When you click Take Sample, the computer draws a pseudo-random sample of size 5 without replacement from the box, and computes the mean of those 5 numbers. That mean is appended to a list of values of the sample mean from previous times you clicked the button. A histogram of that list of observed values of the sample mean is displayed, and Mean(values) and SD(values) show the mean and SD of the list of observed values of the sample mean. If you alter the contents of the box, the process starts over.
Click Take Sample a few times to get a feel for what happens. The values of the sample mean will tend to be scattered, and rarely will equal the average of the numbers in the box exactly. Then change the number of samples to take to 1000, and click Take Sample until you have drawn a total of 10,000 samples of size 5. You should now see that there is some structure to the random values of the sample mean: The average of the observed values of the sample mean should be quite close to the expected value of the sample mean, and the SD of the observed values of the sample mean should be quite close to the SE of the sample mean. Try replacing the contents of the box with a list of only zeros and ones, to confirm that the same results hold for the population percentage and sample percentage.
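The same experiment can be imitated offline. The sketch below is not the tool itself: it is a minimal simulation under the settings described above (a box of 20 random integers between 0 and 50, samples of size 5 drawn without replacement, 10,000 samples). The seed and all variable names are arbitrary choices made for the illustration.

```python
import random
from math import sqrt
from statistics import pstdev

rng = random.Random(0)   # arbitrary seed, so the run is reproducible

# A box of 20 random integers between 0 and 50, as in the description.
box = [rng.randint(0, 50) for _ in range(20)]
N, n, n_samples = len(box), 5, 10_000

# Draw simple random samples (without replacement) and record each mean.
sample_means = [sum(rng.sample(box, n)) / n for _ in range(n_samples)]

ave_box = sum(box) / N
sd_box = pstdev(box)                                   # SD(box), divisor N
se_theory = sqrt((N - n) / (N - 1)) * sd_box / sqrt(n)

mean_of_values = sum(sample_means) / n_samples
sd_of_values = pstdev(sample_means)

print("Ave(box):", ave_box, " Mean(values):", mean_of_values)
print("SE of sample mean:", se_theory, " SD(values):", sd_of_values)
```

Replacing box with a list of zeros and ones turns the same run into a check on the sample percentage.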
Recall that for a simple random sample (a random sample without replacement) of size n from a box of N tickets labeled with numbers whose SD is SD(box),
SE(sample mean) = ( (N − n)/(N − 1) )^{½} × SD(box)/n^{½}.
The first term in the product on the right hand side is the finite population correction, and the second term is the SE of the sample mean for sampling with replacement. If the size of the population is much larger than the size of the sample (if N >> n), there is not much difference between sampling with replacement and sampling without replacement, whatever numbers are on the tickets, and the finite population correction is nearly unity.
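The SE of the sample mean under simple random sampling is f × SD(box)/n^{½}, with finite population correction f = ( (N − n)/(N − 1) )^{½}. A minimal sketch of that calculation follows; the function names and the example box are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import pstdev

def finite_population_correction(N, n):
    """f = sqrt((N - n)/(N - 1)) for a sample of size n from N units."""
    return sqrt((N - n) / (N - 1))

def se_sample_mean(box, n, replace=False):
    """SE of the sample mean of n draws from the box.

    With replacement: SD(box)/sqrt(n).
    Without replacement (simple random sample): the same, times f.
    """
    se_with_replacement = pstdev(box) / sqrt(n)   # pstdev divides by N
    if replace:
        return se_with_replacement
    return finite_population_correction(len(box), n) * se_with_replacement

box = [1, 3, 4, 8, 9]   # hypothetical box
print("SE without replacement:", se_sample_mean(box, 2))
print("SE with replacement:   ", se_sample_mean(box, 2, replace=True))

# The correction approaches 1 as N grows with n held fixed.
for N in (100, 10_000, 1_000_000):
    print(N, finite_population_correction(N, 50))
```

When n = N the correction is zero: a sample that contains the whole population determines the population mean without error.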
Suppose we did not know the contents of the box, but were told the size of the population (N), and were given a simple random sample of size n from the population. What might we infer about the population mean?
It would be reasonable to guess that the sample mean was roughly equal to the population mean, give or take a random amount that is expected to be about one SE of the sample mean. Unfortunately, if we do not know the contents of the box, we are unlikely to know the SD of the numbers in the box, so we cannot calculate the SE of the sample mean. We would have an estimate of the population mean, but would have no idea how far off the estimate was likely to be (at least, not without extra work, as described presently).
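There are two common ways to estimate the uncertainty of the sample mean as an estimate of the population mean: bound the SD of the box from above, which is possible when the box is known to contain only zeros and ones; or estimate the SD of the box from the sample itself, an approach called the bootstrap.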
For any list that contains only zeros and ones,
SD(list) ≤ 50%,
no matter how many ones or zeros are in the list. This worst case corresponds to a box of tickets of which 50% are labeled with zeros and 50% are labeled with ones; then
SD(box) = (0.5×0.5)^{½} = 0.5
= 50%.
Thus the SE of the sample percentage for a simple random sample of size n is at most
f × 50%/n^{½},
where
f = ((N − n)/(N − 1))^{½}
is the finite population correction.
In summary, the sample percentage φ of a simple random sample is a reasonable estimate of the population percentage, and its root mean squared error is no larger than
f × 50%/n^{½}.
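To see how conservative the bound f × 50%/n^{½} can be, it helps to compare it with the true SE for a box whose percentage of ones is far from 50%. The sketch below expresses percentages as fractions (0.5 for 50%); the function names and the numbers n = 400 and p = 0.02 are hypothetical.

```python
from math import sqrt

def fpc(N, n):
    """Finite population correction f = sqrt((N - n)/(N - 1))."""
    return sqrt((N - n) / (N - 1))

def se_upper_bound(n, N=None):
    """Upper bound on the SE of the sample percentage, as a fraction.

    Pass N for simple random sampling; omit it for sampling with replacement.
    """
    bound = 0.5 / sqrt(n)
    return bound if N is None else fpc(N, n) * bound

def se_sample_percentage(p, n, N=None):
    """True SE of the sample percentage when a fraction p of tickets are ones."""
    se = sqrt(p * (1 - p)) / sqrt(n)
    return se if N is None else fpc(N, n) * se

n = 400
print("bound:            ", se_upper_bound(n))
print("true SE, p = 0.5: ", se_sample_percentage(0.5, n))    # attains the bound
print("true SE, p = 0.02:", se_sample_percentage(0.02, n))   # far below it
```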
This estimate of the RMSE can be extremely conservative (i.e., much larger than the true RMSE of the sample percentage). We can use the sample itself to estimate the SD of the box, and hence the SE of the sample percentage or sample mean. If the sample size is large, the estimated SE is likely to be close to the true SE.
The bootstrap estimates the uncertainty of the sample percentage by pretending that the sample is the population, and calculating the uncertainty the sample percentage φ would have if we were drawing random samples repeatedly from the sample—as if the sample were the population. If we are sampling without replacement, we have to inflate the size of the sample to match the size of the population, by imagining we are sampling from a population the same size as the real population, but with a proportion of ones that matches the proportion of ones in the sample. That is, the bootstrap estimate of the SE of the sample percentage φ is the SE the sample percentage would have if the proportion p of ones in the box were equal to the proportion φ of ones in the sample. This corresponds to estimating the SD of the box by the SD of the sample, namely,
(bootstrap estimate of SD) = s^{*}= ( (fraction of ones in the sample)×(fraction of zeros in the sample) )^{½}
= ( φ × (1 − φ) )^{½}
If the sample is large, this is likely to be close to the SD of the box. If the sample size is small, the uncertainty in the bootstrap estimate of the SD is large.
The corresponding bootstrap estimate of the standard error of the sample percentage, SE(φ), is
(bootstrap estimate of SE of sample percentage) = f × s^{*}/n^{½}
= f × ( φ × (1 − φ) )^{½}/n^{½},
where f = ( (N − n)/(N − 1) )^{½} is the finite population correction.
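The bootstrap estimate of the SE of the sample percentage takes only a few lines to compute. The sketch below assumes the sample is given as a list of zeros and ones; the function name, the sample, and the population size N = 1000 are hypothetical.

```python
from math import sqrt

def bootstrap_se_percentage(sample, N=None):
    """Sample percentage phi and bootstrap estimate of SE(phi).

    sample: list of zeros and ones.
    N: population size for simple random sampling; omit for sampling
       with replacement (no finite population correction).
    """
    n = len(sample)
    phi = sum(sample) / n                 # sample percentage, as a fraction
    s_star = sqrt(phi * (1 - phi))        # SD of the sample
    se = s_star / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))     # finite population correction f
    return phi, se

sample = [1] * 30 + [0] * 70              # hypothetical sample: 30% ones
phi, se = bootstrap_se_percentage(sample, N=1000)
print("sample percentage:", phi)
print("bootstrap estimate of SE:", se)
```

The estimate can never exceed the conservative bound f × 50%/n^{½}, because φ × (1 − φ) is at most 1/4.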
The following exercise checks your ability to compute the sample percentage, the upper bound on the SE of the sample percentage, and the bootstrap estimate of the SE of the sample percentage, and refreshes your knowledge of sampling designs.
The bootstrap estimate of the SD of the box can be used even when the box contains numbers other than only zeros and ones. The key idea of the bootstrap is to estimate the SD of the population by the SD of the sample:
estimated SD of box = s^{*} = ( (1/n) × ( (x_{1} − M)^{2} + (x_{2} − M)^{2} + … + (x_{n} − M)^{2} ) )^{½},
where {x_{1}, x_{2}, … , x_{n}} are the data and M is their sample mean:
M = (1/n) × ( x_{1} + x_{2} + … + x_{n} ).
It turns out that the bootstrap estimate of the SD of the box is biased: Its expected value is less than the SD of the labels on the tickets in the box. The bias can be understood from the characterization of the mean as the number from which the rms of the deviations is smallest. The rms of the deviations of the data from their own (sample) mean never is larger than, and typically is smaller than, the rms of the deviations of the data from the mean of the labels on all the tickets in the box (the population mean). The expected value of s^{*} is the average of the possible values of s^{*}, weighted by their probabilities. Each possible value is at most the rms of the deviations of the sample from the population mean, and typically less. The expected value of s^{*} is thus less than the expected rms deviation of the sample from the population mean. That expected rms deviation is in turn no larger than SD(box), because the expected value of its square is (SD(box))^{2} and (E(X))^{2} never exceeds E(X^{2}). So the expected value of s^{*} is less than the SD of the list of numbers on the tickets in the box.
For that reason, it is common to estimate the SD of the box using an estimator that takes a slightly larger value than s^{*}, no matter what the sample happens to be. This more common estimator, called the sample standard deviation s, differs only slightly from s^{*}: s^{*} divides by n before taking the square root to form the rms of the residuals, while s divides by n−1 before taking the square root.
The Sample Standard Deviation s
If there are n data,
{x_{1}, x_{2}, … , x_{n}},
with sample mean
M = (x_{1} + x_{2} + … + x_{n})/n,
then the sample standard deviation s is defined by
s = ( ((x_{1} − M)^{2} + (x_{2} − M)^{2} + … + (x_{n} − M)^{2})/(n−1) )^{½}.
The relationship between s and s^{*} is
s = s^{*} × ( n/(n−1) )^{½} ,
so s is larger than s^{*} (unless both are zero), by a fraction that is negligible when the sample size n is large. Note that s^{*} is the standard deviation of the sample, while s is the sample standard deviation. For samples that contain only zeros and ones,
s = ((sample percentage)×(1 − sample percentage) ×n/(n−1) )^{½}.
When the box is known to contain only zeros and ones, it is more common to estimate the SD of the box by s^{*} than by s.
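In Python's standard library, statistics.pstdev divides by n and so computes s^{*}, while statistics.stdev divides by n−1 and so computes s. The sketch below checks the relationship between the two on hypothetical data, including a sample of zeros and ones.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]   # hypothetical sample
n = len(data)

s_star = pstdev(data)   # SD of the sample: divides by n
s = stdev(data)         # sample standard deviation: divides by n - 1

print("s* =", s_star)
print("s  =", s)
print("s* x sqrt(n/(n-1)) =", s_star * sqrt(n / (n - 1)))

# For a sample of zeros and ones, s* reduces to sqrt(phi x (1 - phi)).
zero_one = [1, 0, 0, 1, 1, 0, 1, 1]     # hypothetical 0-1 sample
m = len(zero_one)
phi = sum(zero_one) / m
print("s* =", pstdev(zero_one), " formula:", sqrt(phi * (1 - phi)))
print("s  =", stdev(zero_one), " formula:", sqrt(phi * (1 - phi) * m / (m - 1)))
```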
For sampling with replacement, s^{2} is an unbiased estimator of the square of the SD of the box. However, that does not imply that s is an unbiased estimator of SD(box) (recall that E(X^{2}) typically is not equal to (E(X))^{2}), nor is s^{2} an unbiased estimator of the square of the SD of the box when the sample is drawn without replacement. The square of the SD of the box is called the variance of the population, the average of the squares of the deviations of the numbers from their mean; s^{2} is called the sample variance.
The following tool can be used to study the sampling distribution of the sample variance s^{2}.
Use the tool to draw 10,000 samples from the box with and without replacement. Note that the average of the observed values of s^{2} approaches its expected value in both cases. However, the expected value of s^{2} is the square of the SD of the contents of the box (the population variance) when the sample is drawn with replacement, but larger than the population variance when the sample is drawn without replacement.
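For a small box the expected value of s^{2} can be computed exactly rather than simulated. The sketch below uses a hypothetical five-ticket box and exact rational arithmetic. It enumerates all ordered sequences of draws for sampling with replacement and all subsets for simple random sampling; in each design the enumerated outcomes are equally likely, so expected values are plain averages.

```python
from fractions import Fraction
from itertools import combinations, product

def sample_variance(sample):
    """s^2: sum of squared deviations from the sample mean, over n - 1."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample) / (n - 1)

def average(values):
    """Plain average of equally likely values."""
    values = list(values)
    return sum(values) / len(values)

box = [1, 3, 4, 8, 9]   # hypothetical box
N, n = len(box), 3

mu = Fraction(sum(box), N)
population_variance = sum((x - mu) ** 2 for x in box) / N   # (SD(box))^2

e_with = average(sample_variance(s) for s in product(box, repeat=n))
e_without = average(sample_variance(s) for s in combinations(box, n))

print("population variance:         ", population_variance)
print("E(s^2), with replacement:    ", e_with)
print("E(s^2), without replacement: ", e_without)
```

With replacement the expected value equals the population variance. Without replacement it exceeds the population variance by the factor N/(N−1), which is close to 1 unless the box is small.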
We now have the ingredients to estimate the population mean, and to estimate the SE of the estimate of the population mean:
Estimating the population mean or percentage
Consider a random sample of size n from a population of size N, taken with or without replacement.
The sample mean and sample percentage are unbiased estimates of the population mean and population percentage.
The following example illustrates using the sample mean to estimate the population mean. The example is dynamic: The data tend to change when you reload the page.
The following exercise checks your ability to estimate the population mean from the sample mean, and to estimate the uncertainty of that estimate. The exercise is dynamic: The data tend to change when you reload the page.
If the sample size is small enough, relative to the size of the population, then the finite population correction is close to one, and the SE of the sample mean essentially depends only on the sample size n, and not the population size N.
For example, suppose we seek to estimate the percentage of Democrats in the San Francisco Bay Area, and in the United States as a whole, using the sample percentages of simple random samples of size 3000. Because 3000 is but a small fraction of either population, the SE of the two sample percentages would be about the same, assuming that the percentage of Democrats in the San Francisco Bay Area is not that different from the percentage in the United States as a whole, even though the United States has a much larger population than the Bay Area.
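A quick calculation shows how little the population size matters here. The population sizes below are round hypothetical figures chosen only for illustration, not census data, and the population percentage is assumed to be 50% in both places.

```python
from math import sqrt

def se_sample_percentage(p, n, N):
    """SE of the sample percentage for a simple random sample of size n."""
    f = sqrt((N - n) / (N - 1))           # finite population correction
    return f * sqrt(p * (1 - p)) / sqrt(n)

n = 3000
p = 0.5   # assumed population percentage, the same in both populations

# Hypothetical round population sizes, for illustration only.
populations = {
    "Bay Area (hypothetical N)": 7_000_000,
    "United States (hypothetical N)": 330_000_000,
}

se_by_population = {
    name: se_sample_percentage(p, n, N) for name, N in populations.items()
}
se_with_replacement = sqrt(p * (1 - p)) / sqrt(n)

for name, se in se_by_population.items():
    print(name, se)
print("sampling with replacement:", se_with_replacement)
```

Under these assumptions all three values are a little over 0.9 percentage points, and they differ from one another by a tiny fraction of that.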
Some studies report a margin of error for parameter estimates. This is particularly common for sample surveys. "Margin of error" does not have a widely accepted definition, but typically it is either one or two times the estimated SE of the parameter estimate.
The following exercises check your ability to estimate population percentages, and your understanding of the dependence of the SE of the sample mean and sample percentage on the sample size. The exercises are dynamic: The data tend to change when you reload the page.
The formulae in this chapter are for simple random sampling, and, in some cases, sampling with replacement with equal probability of selecting each unit. They do not apply to any other sampling design. In particular, unless the sample is selected at random, the error in the sample mean cannot be decomposed into bias and sampling error—there is no way to quantify "the luck of the draw." If the sample is selected at random, but not by simple random sampling or sampling with replacement, other formulae can be derived to quantify the bias and standard error of the sample mean.
For example, if the sample is drawn at random in such a way that every unit in the population has the same chance of being in the sample, the sample mean will be an unbiased estimator of the population mean. However, the SE of the sample mean depends crucially on the sample design.
An estimator is a statistic, a number calculated from a sample to estimate a population parameter. Consider an estimator X of a parameter t calculated from a random sample. The bias of the estimator X is the expected value of (X−t), the expected difference between the estimator and the parameter it is intended to estimate. If the bias of an estimator is zero, the estimator is unbiased; otherwise, it is biased. The bias of an estimator is the long-run average amount by which it differs from the parameter in repeated sampling. Any estimator can be written as the sum of three terms: the parameter it is intended to estimate, the bias of the estimator, and a chance error that has expected value zero. The square-root of the expected value of the square of the chance error is the SE of the estimator. The SE measures the long-run scatter of the estimator in repeated sampling. Estimators can be compared using summaries of their expected error. One common summary of the error of estimators is the mean squared error (MSE), E( (X−t)^{2} ), the expected value of the square of the difference between an estimator and the parameter it is intended to estimate. However, there are countless other criteria for comparing estimators. The square-root of the MSE is the root mean squared error (RMSE). The units of the RMSE are the same as the units of the estimator, so it is easier to interpret than the MSE. The RMSE of an estimator is equal to the square-root of the sum of the square of the bias of the estimator and the square of the SE of the estimator. The RMSE of an unbiased estimator is its SE.
The sample mean for random sampling with or without replacement is an unbiased estimator of the population mean. Consequently, the sample percentage φ for random sampling with or without replacement is an unbiased estimator of the population percentage. The SE of the sample mean of a simple random sample of size n from a box of N tickets labeled with numbers is f×SD(box)/n^{½}, where f=(N−n)^{½}/(N−1)^{½} is the finite population correction and SD(box) is the SD of the population of numbers on the tickets in the box. For random sampling with replacement, the SE of the sample mean is SD(box)/n^{½}. Because the SD of a 0-1 box is (p(1−p))^{½}, where p is the fraction of tickets in the box labeled "1," the SE of the sample percentage φ of a simple random sample of size n from a box of N tickets labeled with zeros and ones is
SE(φ) = f×(p(1−p))^{½}/n^{½}.
For random sampling with replacement,
SE(φ)=(p(1−p))^{½}/n^{½}.
It is rare that SD(box) is known in a situation where one desires to estimate the population mean or population percentage, so these formulae are not directly useful. However, there are ways to estimate SD(box). For any 0-1 box, SD(box)≤50%, which gives convenient upper bounds on SE(φ): For simple random sampling, SE(φ)≤f×50%/n^{½}, and for random sampling with replacement, SE(φ)≤50%/n^{½}. These upper bounds can be quite pessimistic. There is no upper bound on the SD of an arbitrary list of numbers, and thus no upper bound on the SE of the sample mean of random draws from an arbitrary box. However, SD(box) can be estimated from the sample if the sample size is large. The bootstrap estimate of SD(box) is s^{*}, the SD of the sample. For 0-1 boxes, the SD of the sample is s^{*}=(φ(1−φ))^{½}, where φ is the sample percentage. When the sample size is large, s^{*} tends to be an accurate estimate of SD(box), and the corresponding estimate of the SE of the sample mean, f×s^{*}/n^{½} (for simple random sampling) or s^{*}/n^{½} (for random sampling with replacement) tends to be accurate. The sample standard deviation s=s^{*}×(n/(n−1))^{½} is a bit larger than the SD of the sample. When the sample size n is large, s and s^{*} are almost equal. For random sampling with replacement, s^{2} is an unbiased estimate of the square of the SD of the list of numbers in the box. (The square of the SD of the list is called the variance of the list.) It is more common to use the bootstrap estimate s^{*} of SD(box) for estimating the SE of the sample percentage and to use the sample standard deviation s to estimate SD(box) for estimating the SE of the sample mean. These formulae apply only for simple random samples and random samples with replacement, not for other probability samples or non-probability samples.