This chapter continues our study of estimating population parameters from random samples. In chapter 17, "Estimating Parameters from Random Samples," we studied estimators that assign a number to each possible random sample, and the uncertainty of such estimators, measured by their RMSE. (The RMSE is the square-root of the expected value of the squared difference between the estimator and the parameter—a measure of the typical size of the error.) Instead of assigning a single number to each sample and reporting the size of a typical error, the methods in this chapter assign an interval to each sample and report the confidence level that the interval contains the parameter. Confidence is a technical term related to probability. Just as the RMSE of an estimator measures the long-run average size of the error in repeated sampling, but the error for any particular sample could be smaller or larger than the RMSE, the confidence level is the long-run fraction of intervals that contain the parameter in repeated sampling, but the interval for any particular sample might or might not contain the parameter.
The statement "the interval [92%, 94%] contains the population percentage at confidence level 90%" does not mean that the probability that the population percentage is between 92% and 94% is 90%. (The event that the interval [92%, 94%] contains the population percentage is not random: either the population percentage is between 92% and 94%, or it is not.) Rather, the statement means that if we were to take samples of size n repeatedly and compute a 90% confidence interval for the population percentage from each sample, the long-run fraction of intervals that contain the population percentage would converge to 90%.
The length of the confidence interval and the confidence level measure how accurately we are able to estimate the parameter from a sample. If a short interval has high confidence, the data allow us to estimate the parameter accurately. Higher confidence generally requires a longer interval, ceteris paribus, and shorter intervals generally have lower confidence levels. Conventional values for the confidence level of confidence intervals include 68%, 90%, 95%, and 99%, but sometimes other values are used.
In this section, we develop conservative confidence intervals for the population percentage based on the sample percentage, using Chebychev’s Inequality and an upper bound on the SD of lists that contain only the numbers 0 and 1. "Conservative" means that the chance that the procedure produces an interval that contains the population percentage is at least as large as claimed.
Consider a 0-1 box of N tickets. The population percentage p is the fraction of tickets labeled "1":
p = 100% × (# tickets in the population labeled "1")/N.
The population percentage is also the population mean of the numbers on all the tickets in the box, ave(box). The sample percentage φ of a simple random sample (random sample without replacement) of size n from the population of N tickets is
φ = 100% × (# tickets in the sample labeled "1")/n.
The sample percentage is the sample mean of the labels on the tickets in the sample. The expected value of the sample percentage φ is the population percentage p, and the SE of the sample percentage φ is
SE(φ) = f × ( p×(1−p) )½/n½
≤ f ×50%/n½,
where f is the finite population correction
f = (N −n)½/(N − 1)½.
Thus f ×50%/n½ is an upper bound on the SE of the sample percentage.
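As a quick numerical check of this bound, the following Python sketch computes the finite population correction, the true SE of the sample percentage, and the upper bound. The function names and the values of N, n, and p are illustrative assumptions, not part of the text, and percentages are written as fractions (0.5 rather than 50%).

```python
import math

def fpc(N, n):
    """Finite population correction for a simple random sample of size n from N tickets."""
    return math.sqrt((N - n) / (N - 1))

def se_sample_percentage(p, N, n):
    """True SE of the sample percentage when the population fraction of ones is p."""
    return fpc(N, n) * math.sqrt(p * (1 - p)) / math.sqrt(n)

def se_upper_bound(N, n):
    """Conservative bound: replace sqrt(p*(1-p)) by its largest possible value, 0.5."""
    return fpc(N, n) * 0.5 / math.sqrt(n)

if __name__ == "__main__":
    N, n = 1000, 100  # hypothetical population and sample sizes
    for p in (0.1, 0.3, 0.5, 0.9):
        print(p, round(se_sample_percentage(p, N, n), 4), round(se_upper_bound(N, n), 4))
```

The bound equals the true SE only when p is 50%; for every other p it is strictly larger.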
The interactive figure shows what happens if we center an interval at the sample percentage, and extend the interval down and up from the sample percentage by twice the upper bound on the SE of the sample percentage. When the interval includes the population percentage, we say the interval covers the truth. The interval is random, because it is centered at the sample percentage, which is random. The chance that the random interval will contain the true population percentage is called the coverage probability of the interval. Take a few samples by pressing the "Take Sample" button to get the feel of the tool; then increase the "Samples to Take" to 1000 and click the Take Sample button again. The actual percentage of intervals that cover will vary, but almost always it will be larger than 75%, sometimes nearly 100%. The empirical percentage of intervals that cover is an estimate of the coverage probability of the procedure. Vary the sample size and put a few different lists of zeros and ones into the Population box at the right of the figure, and try a few different sample sizes for each population. You should find that the fraction of intervals that cover the true population percentage stays above 75% (almost without fail), no matter what the population of zeros and ones is.
Why do these random intervals cover the true population percentage so often? We can show that they should using Chebychev's inequality. Because
SE(φ) ≤ f × 50%/n½,
the event
| φ − p | ≤ k × SE(φ)
is a subset of the event
| φ − p | ≤ k × f × 50%/n½.
It follows that
P( | φ − p | ≤ k × SE(φ) ) ≤ P( | φ − p | ≤ k × f × 50%/n½ ).
Chebychev's inequality guarantees that the chance the sample percentage φ differs from its expected value p by more than k times its standard error is at most 1/k², so
1 − 1/k² ≤ P( |φ − p| ≤ k × SE(φ) )
≤ P( |φ − p| ≤ k × f × 50%/n½ ).
That is,
P( |φ − p| ≤ k × f × 50%/n½ ) ≥ 1 − 1/k².
Therefore, in the long run in repeated sampling, the fraction of trials in which the sample percentage φ is within ±2×f×50%/n½ of the population percentage p converges to a number that is 75% or larger. Whenever φ is within ±2×f×50%/n½ of the population percentage p, an interval centered at φ extending down and up by 2×f×50%/n½ will contain p. That is, the interval
φ ± 2× f × 50%/n½,
which is shorthand for
[ φ − 2 × f × 50%/n½, φ + 2 × f × 50%/n½ ],
contains p at least 75% of the time, in the long run. Similarly, the fraction of trials in which φ is within ±3×f×50%/n½ of p converges to a number that is 88.89% or larger, so the long-run fraction of intervals φ±3×f×50%/n½ that contain p will be 88.89% or larger. The fraction of trials in which φ is within ±4×f×50%/n½ of p converges to a number that is 93.75% or larger, so the long-run fraction of intervals φ±4×f×50%/n½ that contain p will be 93.75% or larger, etc.
Change the "Intervals: ±" value in the figure to 3 and then to 4 to confirm empirically that this is true.
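The repeated-sampling experiment can also be sketched in code. The following Python simulation is an illustration under assumed values (a hypothetical box of 30 ones and 70 zeros, samples of size 25, a fixed seed), not a reproduction of the interactive figure. It draws simple random samples and records how often the interval φ ± k×f×50%/n½ covers p.

```python
import random

def coverage_fraction(box, n, k, trials, seed=0):
    """Fraction of simple random samples whose interval phi +/- k*f*0.5/sqrt(n) covers p."""
    rng = random.Random(seed)
    N = len(box)
    p = sum(box) / N                       # population percentage, as a fraction
    f = ((N - n) / (N - 1)) ** 0.5         # finite population correction
    half_width = k * f * 0.5 / n ** 0.5    # k times the upper bound on the SE
    covered = 0
    for _ in range(trials):
        sample = rng.sample(box, n)        # draws without replacement
        phi = sum(sample) / n              # sample percentage
        if abs(phi - p) <= half_width:     # interval centered at phi contains p
            covered += 1
    return covered / trials

if __name__ == "__main__":
    box = [1] * 30 + [0] * 70              # hypothetical 0-1 box
    for k in (2, 3, 4):
        print(k, 1 - 1 / k**2, coverage_fraction(box, n=25, k=k, trials=2000))
```

For each k, the printed empirical fraction can be compared with the guaranteed minimum 1 − 1/k² printed beside it.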
The interval φ±k×f×(50%/n½) is random: Its center depends on φ, which in turn depends on which units (here, tickets) happen to be in the random sample. The probability is in the random sampling procedure, not in the parameter. The parameter is the same, no matter what sample we happen to get—the parameter is a property of the population, not the sample. It is the interval that varies with the random sample. Before the data are collected, the coverage probability is the chance that sampling will result in an interval that contains the parameter.
Taking the sample determines the interval, leaving nothing to chance: The interval the procedure produced either does or does not contain the population percentage. (One could say that after collecting the data, the chance that the interval covers the parameter is either 0 or 100%.) Typically, we never learn whether the interval covers the parameter, but our ignorance is not a probability (at least, not according to the frequency theory of probability used in this book).
The interval the procedure gives for any particular set of data is called a confidence interval. The confidence level of a confidence interval is equal to the coverage probability of the procedure before the data are collected.
Confidence is a word statisticians reserve for this idea. If, before collecting the data, the procedure we are using has a P% chance of producing an interval that covers the true population percentage, then, after collecting the data, the interval the procedure produced is called a P% confidence interval.
Coverage Probability and Confidence Level
Consider a population parameter, and a procedure that produces random intervals. Suppose that the probability that the procedure produces an interval that contains the parameter is P %.
In repeated sampling, about P% of confidence intervals with confidence level P% will contain (cover) the parameter. About (100−P)% of the intervals will not cover the parameter. For any particular sample, unless the population parameter is known, we will not know whether the confidence interval covers the parameter.
In Chapter 17, we summarized the uncertainty of an estimate of a parameter by the mean squared error or root mean squared error of the estimator, which are measures of the average error of the estimator in repeated sampling. A confidence interval is a different way of expressing the uncertainty in an estimate: a range of values that contains the parameter with specified confidence level.
The interpretation of confidence level for a particular interval is analogous to the interpretation of RMSE for a particular value of the estimate: The RMSE is the square-root of the long-run average squared error of the estimator in repeated sampling, but for any particular sample, the error could be larger or smaller than the RMSE—and we will not know which unless we know the true value of the parameter. The confidence level measures the long-run fraction of intervals that contain the parameter in repeated sampling, but for any particular sample, the confidence interval either will or will not contain the parameter—and we will not know which unless we know the true value of the parameter.
We can use the approach developed in this section to construct confidence intervals for the population percentage p with other nominal confidence levels, by extending the interval up and down from the sample percentage φ by larger or smaller amounts. The longer the intervals, the larger the nominal confidence level—the larger the chance that an interval will contain p. The shorter the intervals, the smaller the chance that an interval will contain p. In particular, if we choose k so that
1 − 1/k² = P%,
then the interval
[ φ − k × f× 50%/n½, φ + k × f× 50%/n½ ]
is a (nominal) P % confidence interval for the population percentage p.
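Solving 1 − 1/k² = P% for k gives k = 1/(1 − P%)½. The sketch below, with hypothetical function names and made-up inputs, turns a nominal confidence level into k and then into the conservative interval. Levels and percentages are written as fractions.

```python
import math

def chebyshev_k(level):
    """Solve 1 - 1/k**2 = level for k, where level is a fraction strictly between 0 and 1."""
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    return 1 / math.sqrt(1 - level)

def conservative_interval(phi, n, N, level):
    """Conservative interval for p from a simple random sample of size n out of N."""
    f = math.sqrt((N - n) / (N - 1))                   # finite population correction
    half_width = chebyshev_k(level) * f * 0.5 / math.sqrt(n)
    return phi - half_width, phi + half_width

if __name__ == "__main__":
    # Hypothetical data: 93% ones in a sample of 400 from a population of 10,000.
    for level in (0.75, 0.90, 0.95):
        print(level, round(chebyshev_k(level), 3), conservative_interval(0.93, 400, 10_000, level))
```

The endpoints are not clipped to the range from 0% to 100%, so a wide interval near the boundary can extend past it.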
The actual coverage probability of the interval
[ φ − k × f × 50%/n½, φ + k × f × 50%/n½ ]
is greater than (1 − 1/k²), for two reasons. First, the standard error of the sample percentage φ is less than f×(50%/n½) unless the population percentage p is 50%. Second, the distribution of the sample percentage is that of a hypergeometric random variable divided by the sample size, n, and such a distribution cannot attain the bound in Chebychev's inequality: Even for the true SE of the sample percentage,
SE(φ) = f × ( p × (1−p) )½/n½,
P( | φ − p | < k × SE(φ) ) > 1 − 1/k².
As a result, confidence intervals for the population percentage based on Chebychev's inequality and the upper bound of 50% for the SD of a list of zeros and ones are conservative: the actual confidence level is greater than the nominal confidence level, (1 − 1/k²). The next section develops a procedure that is not conservative, but that is approximate: The confidence level could be larger or smaller than the nominal level. (The nominal confidence level is close to the actual confidence level when the sample size n is large.)
Whenever you use a confidence interval, it is crucial to state the confidence level. Otherwise, it is impossible to interpret the result. The choice of the confidence level is essentially arbitrary, but the choice should be made before collecting the data. Common values of the confidence level are 68%, 90%, 95%, and 99%. There is a tradeoff between precision (the length of the confidence interval) and confidence level: Higher confidence levels require longer confidence intervals.
The following exercise checks your ability to compute a conservative confidence interval for the population percentage.
Confidence intervals for the population percentage based on Chebychev's inequality and the upper bound of 50% for the SD of lists of zeros and ones are conservative: Their true confidence level is greater than their nominal confidence level, (1 − 1/k²). We could use shorter intervals and still have confidence level (1 − 1/k²), or we could claim a confidence level higher than (1 − 1/k²).
How much shorter could the interval be, or how large a confidence level could we claim? It is possible to figure these things out precisely, but we shall follow a standard approximate approach instead, one that we can extend to other situations. We shall use the central limit theorem to develop a procedure that produces shorter confidence intervals for a given nominal confidence level. The new procedure will be approximate instead of conservative: the coverage probability will be close to the nominal coverage probability when the sample size is large, but could be smaller or larger depending on the population percentage, and could be quite different from the nominal coverage probability for small samples from pathological populations.
We shall assume throughout the rest of this chapter that either the sample is drawn with replacement, or the sample is small compared to the population.
With this assumption, we can neglect the finite population correction and act as if the tickets in the sample were drawn independently. (See Chapter 14, "Standard Error"). When the tickets are drawn independently, the central limit theorem tells us that as the sample size grows, the normal curve is a better and better approximation to the probability histogram of the sample percentage (and to the probability histogram of the sample mean). The normal approximation to the probability that the sample percentage is in the interval
[p − 1.15×(p×(1−p))½/n½, p + 1.15×(p×(1−p))½/n½]
is equal to the area under the normal curve for the corresponding range of values in standard units, [−1.15, 1.15]. The area under the normal curve between −1.15 and 1.15 is about 75%.
This is much larger than the bound of (1 − 1/(1.15)²) = 24.4% that Chebychev's inequality gives. When the sample percentage φ is within
±1.15× ( p×(1−p) )½/n½
of p, p is within
±1.15× ( p×(1−p) )½/n½
of the sample percentage φ, so the probability that the interval
I = [ φ − 1.15 × ( p×(1−p) )½/n½, φ + 1.15 × ( p×(1−p) )½/n½ ]
contains the population percentage p is about 75%: The coverage probability of I is approximately 75%.
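The two numbers compared above can be checked directly. This short Python sketch uses the error function from the standard library to compute the area under the normal curve between −z and z, alongside the Chebychev bound for the same multiplier. The function names are illustrative.

```python
import math

def normal_area(z):
    """Area under the standard normal curve between -z and z."""
    return math.erf(z / math.sqrt(2))

def chebyshev_bound(k):
    """Lower bound 1 - 1/k**2 on the chance of being within k SEs of the expected value."""
    return 1 - 1 / k**2

if __name__ == "__main__":
    z = 1.15
    print(f"normal area between -{z} and {z}: {normal_area(z):.4f}")
    print(f"Chebychev bound for k = {z}: {chebyshev_bound(z):.4f}")
```

The normal approximation allows a much shorter interval for the same nominal level, at the price of being approximate rather than guaranteed.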
Unfortunately, we cannot construct I from the sample alone: the sample determines the center of I, but to find the length of I we need to know p×(1−p), which is tantamount to knowing p. If we knew p, we would not be estimating it.
If the sample size n is large, the sample standard deviation s
s = ( (n/(n−1)) × φ × (1 − φ) )½,
is likely to be close to the SD of the population; when that happens,
s/n½
is close to SE(φ), the standard error of the sample percentage. Therefore, if the sample size is large, but either the sample is small compared to the population or the sample is taken with replacement, the probability that the random interval
[ φ − 1.15 × s/n½, φ + 1.15 × s/n½ ]
contains the population percentage p is about 75%. This interval has not only a random center (the sample percentage φ), but also a random length (the length depends on the observed value of s, and s is random, because it depends on the random sample).
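A minimal sketch of this approximate procedure follows, assuming sampling with replacement from a hypothetical 0-1 box. It computes the interval from the sample alone, using s as defined above, and then estimates the coverage by simulation. The names, box contents, and seed are assumptions made for illustration.

```python
import random

def approx_interval(sample, z):
    """Interval phi +/- z*s/sqrt(n) computed from a 0-1 sample alone."""
    n = len(sample)
    phi = sum(sample) / n                              # sample percentage, as a fraction
    s = (n / (n - 1) * phi * (1 - phi)) ** 0.5         # sample SD of a 0-1 list
    half_width = z * s / n ** 0.5
    return phi - half_width, phi + half_width

def approx_coverage(box, n, z, trials, seed=0):
    """Empirical fraction of intervals that cover p, sampling with replacement."""
    rng = random.Random(seed)
    p = sum(box) / len(box)
    hits = 0
    for _ in range(trials):
        sample = rng.choices(box, k=n)                 # draws with replacement
        lo, hi = approx_interval(sample, z)
        if lo <= p <= hi:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    box = [1] * 40 + [0] * 60                          # hypothetical population
    for n in (30, 200):
        print(n, approx_coverage(box, n=n, z=1.15, trials=4000))
```

Because the sample percentage can take only finitely many values, the empirical coverage tends to fluctuate around 75% rather than settle exactly on it, even for fairly large n.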
The figure lets you try the procedure yourself. Each time you click the Take Sample button, a sample is drawn with replacement from the numbers in the box on the right (initially set to a random list of zeros and ones). The sample size initially is set to 30. The controls at the bottom of the figure allow you to change the size of each sample, the number of samples that are taken each time you click the button, and the width of the interval, as a multiple of the estimated SE or the conservative bound on the SE. (The estimated SE is s/n½ because we are sampling with replacement; the bound is 0.5/n½.) A label in the bottom right corner reports the fraction of intervals that cover the population percentage. Intervals that cover are green; those that do not cover are red. A small black dot marks the middle of each interval (the sample percentage). A blue vertical line marks the true population percentage p.
Take a few samples to get the feel of the tool; then increase the Samples to take to 1000, and click the Take Sample button again. The actual percentage of intervals that cover will vary, but should be reasonably close to 75%. Increase Sample size to 200 and try again; the percentage of intervals that cover should be closer to 75%. Try putting a few different lists of zeros and ones into the Population box at the right of the figure, and try a few different sample sizes for each population. When the sample size is large, the fraction of intervals that cover the true population percentage will be very close to 75%.
The following exercises check your ability to compute conservative and approximate confidence intervals for the population percentage, and your ability to determine which method is more appropriate.
Recall that percentages are just means of special lists of numbers, lists that contain only zeros and ones. We can find confidence intervals for the means of more general lists of numbers, too.
Suppose that we seek a confidence interval for the mean of a population (box) of numbers, based on a random sample from the population. The sample mean is an unbiased estimator of the population mean (E(sample mean) = ave(box)), so it is reasonable to center a confidence interval at the sample mean. How wide should we make an interval centered at the sample mean, for the interval to have a specified probability of covering the population mean?
If we knew the SD of the population, we could use Chebychev's inequality in much the same way we did at the beginning of the chapter, because the standard error of the sample mean is
SE(sample mean) = SD(box)/n½,
where n is the sample size. So, for example, the coverage probability of the random interval
[(sample mean) − 2×SD(box)/n½, (sample mean) + 2×SD(box)/n½]
is at least 75%.
Typically, however, the SD of the population is not known, so we cannot construct this interval. Moreover, we cannot use the conservative approach we used for percentages, because there is no upper bound on the SD of a general list of numbers analogous to the upper bound of 50% for the SD of lists that contain only zeros and ones.
However, the approximate approach, based on the normal curve, still works if the sample size is sufficiently large. The central limit theorem tells us that the probability histogram of the average of n draws with replacement from a box follows the normal curve increasingly well as the number of draws n increases. We also know that the sample standard deviation s is increasingly likely to be an accurate estimate of the SD of the population as n increases. As a result, the probability that the sample mean is within ±z×s/n½ of the population mean is approximately the same as the area under the normal curve between −z and z. For any fixed population (box), the approximation improves as the sample size n increases, for random sampling with replacement. The example illustrates calculating an approximate confidence interval for the population mean. The example is dynamic: It will tend to change when you reload the page.
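Separately from the interactive example, here is a static Python sketch of the same calculation. It assumes a large sample drawn with replacement. The box, the seed, and the function name are hypothetical. It uses the standard library's NormalDist to find the z whose central normal area equals the nominal confidence level.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def approx_mean_interval(data, level):
    """Approximate confidence interval for the population mean at the given nominal level."""
    n = len(data)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # area between -z and z equals level
    m = mean(data)                              # sample mean
    s = stdev(data)                             # sample SD, with divisor n - 1
    half_width = z * s / math.sqrt(n)
    return m - half_width, m + half_width

if __name__ == "__main__":
    rng = random.Random(1)
    box = [1, 2, 2, 3, 5, 8, 13]                # hypothetical box of numbers
    sample = rng.choices(box, k=100)            # 100 draws with replacement
    print("population mean:", mean(box))
    print("approximate 95% interval:", approx_mean_interval(sample, 0.95))
```

Whether the printed interval covers the population mean depends on the sample drawn; the nominal level describes the procedure in repeated sampling, not any single interval.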
The following exercise checks your ability to calculate approximate confidence intervals for the population mean. The exercise is dynamic: The question will tend to change when you reload the page.
We can also use a random sample with replacement to find a confidence interval for a percentile of a population. We shall work out the details for the median; other percentiles can be treated similarly. These intervals are rather different from the confidence intervals earlier in this chapter, which were of the form (estimate ± uncertainty). Instead, the endpoints of the intervals are two of the data. Also, while the first approach to confidence intervals for the population percentage was conservative, and the second approach was approximate, this approach leads to exact confidence intervals: The nominal coverage probability is equal to the actual coverage probability.
To begin, suppose we have a random sample of size 10
{X₁, X₂, …, X₁₀}
taken with replacement from a population with median m. Sort the data into increasing order: let X(1) be the smallest datum, X(2) be the second-smallest, etc., and let X(10) be the largest datum. (The sorted data are called the order statistics.) Let A₁ be the event that the fourth-smallest datum, X(4), is less than or equal to the median, and let A₂ be the event that the seventh-smallest datum, X(7), is greater than or equal to the median. The event A₁ occurs unless 7 or more data are greater than the population median, so A₁ᶜ is the event that 7 or more data are greater than the population median. Similarly, the event A₂ occurs unless 7 or more data are less than the population median, so A₂ᶜ is the event that 7 or more data are less than the population median. Let A = A₁A₂ be the event that the fourth and seventh order statistics bracket the median. We shall find a lower bound on the probability of A.
Note that if seven or more data are less than the median, then it is not the case that seven or more data are greater than the median, so A₁ᶜ and A₂ᶜ are disjoint. Hence,
P(Aᶜ) = P((A₁A₂)ᶜ)
= P(A₁ᶜ ∪ A₂ᶜ)
= P(A₁ᶜ) + P(A₂ᶜ),
and thus
P(A) = 1 − P(Aᶜ) = 1 − P(A₁ᶜ) − P(A₂ᶜ).
We are done if we can find upper bounds for P(A₁ᶜ) and P(A₂ᶜ).
Recall that the median is the smallest number that at least 50% of the population are less than or equal to. It follows that the probability that a number drawn at random from the population is strictly less than the median is at most 50% (and possibly less), and that the probability that a number drawn at random from the population is strictly greater than the median is at most 50% (and possibly less). The data are drawn from the population independently, so the number of data that are less than the population median has a Binomial probability distribution with n trials and p ≤ 50%, as does the number of data that are greater than the population median.
Let Y be a random variable with a Binomial distribution with parameters n = 10 and p = 50%. Thus P(A₁ᶜ) ≤ P(Y ≥ 7), and P(A₂ᶜ) ≤ P(Y ≥ 7). However, P(Y ≥ 7) = P(Y ≤ 3), so
P(A) ≥ 1 − P(Y ≤ 3 or Y ≥ 7) = P(4 ≤ Y ≤ 6).
Thus the probability that the interval [X(4), X(7)] contains the population median is at least as large as the probability of observing 4, 5, or 6 successes in 10 independent trials with probability 50% of success in each trial (the highlighted area in the accompanying figure).
The interval from the fourth-smallest datum to the seventh-smallest datum is therefore a confidence interval for the population median, with confidence level at least P(4 ≤ Y ≤ 6) = 672/1024 ≈ 65.6%.
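The binomial probability in this argument is easy to compute exactly. The sketch below evaluates P(4 ≤ Y ≤ 6) for n = 10 and p = 50%, and extracts the interval [X(4), X(7)] from a sample of size 10. The function names and the ten data values are made up for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for Y with a Binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def median_interval(data):
    """Interval [X(4), X(7)] from a sample of exactly 10 data."""
    if len(data) != 10:
        raise ValueError("this sketch is written for samples of size 10")
    x = sorted(data)          # order statistics X(1) <= ... <= X(10)
    return x[3], x[6]         # zero-based indices 3 and 6 are X(4) and X(7)

if __name__ == "__main__":
    bound = sum(binom_pmf(k, 10, 0.5) for k in (4, 5, 6))
    print("P(4 <= Y <= 6) =", bound)                 # 672/1024
    sample = [12, 5, 7, 3, 9, 15, 8, 1, 10, 6]       # hypothetical data
    print("interval:", median_interval(sample))
```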
The same idea can be used to find confidence intervals for other percentiles: The probability distribution of the number of data that are less than the 100×qth percentile is Binomial with number of trials equal to the number of data, n, and probability of success at most q, and the probability distribution of the number of data that are greater than the 100×qth percentile is Binomial with number of trials equal to the number of data, n, and probability of success at most 1−q.
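One way to read this generalization is sketched below. It is an extension of the median argument under the same assumptions (independent draws, the two "miss" events disjoint because a ≤ b), not a formula stated in the text, and the function names are hypothetical. For the interval [X(a), X(b)] and the 100×qth percentile, X(a) exceeds the percentile only if at least n − a + 1 data are greater than it, and X(b) falls below the percentile only if at least b data are less than it.

```python
from math import comb

def binom_upper_tail(c, n, p):
    """P(Y >= c) for Y with a Binomial distribution with parameters n and p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(max(c, 0), n + 1))

def percentile_coverage_bound(n, q, a, b):
    """Lower bound on the chance that [X(a), X(b)] contains the 100*q-th percentile."""
    if not 1 <= a <= b <= n:
        raise ValueError("need 1 <= a <= b <= n")
    miss_low = binom_upper_tail(n - a + 1, n, 1 - q)   # X(a) lies above the percentile
    miss_high = binom_upper_tail(b, n, q)              # X(b) lies below the percentile
    return 1 - miss_low - miss_high

if __name__ == "__main__":
    print(percentile_coverage_bound(10, 0.5, 4, 7))    # the median case worked above
    print(percentile_coverage_bound(20, 0.25, 2, 9))   # a hypothetical lower-quartile interval
```

With n = 10, q = 50%, a = 4, and b = 7, this reproduces the bound P(4 ≤ Y ≤ 6) from the median derivation.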
The following exercise checks whether you can find a confidence interval for a population median.
Suppose we have a procedure for calculating an interval from every possible sample of size n from a population of size N (a box of N numbered tickets). Let t be a parameter of the population. Suppose that if the procedure is applied to a random sample of size n, the chance that the resulting interval will contain t is P%. Then the interval that results from applying the procedure to any particular random sample of size n is a P% confidence interval for t. Once the random sample has been drawn, the resulting interval either covers (contains) or does not cover t: the probability that the interval covers t is either 0 or 100%. The probability that the interval will cover t before the sample is drawn is called the confidence level of the interval after the sample is drawn. Confidence intervals provide an alternative to reporting a single "best estimate" of a parameter and a summary measure of the uncertainty of the estimate. It is possible to construct conservative confidence intervals for the population percentage from simple random samples or random samples with replacement from 0-1 boxes: For a simple random sample of size n, the chance that the random interval
[φ−k×f/(2n½), φ+k×f/(2n½)]
covers the population percentage p is at least 1−1/k², where φ is the sample percentage, f is the finite population correction (N−n)½/(N−1)½, N is the population size, and n is the sample size. For random sampling with replacement, the chance that the random interval
[φ−k/(2n½), φ+k/(2n½)]
includes the population percentage p is at least 1−1/k². These are conservative procedures for constructing confidence intervals, because the probability that the intervals they produce cover the true population percentage p (the actual coverage probability) is greater than the probability they claim, 1−1/k² (the nominal coverage probability). These procedures can be extremely pessimistic, especially when the sample size n is large and when the true population percentage p is far from 50%; the intervals then are much wider than they need to be for the actual coverage probability to be 1−1/k².
Suppose that the random sample is drawn with replacement. When the sample size n is large, the central limit theorem ensures that the probability histogram of the sample percentage can be approximated accurately by the normal curve. The expected value of the sample percentage φ is p and the SE of the sample percentage is SD(box)/n½, where SD(box) is the population SD (p×(1−p))½, the SD of the list of numbers on the tickets in the box. When n is large, the SD of the sample, s*, tends to be an accurate estimate of SD(box), and the chance that the random interval
[φ−z×s*/n½, φ+z×s*/n½]
contains p is approximately equal to the area under the normal curve between −z and z. Taking z=1.96, for example, gives approximate 95% confidence intervals. The coverage probability of this procedure typically is not exactly the area under the normal curve between ±z, but as the sample size grows, the coverage probability approaches that area.
Approximate confidence intervals for the population mean can be constructed similarly, but then it is more common to use
s=s*×n½/(n−1)½
to estimate SD(box) than to use s*. Let M denote the sample mean. For random sampling with replacement, if the sample size n is large, the chance that the random interval
[M−z×s/n½, M+z×s/n½]
covers the population mean is approximately equal to the area under the normal curve between ±z. Again, the coverage probability is not exactly the area under the normal curve between ±z, but it approaches that area as the sample size grows.
Confidence intervals can be constructed for population parameters other than percentages and means. For example, one can construct confidence intervals for percentiles of a population using the fact that for random sampling with replacement, the number of data that are less than the 100×qth percentile has a binomial distribution with parameters n and p at most q, and the number of data that are greater than the 100×qth percentile has a binomial distribution with parameters n and p at most 1−q.