The Normal Curve, the Central Limit Theorem, and Markov's and Chebychev's Inequalities for Random Variables

In many situations it is not practical or not possible to calculate probabilities exactly. This chapter presents three ways to approximate the probability that a random variable falls in a particular range of values, i.e., to approximate the area of part of a probability histogram: the normal approximation, Markov's Inequality for random variables, and Chebychev's inequality for random variables.

The normal approximation approximates a probability by the area under part of a special curve, the normal curve. The appropriate part of the normal curve is found by transforming values of the random variable to standard units—the number of standard errors above the expected value. The normal approximation is accurate for many probability distributions. The Central Limit Theorem asserts that the normal approximations to the probability distributions of the sample sum and sample mean of independent random draws with replacement from a box of numbered tickets improve as the number of draws grows, no matter what numbers are on the tickets in the box.

The normal approximation is not accurate for every random variable. There are universal inequalities that limit the probability that a random variable falls in various ranges, even when the normal approximation is not accurate. The inequalities presented in this chapter are analogues of Markov's Inequality and Chebychev's Inequality for lists. Markov's Inequality for random variables limits the probability that a nonnegative random variable exceeds any multiple of its expected value. Chebychev's inequality for random variables limits the probability that a random variable differs from its expected value by any multiple of its SE. The normal approximation and Chebychev's inequality are the foundations of inferential techniques developed in subsequent chapters.

The Normal Approximation

Some probabilities are hard to compute exactly. For example, to calculate the chance of drawing 1,000 or fewer tickets labeled "1" in 10,000 draws without replacement from a 0-1 box that contains 100,000 tickets of which 30,000 are labeled "1" would require summing 1,001 terms, each of which involves ratios of combinations of 30,000, 70,000, and 100,000 things. Computing the terms without causing overflow or underflow is a challenge. Suppose we want to know the probability that the sample mean of independent random draws from a box of numbered tickets falls in some range, but we do not know the labels on all the tickets, only their mean and SD. Could we find the probability? This section develops a tool that allows us to solve both problems—approximately. The tool, called the normal approximation, involves two steps: transforming the range of values whose probability is sought into standard units, and finding the area under a special curve, the normal curve, over the transformed range. For some random variables, that area is close to the actual probability of the original range; for others, it is not. If the random variable is the sample sum or sample mean of a large number of independent random draws from a box of numbered tickets, the normal approximation tends to be accurate; the accuracy increases as the number of draws increases.

Standard Units for Random Variables

In an earlier chapter we learned to transform a list to standard units (standardize the list) by subtracting the mean of the list from each element of the list to get the list of deviations from the mean, then dividing the list of deviations from the mean by the standard deviation of the original list. A list element expressed in standard units is the number of standard deviations by which the element exceeds the mean. The mean of a list in standard units is zero, and the SD of a list in standard units is unity. Transforming to standard units is an affine transformation:

(standardized value) = ((original value) − (original mean))/(original SD)
                     = (original value)/(original SD) − (original mean)/(original SD).

Similarly, we can standardize a random variable X by subtracting its expected value E(X) and dividing by its standard error SE(X):

(X in standard units) = (X − E(X))/SE(X) = X/SE(X) − E(X)/SE(X).

This is an affine transformation of X. The expected value of a random variable in standard units is zero, and the SE of a random variable in standard units is unity.
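As a quick illustration, here is a minimal sketch in Python (using NumPy, which the text itself does not use; the list is an arbitrary example). It standardizes a short list and confirms that the result has mean zero and SD one. The same arithmetic applies to a random variable, with E(X) in place of the mean and SE(X) in place of the SD:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # an arbitrary list

# Transform to standard units: subtract the mean, divide by the SD.
# The SD here is the "population" SD (divide by n), matching the text.
standardized = (data - data.mean()) / data.std()

print(standardized.mean())  # 0.0, up to rounding error
print(standardized.std())   # 1.0
```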

The figure illustrates converting a value of a random variable to standard units. The example is dynamic: it tends to change whenever you reload the page.

The Normal Curve

The Normal Curve is the familiar "bell-shaped" curve many people associate with Statistics. It is also called the Gaussian curve. The figure shows a section of the normal curve from −5 to 5. The curve is positive everywhere, but gets small rather quickly as x moves away from 0: the area under the curve between minus infinity and −5 and between 5 and infinity is about 0.00000057. The scrollbars and text boxes in the figure let you highlight any range of values within ±5 and see the area under the normal curve for the highlighted range.

The normal curve has the form

y = (2×π)^−½ × e^(−x²/2).

In this definition, π is the ratio of the circumference of a circle to its diameter, 3.14159265…, and e is the base of the natural logarithm, 2.71828… .

The normal curve depends on x only through x². Because (−x)² = x², the curve has the same height y at x as it does at −x, so the normal curve is symmetric about x=0. The total area under the normal curve is unity, just as the total area under a histogram must be. Because the curve is symmetric, that implies that the area under the curve from minus infinity to 0 is the same as the area from 0 to infinity, namely, ½. Because the curve is symmetric around zero, it balances at zero (to use the analogy we used for the histogram). If you think of the normal curve as a histogram, it would correspond to a distribution whose mean equals zero. The SD of the normal curve, suitably defined, is unity.
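Assuming SciPy is available, these facts can be checked numerically; scipy.stats.norm implements the standard normal curve (norm.pdf is the height y, norm.cdf the area to the left of x, and norm.sf the area to the right):

```python
from scipy.stats import norm

# The curve is symmetric: same height at x and at -x.
print(norm.pdf(1.0), norm.pdf(-1.0))   # equal, about 0.2420

# Area below -5 plus area above 5: about 0.00000057.
print(norm.cdf(-5) + norm.sf(5))

# The total area under the curve is unity.
print(norm.cdf(-5) + (norm.cdf(5) - norm.cdf(-5)) + norm.sf(5))  # 1.0
```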

The normal curve turns out to be a good approximation to many probability histograms, in the sense that the area under the probability histogram over a given range of values is close to the area under the normal curve over that range of values transformed to the corresponding range of standard units, as we shall see presently.

The area under the normal curve between minus infinity and x is

100% − (area under the normal curve between x and infinity).

(This is essentially the Complement Rule—the area under the entire curve is 100%; that area is the sum of the area under the curve to the left of x and the area under the curve to the right of x.) By symmetry, for x≥0, the area under the normal curve between −x and +x is

100% − 2×(area under normal curve between x and infinity).

The following table contains some facts to commit to memory:

Areas under the normal curve

    x      area under the standard normal curve between ±x (approx.)
    1      0.68
    2      0.95
    3      0.997
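These memorized values can be reproduced with a few lines of Python (a sketch, assuming SciPy is available):

```python
from scipy.stats import norm

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)   # area between -k and +k
    print(k, round(area, 4))            # 0.6827, 0.9545, 0.9973
```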

Many books tabulate areas under the normal curve between ±x or between minus infinity and x for closely spaced values of x.

To find other areas, you can either use the normal curve applet (it is also available from the Tools page), or you can calculate the areas using the tabulated values and the facts above.

The following example illustrates this approach.

Suppose we want to know the area under the normal curve between 2 and infinity. That is half of the area between minus infinity and −2 and between 2 and infinity, by symmetry. That, in turn, is the entire area under the curve (namely, 1), minus the area between −2 and 2, which is 0.95, so the area we want is (½)×(1−0.95) = 0.025.

Similarly, the area under the curve between −1.5 and +2.3 is half the area between ±1.5 plus half the area between ±2.3.
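Both worked examples can be verified numerically. The sketch below (again assuming SciPy) computes each area twice: once by the symmetry argument in the text, and once directly:

```python
from scipy.stats import norm

# Area between 2 and infinity, via symmetry:
area_pm2 = norm.cdf(2) - norm.cdf(-2)     # area between -2 and 2, about 0.95
print(0.5 * (1 - area_pm2))               # about 0.025
print(norm.sf(2))                         # the same area, computed directly

# Area between -1.5 and +2.3, as half the area between ±1.5
# plus half the area between ±2.3:
half_15 = 0.5 * (norm.cdf(1.5) - norm.cdf(-1.5))
half_23 = 0.5 * (norm.cdf(2.3) - norm.cdf(-2.3))
print(half_15 + half_23)                  # about 0.9225
print(norm.cdf(2.3) - norm.cdf(-1.5))     # the same area, computed directly
```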

The Normal Approximation to Probability Histograms

The normal curve approximates many probability histograms accurately, in the following sense: The area under the probability histogram over any given range of values is close to the area under the normal curve for that range of values, if the range is measured in standard units.

When the normal curve approximates a probability histogram well, the expected value and the SE of the distribution are a nearly complete description of the distribution. More typically, the expected value and SE are not enough information to characterize a probability distribution well.

The figure lets us look at the empirical distribution of the sample sum or sample mean of draws at random with replacement from a box of tickets, this time with the normal curve superposed on the histogram of the values of the sample sum or sample mean. The scale for the normal curve corresponds to standard units, but only the original units are plotted.

When you first open this chapter, the box in the figure will contain 5 tickets, labeled "0," "1," "2," "3," and "4." The sample size will be set to 3, so each time you click the Take Sample button, the computer will draw three tickets at random, independently, with replacement from the box, and update the histogram to include the corresponding value of the sample sum. Click the Take Sample button a few times to get the feel of the tool, then increase the Samples to Take to 10,000, and take ten thousand samples of size 3. The histogram of the values of the sample sum will then approximate the probability distribution of the sample sum of 3 tickets drawn with replacement from the box reasonably well (according to the Law of Large Numbers). Use the scrollbars to highlight different parts of the histogram, and compare the area under the histogram with the area under the corresponding part of the normal curve. The two will agree roughly, but not extremely well.

Now change the Sample Size to 100 and repeat the experiment: draw 10,000 samples of size 100, so that the empirical distribution of the sample sum is a good approximation to the probability distribution of the sample sum. Compare the area under the histogram in various ranges with the area under the normal curve in the same ranges. You should find that they agree remarkably well.

Try changing the numbers in the box (delete the numbers and type new numbers over them), and repeat the experiment. Regardless of what numbers you put in the box, if the sample size is small, the normal curve does not approximate the distribution of the sample sum well, but if the sample size is large, the normal curve approximates the probability histogram of the sum of the draws quite well. Use the drop-down list to toggle from the sample sum to the sample mean and repeat the experiment; you should find that the results are the same. Both for the sample sum and for the sample mean, the sample size required for the normal curve to approximate the probability histogram with a specified accuracy depends on the numbers on the tickets in the box. Because the random variable is converted to standard units before comparing the histogram with the normal curve, the expected value and SE of the random variable do not by themselves affect the accuracy of the normal approximation—they cancel out of the approximation.

For example, if the normal approximation to the probability histogram of X is accurate, so is the normal approximation to the probability histogram of aX+b. Because the expected value of the sample mean is Ave(box) and the SE of the sample mean is SD(box)/n^½, this means that the accuracy of the normal approximation to the probability histogram of the sample mean does not depend directly on the mean or SD of the list of numbers on the tickets in the box. However, the accuracy depends critically on the skewness of the distribution of the numbers on the tickets in the box.
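The applet's experiment is easy to replicate in code. The sketch below is a minimal version in Python with NumPy and SciPy; the box contents, sample sizes, and the 2-SE cutoff are illustrative choices, not part of the text. It compares a tail probability of the sample sum with the corresponding normal area for a small and a large number of draws:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
box = np.array([0, 1, 2, 3, 4])                  # the tickets in the box

for n in (3, 100):                               # number of draws per sample
    draws = rng.choice(box, size=(10_000, n))    # draws with replacement
    sums = draws.sum(axis=1)                     # 10,000 sample sums
    ev = n * box.mean()                          # E(sample sum)
    se = np.sqrt(n) * box.std()                  # SE(sample sum)
    # Empirical chance that the sum is more than 2 SEs below its
    # expected value, versus the area under the normal curve below -2:
    print(n, np.mean(sums < ev - 2 * se), norm.cdf(-2))
```

For 3 draws the empirical chance (roughly 0.03) is noticeably larger than the normal area (about 0.023); for 100 draws the two agree much more closely.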

The Central Limit Theorem

The probability histograms of the sample sum and sample mean of n independent draws from a box of tickets labeled with numbers are approximated increasingly well by a normal curve as n increases, in the sense that the area under the histogram between a and b is increasingly close to the area under the normal curve between a converted to standard units and b converted to standard units.

For example, as n increases, the chance that the sample mean of n independent draws from a box of tickets labeled with numbers is between a and b converges to the area under the normal curve between

n^½×(a − Ave(box))/SD(box)    and    n^½×(b − Ave(box))/SD(box),

where Ave(box) is the mean of the labels on the tickets in the box, and SD(box) is the standard deviation of the labels on the tickets in the box.
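In code, this convergence can be seen by comparing the empirical chance that the sample mean lands in a range (a, b) with the normal area between the two standardized endpoints above. A sketch (the box and the interval are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
box = np.array([1, 1, 2, 5, 9])          # a skewed box, chosen arbitrarily
a, b = 2.5, 4.5                          # a range for the sample mean

for n in (10, 100, 1000):
    means = rng.choice(box, size=(10_000, n)).mean(axis=1)
    empirical = np.mean((means >= a) & (means <= b))
    lo = np.sqrt(n) * (a - box.mean()) / box.std()   # a in standard units
    hi = np.sqrt(n) * (b - box.mean()) / box.std()   # b in standard units
    print(n, round(empirical, 3), round(norm.cdf(hi) - norm.cdf(lo), 3))
```

As n grows, the two numbers printed on each line approach one another.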

The accuracy of the normal approximation to the probability distribution of the sample sum or sample mean of draws at random with replacement from a box of numbered tickets depends on the distribution of the numbers on the tickets in the box.

The accuracy is better when the distribution of numbers on the tickets in the box is symmetric (versus skewed), unimodal (versus multimodal), and smeared out (versus "chunky"). The accuracy does not depend directly on the number of tickets in the box or on the mean or SD of the numbers on the tickets.

For any given box of numbered tickets, the accuracy of the normal approximation tends to improve as the number of draws increases. But the number of draws needed to attain any particular level of accuracy depends on the distribution of numbers in the box.

The sample sum of n independent draws with replacement from a box that contains a fraction p of tickets labeled "1" and a fraction (1−p) of tickets labeled "0" is a special case. That sample sum has a binomial distribution with parameters n and p. When n is large, the Central Limit Theorem says the binomial probability histogram is approximated well by the normal curve after transforming the number of successes to standard units by subtracting the expected number of successes, np, and dividing by the SE of the number of successes, (np(1−p))^½.

The Continuity Correction

In approximating the probability histogram of a discrete random variable by the normal curve, it usually helps to adjust the endpoints of the range to reflect the possible values of the discrete random variable. For example, in approximating the binomial probability histogram by the normal curve, one can get more accurate answers by finding the area under the normal curve corresponding to a slightly different range, ending at half-integers, transformed to standard units. The improvement is easiest to see in the normal approximation to the chance of a single number of successes, rather than a range of numbers of successes.

Suppose we seek to approximate the chance of 10 successes in 25 independent trials, each with probability p = 40% of success, using the normal approximation. The number of successes has a binomial distribution with parameters n = 25 and p = 40%. The expected number of successes is np = 10, and the standard error of the number of successes is

(np(1−p))^½ = 6^½ ≈ 2.45.

The area under the normal curve at the point 10 successes, transformed to standard units, is zero: The area under a point is always zero.

We get a better approximation by considering 10 successes to be the range from 9½ to 10½ successes. The only possible number of successes between 9½ and 10½ is 10, so the two probabilities are equal for the binomial distribution. Because the normal curve is continuous and a binomial random variable is discrete, we need to "smear out" the binomial probability over an appropriate range. The lower endpoint of the range, 9½ successes, is (9.5 − 10)/2.45 ≈ −0.20 standard units. The upper endpoint of the range, 10½ successes, is (10.5 − 10)/2.45 ≈ +0.20 standard units. The area under the normal curve between −0.20 and +0.20 is about 15.8%. The true binomial probability is 25C10 × (0.4)^10 × (0.6)^15 ≈ 16%.

Similarly, if we seek the normal approximation to the probability that a binomial random variable is in the range from i successes to k successes, inclusive, we should find the area under the normal curve from i−½ to k+½ successes, transformed to standard units. To find the approximate probability of more than i successes and fewer than k successes, we should find the area under the normal curve corresponding to the range i+½ to k−½ successes, transformed to standard units. To find the approximate probability of more than i but no more than k successes, we should find the area under the normal curve corresponding to the range i+½ to k+½ successes, transformed to standard units. To find the approximate probability of at least i but fewer than k successes, we should find the area under the normal curve corresponding to the range i−½ to k−½ successes, transformed to standard units. Including or excluding the half-integer ranges at the ends of the interval in this manner is called the continuity correction.
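These rules are mechanical enough to write down as code. The following sketch (assuming SciPy; the function name is ours, not the text's) approximates the chance of at least i and at most k successes with the continuity correction, and checks it against the worked example above. Carrying full precision rather than the rounded 2.45 and ±0.20 used in the text gives about 16.2% instead of 15.8%:

```python
import numpy as np
from scipy.stats import binom, norm

def normal_approx(i, k, n, p):
    """Continuity-corrected normal approximation to the chance of
    at least i and at most k successes in n trials with chance p."""
    ev = n * p                        # expected number of successes
    se = np.sqrt(n * p * (1 - p))     # SE of the number of successes
    z_lo = (i - 0.5 - ev) / se        # widen the range half a success
    z_hi = (k + 0.5 - ev) / se        #   on each side, then standardize
    return norm.cdf(z_hi) - norm.cdf(z_lo)

# The example from the text: exactly 10 successes in 25 trials, p = 40%.
print(normal_approx(10, 10, 25, 0.4))   # about 0.162
print(binom.pmf(10, 25, 0.4))           # about 0.161, the exact chance
```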

The continuity correction is built into the figure, because you can only highlight regions ending at half-integers.

When is the normal approximation to the binomial reasonable? The approximation is best when n is large and p is near 50%. If n is small, the binomial probability histogram is too "chunky" to be approximated well by a smooth curve. If p is too close to 0% or to 100%, the binomial probability histogram is too skewed to be approximated well by the normal curve, which is symmetrical. For small values of n, you might as well just compute the exact binomial probability—calculating the normal approximation is about as difficult. The normal approximation is reasonably accurate when np > 5 and n(1 − p) > 5, if the range of values whose probability is sought is near the expected value. In the "tails" of the probability histogram, far from the expected value, the approximation generally is not as accurate. To make the approximation as accurate as possible, use the continuity correction. Note that for n = 10, p = 50%, the normal approximation with the continuity correction is accurate to about 0.4%.

The following exercises check your ability to use the normal approximation with the continuity correction.

The Normal Approximation to the Hypergeometric Distribution

Recall that the number of "good" objects in a simple random sample of n objects from a population of N objects of which G are good has the hypergeometric distribution with parameters N, G, and n. We saw in previous chapters that the expected value of a hypergeometric random variable is n×G/N, and the SE of a hypergeometric random variable is

((N−n)/(N−1))^½ × (n × (G/N) × (1 − G/N))^½.

If we transform to standard units using these values of the expected value and SE, the normal approximation to the resulting probability histogram is reasonably good if we use the continuity correction. Again, the approximation tends to be better near the expected value, and tends to be worse far from the expected value in the tails of the distribution.

In a certain population of 1000 students, 100 have driven a car above the speed limit in the last two weeks. We plan to take a simple random sample of 50 students.
(a) What is the expected number of students in the sample who exceeded the speed limit?
(b) What is the SE of the number of students in the sample who exceeded the speed limit?
(c) What is the chance that at least 4 and no more than 6 of the students in the sample are among the 100 who sped?
(d) What is the normal approximation to the chance in (c)?
(e) What is the chance that 10 or more in the sample are among the 100 who exceeded the speed limit?
(f) What is the normal approximation to the chance in (e)?

Solution.
(a) The expected number is n×G/N = 50×100/1000 = 5.
(b) The SE of the number is ((N−n)/(N−1))^½ × (n × G/N × (1 − G/N))^½ = (950/999)^½ × (50 × 0.1 × 0.9)^½ ≈ 2.07.
(c) The chance that between 4 and 6 students in the sample, inclusive, are among the 100 who sped is the sum of the chances that exactly 4, 5, or 6 students sped, because these possibilities are mutually exclusive and exhaustive.

P(at least 4 and no more than 6 students in the sample sped)
= P(exactly 4 in the sample sped) + P(exactly 5 in the sample sped) + P(exactly 6 in the sample sped)
= 100C4 × 900C46/1000C50 + 100C5 × 900C45/1000C50 + 100C6 × 900C44/1000C50
= 53.13%

We could also get this from the probability calculator in the figure. To solve the problem using the calculator, select Hypergeometric from the drop-down menu at the top of the figure; set the Population Size to 1,000, #Good in Population to 100, and Sample Size to 50. Tick the box in front of X≥ and set that limit to 4. Tick the box in front of X≤ and set that limit to 6. The display should show that the probability is 53.13%.

(d) To use the normal approximation, we first apply the continuity correction. The range "4 to 6 speeders in the sample" is the same range of possibilities as "3.5 to 6.5 speeders." In standard units, 3.5 transforms to (3.5 − 5)/2.068 ≈ −0.725 standard units, and 6.5 transforms to (6.5 − 5)/2.068 ≈ +0.725 standard units. The area under the normal curve between −0.725 and +0.725 is 53.2%.

In this case, the normal approximation is accurate to 0.07%, with a relative accuracy of (53.2% − 53.13%)/53.13% = 0.13%.

(e) The chance that 10 or more in the sample are among the 100 who sped is

100% − (chance that 9 or fewer in the sample are among the 100 who sped) =
100% − P(0 in sample sped) − P(1 in sample sped) − … − P(9 in sample sped) =
100% − 100C0 × 900C50/1000C50 − 100C1 × 900C49/1000C50 − … − 100C9 × 900C41/1000C50
= 2.14%.

To use the probability calculator to solve this problem, change the X≥ text box to 10 and un-check the box that precedes X≤. The display should show that the probability is 2.144%.

(f) We are interested in the area of the probability histogram from 10 to 50. Because we want to include 10, the continuity correction would have us start the range at 9.5. In standard units, the range of interest is thus from

(9.5 − 5)/2.068 ≈ 2.176 standard units

to infinity. The area under the normal curve for that range is shown in the figure to be about 1.5%, which has an absolute error of only 0.64%, but a relative error of (2.14 − 1.5)/2.14 ≈ 30%. The relative error of the normal approximation tends to be largest in the "tails."
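The whole example can be checked with SciPy's hypergeometric distribution (a sketch; note SciPy's argument convention for hypergeom: value, population size, number good, sample size):

```python
import numpy as np
from scipy.stats import hypergeom, norm

N_pop, G, n = 1000, 100, 50     # population size, # who sped, sample size
ev = n * G / N_pop                                          # (a): 5
se = np.sqrt((N_pop - n) / (N_pop - 1)
             * n * (G / N_pop) * (1 - G / N_pop))           # (b): about 2.07

# (c) Exact chance of at least 4 and no more than 6 speeders.
print(hypergeom.cdf(6, N_pop, G, n)
      - hypergeom.cdf(3, N_pop, G, n))                      # about 0.5313

# (d) Normal approximation with the continuity correction.
print(norm.cdf((6.5 - ev) / se) - norm.cdf((3.5 - ev) / se))  # about 0.532

# (e) Exact chance of 10 or more; (f) its normal approximation.
print(hypergeom.sf(9, N_pop, G, n))                         # about 0.0214
print(norm.sf((9.5 - ev) / se))                             # about 0.015
```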

Markov's and Chebychev's Inequalities for Random Variables

We have just seen that the normal curve is a good approximation to the probability distribution of some random variables, in particular, the sample sum and sample mean of a large number of independent draws from a box of tickets labeled with numbers. However, it is not a good approximation to every probability distribution, and typically it is not possible to tell whether or not the normal approximation is accurate without computing something that requires knowing the distribution of the random variable. In contrast, Markov's inequality is true for every random variable that must be at least zero, and Chebychev's inequality is true for every random variable that has finite expected value and finite standard error.

In an earlier chapter we saw that Markov's Inequality for a list says that if every element of a list is zero or larger, then for every a>0,

(fraction of elements in the list that are greater than or equal to a) ≤ (mean of list)/a.

Markov's inequality for random variables is directly analogous: if the chance that X≥0 is 100% (that is, if X is a nonnegative random variable), then for every a>0,

P(X ≥ a) ≤ E(X)/a.

Chebychev's inequality for lists says that

(fraction of elements in the list that are k or more SDs away from the mean of the list) ≤ 1/k².

Chebychev's inequality for random variables is again analogous:

P(X is k or more SEs away from E(X)) ≤ 1/k².

Equivalently,

P(|X − E(X)| ≥ k×SE(X)) ≤ 1/k².

Chebychev's inequality can be derived from Markov's inequality. The Law of Large Numbers, introduced in an earlier chapter, can be proved using Chebychev's Inequality.
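A simulation makes the inequalities concrete. The sketch below (Python with NumPy; the random variable and the values of a and k are arbitrary illustrative choices) estimates the two probabilities by long-run frequencies over 100,000 realizations and compares them with the bounds:

```python
import numpy as np

rng = np.random.default_rng(2)

# A nonnegative random variable: the square of one draw from a box.
box = np.array([0, 1, 2, 3, 4])
x = rng.choice(box, size=100_000).astype(float) ** 2

ev = x.mean()   # long-run average approximates E(X) = 6
se = x.std()    # approximates SE(X), about 5.9

# Markov's inequality: P(X >= a) <= E(X)/a for a > 0.
a = 10.0
print(np.mean(x >= a), "<=", ev / a)                      # about 0.2 <= 0.6

# Chebychev's inequality: P(|X - E(X)| >= k*SE(X)) <= 1/k^2.
k = 1.5
print(np.mean(np.abs(x - ev) >= k * se), "<=", 1 / k**2)  # about 0.2 <= 0.44
```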

The following exercises check your ability to apply Markov's Inequality and Chebychev's Inequality for random variables. In some problems, it is possible to apply both Markov's Inequality and Chebychev's Inequality. When that happens, you should use the better bound. Recall that one upper bound is better than another upper bound if it is smaller, and one lower bound is better than another lower bound if it is larger. The exercises are dynamic: The wording and the data tend to change when you reload the page.


Summary

The normal curve, y = (2π)^−½ × e^(−x²/2), is symmetric about zero, where it has a single bump. The total area under the normal curve is 100%. The area under the normal curve between ±1 is about 68%; the area under the normal curve between ±1.96 is about 95%; and the area under the normal curve between ±3 is about 99.7%. Standard units for random variables are analogous to standard units for lists. A value of a random variable in standard units is the number of SEs by which it exceeds the expected value of the random variable; the value of an element of a list in standard units is the number of SDs by which it exceeds the mean of the list. To transform a random variable X to standard units, subtract E(X) and divide the result by SE(X). Just as the mean of a list in standard units is zero and the SD of a list in standard units is 1, the expected value of a random variable in standard units is 0 and the SE of a random variable in standard units is 1.

The probability distributions of some random variables can be approximated by the normal curve, in the sense that the area under the probability histogram for any range of values of the random variable is approximately equal to the area under the normal curve for the same range of values transformed to standard units. This is called the normal approximation to the probability distribution. The normal approximation to the probability distributions of the sample sum and sample mean of n independent random draws with replacement from a box of numbered tickets grows increasingly accurate as n increases. This is the Central Limit Theorem. The accuracy of the normal approximation to the distributions of the sample sum and sample mean depends not only on n—it depends on the numbers on the tickets too. In general, the more skewed the distribution of the numbers on the tickets in the box, the larger n must be for the normal approximation to have a given accuracy. In the special case of a 0-1 box, the Central Limit Theorem implies that binomial probability distributions can be approximated increasingly well by a normal curve as n grows. How large n must be for the normal approximation to have a particular accuracy depends on p, the fraction of tickets in the box labeled "1": to attain a given level of accuracy, n can be smaller if p is close to 50% than if p is close to 0% or to 100%. The normal approximation to the hypergeometric distribution is accurate if the sample size n is large, but still small compared with the population size N. The approximation is most accurate when G/N, the fraction of tickets in the box labeled "1," is close to 50%.

Even when the normal approximation to a probability distribution is not accurate, the expected value and SE of the random variable contain a great deal of information about the probability distribution of the random variable, just as the mean and SD of a list contain a great deal of information about the distribution of the list. Markov's Inequality and Chebychev's Inequality express some of that information. Markov's inequality for random variables says that if P(X≥0)=100% (if X is a nonnegative random variable), and the expected value of X is finite, then for any a>0,

P(X≥a) ≤ E(X)/a.

Chebychev's Inequality for random variables says that if SE(X) is finite, then for any k>0,

P(|X − E(X)| ≥ k×SE(X)) ≤ 1/k².

The normal approximation and Chebychev's Inequality will be used later in the book to draw inferences about populations from random samples.

Key Terms