An earlier chapter introduced a conceptual framework for statistical hypothesis testing, and another presented important statistical considerations for determining whether a treatment has an effect. Treatment is meant loosely: it could be a drug, an advertising campaign, a car wax, a test preparation course, a fertilizer, etc. The best way to determine whether a treatment has an effect is to use the method of comparison in an experiment in which subjects are assigned at random to the treatment group or the control group.

When the measurement of each subject can be represented by 0 or 1 (e.g., subject's condition improves or not, subject buys something or not, subject clicks a link or not, subject passes an exam or not), deciding whether the treatment has an effect is essentially testing the null hypothesis that two percentages are equal—which is the problem this chapter addresses.

Different ways of drawing samples lead to different tests. In one sampling design (the randomization model), the entire collection of subjects is allocated randomly between treatment and control, which makes the samples dependent. Conditioning on the total number of ones in the treatment and control groups leads to Fisher's exact test, which is based on the hypergeometric distribution of the number of ones in the treatment group if the null hypothesis is true. When the sample sizes are large, calculating the rejection region for Fisher's exact test is cumbersome, but the normal approximation to the hypergeometric distribution gives an approximate test: a test whose significance level is approximately what it claims to be.

In a second sampling design (the population model), the two samples are independent random samples with replacement from two populations; conditioning on the total number of ones in the two samples again leads to Fisher's exact test, which can be approximated as before.

There is another approximate approach to testing the null hypothesis in the population model: If the sample sizes are large (but the samples are drawn with replacement or are small compared to the two population sizes), the normal approximation to the distribution of the difference between the two sample percentages tends to be accurate. If the null hypothesis is true, the expected value of the difference between the sample percentages is zero, and the SE of the difference in sample percentages can be estimated by pooling the two samples. That allows one to transform the difference of sample percentages approximately into standard units, and to base an hypothesis test on the normal approximation to the probability distribution of the approximately standardized difference. Surprisingly, the resulting approximate test is essentially the normal approximation to Fisher's exact test, even though the assumptions of the two tests are different.

Fisher's Exact Test for an Effect—Dependent Samples

Suppose we own a start-up company that offers e-tailers a service for targeting their Web advertising. Consumers register with our service by filling out a form indicating their likes and dislikes, gender, age, etc. We store cookies on each consumer’s computer to keep track of who he is. When a consumer with one of our cookies visits the Web site of any of our clients, we use the consumer's likes and dislikes to select (from a collection of the client's ads) the advertisement we think he is most likely to respond to. This is called targeted advertising. The targeting service is free to consumers; we charge the e-tailers. We can raise venture capital if we can show that targeting makes e-tailers' advertisements more effective.

We offer our service free to a large e-tailer. The e-tailer has a collection of advertisements that it usually uses in rotation: Each time a consumer arrives at the site, the server selects the next ad in the sequence to show to the consumer; the cycle starts over when all the ads have been shown.

To test whether targeting works, we implement a randomized, controlled, blind experiment by installing our software on the e-tailer's server to work as follows: Each time a consumer arrives at the site, with probability 50% the server shows the consumer the ad our targeting software selects, and with probability 50% the server shows the consumer the next ad in the rotation—the control ad. The decision of whether to show a consumer the targeted ad or the control ad is independent from consumer to consumer. For each consumer, the software records which strategy was used (target or rotation), and whether the consumer buys anything. The consumers who were shown the targeted ad comprise the treatment group; the other consumers comprise the control group. If a consumer visits the site more than once during the trial period, we ignore all of that consumer's visits but the first. Each subject (consumer) is assigned at random either to treatment or to control, and no subject knows which group he is in, so this is a controlled, randomized, blind experiment. There is no subjective element to determining whether a subject purchased something, so the lack of a double blind does not introduce bias.

Suppose that N consumers visit the site during the trial, that n_t of them are assigned to the treatment group, that n_c of them are assigned to the control group, and that G of the consumers buy something. (The mnemonic is that c stands for control, t for treatment, and G for the number of good customers, that is, customers who buy something.) We want to know whether the targeting affects whether subjects buy anything. Only some of the consumers see the targeted ad, and only some see the control ad, so answering this question involves hypothetical counterfactual situations: what would have happened had all the consumers been shown a targeted ad, and what would have happened had all the consumers been shown a control ad. We treat the N consumers as a fixed group, without regard for how they were drawn from the more general population of people who shop online. Any conclusions we draw about the consumers who visited the site might not hold for the general population: We should be wary of extrapolating the results to consumers who were not in the sample unless we know that the randomized group is itself a random sample from the larger population. This set-up, in which the N subjects are a fixed group and the only random element is in allocating some of the subjects to the treatment group and the rest to the control group, is called the randomization model. Later in this chapter we consider a population model, in which the treatment group and the control group are random samples from a much larger population. In the population model, the null hypothesis will be slightly different, but we shall be able to extrapolate the results from the samples to the populations from which they were drawn, because they were drawn at random.

We can think of the experiment in the following way: The ith consumer has a ticket with two numbers on it: The first number, c_i, is 1 if the consumer would have bought something if shown the control ad, and 0 if not. The second number, t_i, is 1 if the consumer would have bought something if shown the targeted ad, and 0 if not. There are N tickets in all. Under the null hypothesis that targeting has no effect, t_i=c_i for each i=1, 2, … , N. That is, each consumer either will buy or will not buy, regardless of the ad he is shown: Whether he will buy is determined before he is assigned to treatment or control.

For the ith consumer, we observe either c_i or t_i, but not both. The percentage of consumers who would have purchased something if every consumer had been shown the control ads is

p_c = (c_1 + c_2 + … + c_N)/N.

The percentage of consumers who would have bought something if all had been shown the targeted ads is

p_t = (t_1 + t_2 + … + t_N)/N.

Let

μ = p_t − p_c

be the difference between the percentage of consumers who would have bought had all been shown the targeted ad, and the percentage of consumers who would have bought had all been shown the control ad. Under the null hypothesis that targeting does not make a difference, t_i = c_i for all i=1, 2, … , N. Thus if the null hypothesis is true, μ = 0, but the hypothesis that μ = 0 is weaker than the null hypothesis: If μ ≠ 0, the null hypothesis is false, but the null hypothesis can be false and yet still μ = 0. (That occurs if the number of consumers who would have bought something if all had been shown the targeted ads is equal to the number of consumers who would have bought something if all had been shown the control ads, but the purchases were made by a different subset of consumers.) The alternative hypothesis, that targeting helps, is that μ > 0. We would like to test the null hypothesis at significance level 5%.
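To see concretely how μ can equal zero while the null hypothesis is false, here is a small sketch. The four tickets are invented for illustration; they are not data from the advertising experiment.

```python
# Hypothetical tickets (c_i, t_i) for N = 4 consumers; the values are made up.
tickets = [(1, 0), (0, 1), (1, 1), (0, 0)]

N = len(tickets)
p_c = sum(c for c, t in tickets) / N   # fraction who would buy if shown the control ad
p_t = sum(t for c, t in tickets) / N   # fraction who would buy if shown the targeted ad
mu = p_t - p_c

# The null hypothesis requires t_i == c_i for every consumer, not just equal totals.
null_holds = all(c == t for c, t in tickets)

print("mu =", mu, "| null hypothesis holds:", null_holds)
```

The first two consumers respond differently to the two ads, so the null hypothesis fails, yet the same number of consumers would buy under either ad, so μ = 0.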

Let X_t be the number of sales to consumers in the treatment group, the sum of the observed values of t_i. If the null hypothesis is true, the same G consumers would have bought whether they were assigned to treatment or to control, and the number of the consumers in the treatment group who bought something is the number of those G in a simple random sample of size n_t from the population of N consumers. Thus, for any fixed values of N, G, and n_t, X_t has an hypergeometric distribution with parameters N, G, and n_t.
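The hypergeometric probabilities can be computed directly from binomial coefficients. The sketch below is a minimal illustration; the function name `hypergeom_pmf` and the numbers N = 20, G = 8, n_t = 10 are assumptions chosen for the example.

```python
from math import comb

def hypergeom_pmf(x, N, G, n_t):
    """Chance that a simple random sample of size n_t from N tickets,
    of which G are labeled 1, contains exactly x tickets labeled 1."""
    # Outside the possible range of X_t the chance is zero.
    if x < max(0, n_t - (N - G)) or x > min(G, n_t):
        return 0.0
    return comb(G, x) * comb(N - G, n_t - x) / comb(N, n_t)

# Hypothetical experiment: 20 consumers, 8 buyers, 10 assigned to treatment.
N, G, n_t = 20, 8, 10
pmf = [hypergeom_pmf(x, N, G, n_t) for x in range(n_t + 1)]

total = sum(pmf)                                   # should be 1 up to rounding
mean = sum(x * p for x, p in enumerate(pmf))       # should be n_t * G / N
print(total, mean)
```

The mean of this distribution agrees with n_t × G/N, the expected number of buyers in the treatment group under the null hypothesis.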

If the alternative hypothesis μ > 0 is true, X_t tends to be larger than it would if the null hypothesis is true, so we should design our test to reject the null hypothesis for large values of X_t. That is, our rejection region should contain all values exceeding some threshold value x₀: We will reject the null hypothesis if X_t > x₀. We need to pick x₀ so that the test has the desired significance level, 5%.

We cannot calculate the threshold value x₀ until we know N, n_t, and G. Once we observe them, we can find the smallest value x₀ so that the probability that X_t is larger than x₀ if the null hypothesis is true is at most 5%, the chosen significance level. Our rule for testing the null hypothesis then is to reject the null hypothesis if X_t > x₀, and not to reject the null hypothesis otherwise. This is called Fisher's exact test for the equality of two percentages (against a one-sided alternative). The test is called exact because its probability of a Type I error can be computed exactly.
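The search for the threshold x₀ can be sketched in a few lines. This is an illustration under assumed inputs, not a prescribed implementation; the function names and the example values N = 20, G = 8, n_t = 10 are hypothetical.

```python
from math import comb

def upper_tail(x0, N, G, n_t):
    """P(X_t > x0) when X_t is hypergeometric with parameters N, G, n_t."""
    lo = max(x0 + 1, 0, n_t - (N - G))   # smallest possible value above x0
    hi = min(G, n_t)                     # largest possible value of X_t
    ways = sum(comb(G, x) * comb(N - G, n_t - x) for x in range(lo, hi + 1))
    return ways / comb(N, n_t)

def fisher_threshold(N, G, n_t, level=0.05):
    """Smallest x0 such that P(X_t > x0) <= level under the null hypothesis."""
    x0 = 0
    while upper_tail(x0, N, G, n_t) > level:
        x0 += 1
    return x0

# Hypothetical experiment: 20 consumers, 8 buyers, 10 assigned to treatment.
x0 = fisher_threshold(20, 8, 10)
print("reject if X_t >", x0)
```

Because X_t takes only integer values, the chance of exceeding x₀ is usually strictly less than 5%, not exactly 5%.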

The Normal Approximation to Fisher's Exact Test

If N is large and n_t is neither close to zero nor close to N, computing the hypergeometric probabilities will be difficult, but the normal approximation to the probability distribution of X_t should be accurate provided G is neither too close to zero nor too close to N. To calculate the normal approximation, we need to convert X_t to standard units, which requires that we know the expected value and SE of X_t. The expected value of X_t is

E(X_t) = n_t × G/N.

The SE of X_t is

SE(X_t) = f × n_t^½ × SD,

where f is the finite population correction (N−n_t)^½/(N−1)^½ and SD is the standard deviation of a list of N values of which G equal 1 and (N−G) equal 0:

SD = (G/N × (1 − G/N))^½.

In standard units, X_t is

Z = (X_t − E(X_t))/SE(X_t) = (X_t − n_t×G/N)/(f × n_t^½ × SD).

Recall that we want to test the null hypothesis at significance level 5%. The area under the normal curve to the right of 1.645 standard units is 5%, which corresponds to the threshold value

x₀ = E(X_t) + 1.645×SE(X_t) = n_t×G/N + 1.645 × f × n_t^½ × SD

in the original units. If we reject the null hypothesis when Z > 1.645 (equivalently, when X_t > x₀), the result is an (approximate) 5% significance level test of the null hypothesis that targeting has no effect, against the alternative hypothesis that targeting increases the fraction of consumers who buy. This is the normal approximation to Fisher's exact test; Z is called the Z-statistic, and the observed value of Z is called the z-score. Another chapter examines in more generality the normal approximation to test statistics transformed approximately to standard units. Later in this chapter we shall see another example, the z test for equality of two percentages from independent samples.
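As a rough sketch of the calculation, the code below converts X_t to standard units using E(X_t) = n_t × G/N and SE(X_t) = f × n_t^½ × SD, with f the finite population correction (N−n_t)^½/(N−1)^½. The function name `fisher_z` and the trial counts are hypothetical.

```python
from math import sqrt

def fisher_z(x_t, N, G, n_t):
    """z-score of X_t under the normal approximation to Fisher's exact test."""
    expected = n_t * G / N                    # E(X_t) under the null hypothesis
    f = sqrt((N - n_t) / (N - 1))             # finite population correction
    sd = sqrt((G / N) * (1 - G / N))          # SD of a 0-1 list with G ones among N
    se = f * sqrt(n_t) * sd                   # SE(X_t)
    return (x_t - expected) / se

# Hypothetical trial: 1000 consumers, 500 treated, 100 buyers, 62 of them treated.
z = fisher_z(62, N=1000, G=100, n_t=500)
reject = z > 1.645                            # one-sided test at approximate level 5%
print(round(z, 3), reject)
```

With these made-up counts the z-score is a little over 2.5, so the approximate test would reject the null hypothesis at level 5%.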

To test at a significance level other than 5%, reject when Z exceeds a different threshold; choose the threshold so that the area under the normal curve to the right of the threshold is equal to the desired significance level. For example, to test at significance level 10%, reject if Z>1.282. Some new notation will make it easier to express the general strategy. Recall that the normal curve is positive, and the total area under the normal curve is 100%, so the normal curve is like a histogram. Quantiles of the normal curve are defined as follows: For any number α between 0 and 100%, the α quantile of the normal curve, z_α, is the unique number such that the area under the normal curve to the left of z_α is α. For example, z_50% = 0, because the area under the normal curve to the left of zero is 50%. Other commonly used quantiles of the normal curve include z_90% ≈ 1.282, z_95% ≈ 1.645, and z_97.5% ≈ 1.960.

Because the normal curve is symmetric about zero, z_100%−α = −z_α. Note that the area under the normal curve between z_α and z_100%−α is 100% − 2×α. Combining these two results shows that the area under the normal curve over the interval

[−z_100%−α/2, z_100%−α/2]

is 100% − α, and thus the area under the normal curve outside the interval (the area under the normal curve over the complement of the interval) is α. This complement also can be written

{z : |z| > z_100%−α/2}.

With this notation for quantiles of the normal curve, it is easier to write down the rejection region of the normal approximation to Fisher's exact test for a general significance level: The significance level of the rule {Reject the null hypothesis if Z>z_100%−α} is approximately α.
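Quantiles of the normal curve are easy to compute numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_quantile` is an assumption, and it takes α as a fraction between 0 and 1 rather than as a percentage.

```python
from statistics import NormalDist

def z_quantile(alpha):
    """The alpha quantile of the standard normal curve, with alpha given as a
    fraction strictly between 0 and 1 (0.95 means 95%)."""
    return NormalDist().inv_cdf(alpha)

for alpha in (0.50, 0.90, 0.95, 0.975):
    print(alpha, round(z_quantile(alpha), 3))

# Symmetry of the normal curve: the (1 - alpha) quantile is minus the alpha quantile.
print(z_quantile(0.05), -z_quantile(0.95))
```

To test at approximate significance level α with the one-sided rule, compare Z with `z_quantile(1 - alpha)`.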

Testing the Equality of Two Percentages Using Independent Samples

In the experiment to test the effectiveness of targeted advertising using the randomization model described previously in this chapter, the samples from the populations of control and treatment values are dependent: Individual i has two numbers, c_i and t_i, and if we observe c_i we cannot observe t_i, and vice versa. If individual i is in the treatment group, he or she is not in the control group, and vice versa. Under the null hypothesis, the purchasers would have bought whether they were assigned to treatment or to control, and the non-purchasers would not have bought whether they were assigned to treatment or control, so the total number of purchasers does not depend on which consumers were assigned to treatment. That constancy led to an hypergeometric distribution for the number of purchasers in the treatment group under the null hypothesis. In this section, we see that Fisher's exact test allows us to test a slightly weaker null hypothesis when the data are two independent random samples with replacement from separate populations, a control group and a treatment group. This is the population model for comparing two percentages.

The weaker hypothesis is that the population percentage of the treatment group is equal to the population percentage of the control group. We also develop an approximate test for the equality of two percentages based on the sample percentages of independent random samples with replacement from two populations. The approximate test is essentially equivalent to the normal approximation to Fisher's exact test when the sample sizes are large.

Fisher's Exact Test Using Independent Samples

Suppose there are two populations of tickets labeled 0 and 1, a control group and a treatment group, with corresponding population percentages p_c and p_t. We want to test the null hypothesis that p_t = p_c.

We draw a random sample of size n_c with replacement from the control group, and compute the sample sum X_c. X_c has a binomial distribution with parameters n_c and p_c. We draw another random sample of size n_t with replacement from the treatment group, and compute the sample sum X_t. X_t has a binomial distribution with parameters n_t and p_t. We draw the two random samples independently of each other, so X_c and X_t are independent random variables. This scenario could correspond to an observational study, to a non-randomized experiment, or to a randomized experiment, depending upon how individuals came to be in the treatment group and the control group. The randomness in the problem at this point is in drawing the samples from the control group and the treatment group, not in assigning subjects to treatment or to control; that assignment occurred before we arrived on the scene. In this population model, we might be able to conclude from the data that the population percentages differ for the treatment and control groups (that p_t ≠ p_c), but even then we should not conclude that treatment has an effect unless the assignment of subjects to treatment and control was randomized. Otherwise, any real difference between p_t and p_c could be the result of confounding, rather than the result of the treatment.

In contrast, in the randomization model described earlier in the chapter, we might be able to conclude that the treatment has an effect for the N subjects in the randomization, but even then we should not extrapolate from those N subjects to conclude that treatment has an effect in the larger population from which the subjects were drawn, because we did not know how they were drawn.

Let N = n_c + n_t. Let G = X_c+X_t be the sum of both the samples, that is, the total number of ones. Given the value of G, the distribution of X_t is hypergeometric with parameters N, G, and n_t if the null hypothesis is true. This is proved in a footnote, but here is an explanation: If the null hypothesis is true, there is no difference between drawing with replacement from the treatment group and drawing with replacement from the control group, so every way of allocating the N observed values into a group of size n_t and a group of size n_c is equally likely. There are _NC_{n_t} such ways, of which

_GC_x × _{N−G}C_{n_t−x}

result in x ones and n_t − x zeros among the n_t values drawn from the treatment group, so the chance that X_t=x is

_GC_x × _{N−G}C_{n_t−x} / _NC_{n_t}.

That is, given the value of G, X_t has an hypergeometric distribution with parameters N, G, and n_t if the null hypothesis is true. Thus, under the null hypothesis that p_c=p_t, given the total number G of ones in the sample, the test statistic X_t has the same distribution for this sampling design that the test statistic X_t did for a population of N subjects assigned randomly to treatment or control. Therefore, the same testing procedure, Fisher's exact test, can be used to test the null hypothesis that p_c=p_t using independent random samples from two populations. In the previous section, the null hypothesis and the sampling design were different: Each subject i had two values, t_i and c_i; the null hypothesis was that

t_i = c_i for all i = 1, 2, … , N,

and each of the N subjects was assigned at random either to treatment or to control.
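The claim that, given G, the distribution of X_t is hypergeometric when p_c = p_t can be checked numerically. The sketch below is an illustration only: the function names, the sample sizes, and the value of the common percentage p are assumptions.

```python
from math import comb

def binom_pmf(x, n, p):
    """Chance of x ones in n independent draws with replacement from a 0-1 box."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def conditional_pmf(x, g, n_t, n_c, p):
    """P(X_t = x | X_t + X_c = g) for independent binomial sample sums
    that share the same population percentage p."""
    def joint(k):
        return binom_pmf(k, n_t, p) * binom_pmf(g - k, n_c, p)
    possible = range(max(0, g - n_c), min(g, n_t) + 1)
    return joint(x) / sum(joint(k) for k in possible)

def hypergeom_pmf(x, N, G, n_t):
    return comb(G, x) * comb(N - G, n_t - x) / comb(N, n_t)

# Hypothetical design: n_t = 6, n_c = 9, and we condition on G = 5 ones in all.
n_t, n_c, g = 6, 9, 5
max_gap = max(
    abs(conditional_pmf(x, g, n_t, n_c, p) - hypergeom_pmf(x, n_t + n_c, g, n_t))
    for p in (0.3, 0.7)
    for x in range(0, g + 1)
)
print("largest discrepancy:", max_gap)
```

The discrepancy is at the level of rounding error for both values of p, which reflects the fact that the conditional distribution does not depend on the unknown common percentage.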

As noted previously, it is hard to perform the calculations needed to find the rejection region for this test when N is large; the normal approximation to Fisher's exact test described in the previous section is a computationally tractable way to construct the rejection region. The approximation is accurate under the same assumptions.

The Z Test for the Equality of Two Percentages using Independent Samples

In this section, we develop another approximate test of the null hypothesis that p_t=p_c in the population model; it turns out that this test is essentially the same as the normal approximation to Fisher's exact test, although it is motivated quite differently.

Let φ_c be the sample percentage of the random sample from the control group, and let φ_t be the sample percentage of the random sample from the treatment group. Suppose that the two sample sizes n_c and n_t are large (say, over 100 each). Then the normal approximations to the two sample percentages should be accurate (provided neither p_c nor p_t is too close to 0 or to 1). The expected value of the sample percentage of a random sample with replacement is the population percentage, so the expected value of φ_c is p_c, and the expected value of φ_t is p_t. The SE of φ_c is

SE(φ_c) = (p_c×(1−p_c))^½/n_c^½,

and the SE of φ_t is

SE(φ_t) = (p_t×(1−p_t))^½/n_t^½.

Let φ_t−c = φ_t − φ_c be the difference between the two sample percentages. Because the samples from the treatment and control groups are independent of each other, φ_t and φ_c are independent, so the SE of φ_t−c is

SE(φ_t−c) = (SE²(φ_t) + SE²(φ_c))^½ = (p_t×(1−p_t)/n_t + p_c×(1−p_c)/n_c)^½.

If the null hypothesis is true, the two population percentages are equal (p_t=p_c=p) and the two samples are like one larger sample from a single 0-1 box with a percentage p of tickets labeled "1." Let us call that box the null box. If the null hypothesis is true, the expected value of φ_t−c, E(φ_t−c), is zero, and

SE(φ_t−c) = (1/n_t + 1/n_c)^½ × (p×(1−p))^½.

The first factor depends only on the sample sizes n_t and n_c, which we know. The second factor is the SD of the labels on the tickets in the null box. That factor depends only on p, the percentage of tickets labeled "1" in the null box. We do not know p, so we do not know the SD of the null box. However, we can use the bootstrap estimate of the SD of the null box because the sample size is large: let φ be the pooled sample percentage

φ = (X_t + X_c)/(n_t + n_c) = (n_t×φ_t + n_c×φ_c)/(n_t + n_c).

The pooled bootstrap estimate of the SD of the null box is the estimate we get by pretending that the percentage of ones in the null box is equal to the percentage of ones in the pooled sample:

s* = (φ×(1−φ))^½.

If the sample sizes are large and the null hypothesis is true, this will tend to be close to the true SD of the null box, and

SE*(φ_t−c) = (1/n_t + 1/n_c)^½ × s*

will tend to be quite close to SE(φ_t−c). The normal approximation to the probability distribution of φ_t−c tells us that the chance that φ_t−c is in a given range is approximately equal to the area under the normal curve for the same range, converted to standard units. Under the null hypothesis, the expected value of φ_t−c is zero, and SE(φ_t−c) is approximately SE*(φ_t−c), so

Z = φ_t−c/SE*(φ_t−c) = (φ_t − φ_c)/((1/n_t + 1/n_c)^½ × s*)

is approximately φ_t−c in standard units: The chance that Z is in the range of values [a, b] is approximately the area under the normal curve between a and b.

Under the alternative hypothesis that p_t>p_c, Z will tend to be larger than it would under the null hypothesis. We can test the null hypothesis that p_t=p_c against the one-sided alternative hypothesis that p_t>p_c using

Z = (φ_t − φ_c)/SE*(φ_t−c)

as the test statistic. To test at approximate significance level α, reject the null hypothesis if Z > z_1−α.

This is called the (one-sided) z test for equality of two percentages using independent samples. The random variable Z is called the Z-statistic, and the observed value of Z is called the z-score. To test the null hypothesis against the other one-sided alternative hypothesis that p_t<p_c at approximate significance level α, reject the null hypothesis if Z<z_α. To test the null hypothesis against the two-sided alternative hypothesis that p_t≠p_c at approximate significance level α, reject when |Z| > z_1−α/2.
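The three versions of the z test can be sketched as follows. The code assumes the Z-statistic is the difference of sample percentages divided by the pooled estimate of its SE, (1/n_t + 1/n_c)^½ × (φ×(1−φ))^½, where φ is the pooled sample percentage; the function names, the `alternative` labels, and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x_t, n_t, x_c, n_c):
    """z-score for the difference of two sample percentages, pooled SE estimate."""
    phi_t, phi_c = x_t / n_t, x_c / n_c
    phi = (x_t + x_c) / (n_t + n_c)             # pooled sample percentage
    sd_est = sqrt(phi * (1 - phi))              # pooled estimate of the SD of the null box
    se_est = sqrt(1 / n_t + 1 / n_c) * sd_est   # estimated SE of the difference
    return (phi_t - phi_c) / se_est

def z_test(x_t, n_t, x_c, n_c, alpha=0.05, alternative="greater"):
    """Return (z, reject) for the chosen alternative; alpha is a fraction."""
    z = two_sample_z(x_t, n_t, x_c, n_c)
    quantile = NormalDist().inv_cdf
    if alternative == "greater":                # p_t > p_c
        return z, z > quantile(1 - alpha)
    if alternative == "less":                   # p_t < p_c
        return z, z < quantile(alpha)
    if alternative == "two-sided":              # p_t != p_c
        return z, abs(z) > quantile(1 - alpha / 2)
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

# Hypothetical samples: 62 ones among 500 treated, 38 ones among 500 controls.
for alt in ("greater", "less", "two-sided"):
    print(alt, z_test(62, 500, 38, 500, alternative=alt))
```

The z-score is the same in all three cases; only the rejection region changes with the alternative hypothesis.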

This test is based on transforming the difference of sample percentages (the test statistic) approximately to standard units, under the assumption that the null hypothesis is true. Because the null hypothesis specifies that the two population percentages are equal, the expected value of the difference between the sample percentages is zero: the expected values of both sample percentages are p. However, the null hypothesis does not specify the value of p, and the SE of the difference of sample percentages depends on p, so we cannot calculate the SE of the test statistic under the null hypothesis; we have to estimate SE(φ_{t − c}) from the data. When the combined sample size N=n_t+n_c is sufficiently large, the pooled bootstrap estimate of the SD of the null box is likely to be quite accurate, and the estimated SE is likely to be very close to the true SE. When the individual sample sizes are large, the probability histogram of the difference of sample percentages can be approximated well by the normal curve.

When the sample is not large, the estimated SE will tend to differ from the true SE, and the normal approximation to the distribution of the difference of sample percentages will not be accurate. Then, the actual significance level of the z test can be very different from its nominal significance level, and we need to be more circumspect.
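One way to see how far the actual significance level can drift from the nominal level is to compute it exactly by summing binomial probabilities over every outcome that leads to rejection. The sketch below does that for the one-sided test with the pooled SE estimate; the function names, sample sizes, and values of p are assumptions, and outcomes where the pooled percentage is 0 or 1 (so Z is undefined) are treated as "do not reject".

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def actual_level(n_t, n_c, p, threshold=1.645):
    """Exact chance that the one-sided z test rejects when p_t = p_c = p."""
    level = 0.0
    for x_t in range(n_t + 1):
        for x_c in range(n_c + 1):
            if x_t + x_c in (0, n_t + n_c):
                continue                      # estimated SE is zero; Z undefined
            phi = (x_t + x_c) / (n_t + n_c)   # pooled sample percentage
            se_est = sqrt(1 / n_t + 1 / n_c) * sqrt(phi * (1 - phi))
            z = (x_t / n_t - x_c / n_c) / se_est
            if z > threshold:
                level += binom_pmf(x_t, n_t, p) * binom_pmf(x_c, n_c, p)
    return level

# Hypothetical comparison of a small design with a larger one.
small = actual_level(10, 10, p=0.1)
large = actual_level(200, 200, p=0.5)
print("n = 10 each, p = 0.1:", small)
print("n = 200 each, p = 0.5:", large)
```

Comparing the two printed values with the nominal 5% gives a sense of when the approximation can be trusted; the function can be rerun with other sample sizes and percentages.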

The Normal Approximation to Fisher's Exact Test and the z Test for Equality of Two Percentages

We derived the z test for equality of two percentages using the assumption that the two samples are independent and that their sizes are fixed in advance. We derived Fisher's exact test by conditioning on the total number of tickets in the sample labeled "1," and found that the test could be used in two quite different situations: to test the hypothesis that treatment has no effect when a fixed collection of individuals are randomized into treatment and control groups, so the treatment and control samples are dependent; and to test the hypothesis that two population percentages are equal from independent samples from the two populations.

Somewhat surprisingly, the normal approximation to Fisher's exact test is essentially the z test when the sample sizes are all large. (The difference is just the −1 in the denominator of the finite population correction, which is negligible if the samples are large.) That is, the z score in the normal approximation to Fisher's exact test is almost exactly equal to the z score in the z-test for equality of two percentages using independent samples: The two tests reject for essentially the same observed data values.
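The near-equality of the two z-scores can be checked numerically. Working through the algebra with G = X_t + X_c shows that the z-score from the normal approximation to Fisher's exact test equals the z-score from the z test multiplied by ((N−1)/N)^½. The sketch below, with hypothetical function names and counts, verifies this for one data set.

```python
from math import sqrt

def fisher_normal_z(x_t, n_t, x_c, n_c):
    """z-score in the normal approximation to Fisher's exact test."""
    N, G = n_t + n_c, x_t + x_c
    f = sqrt((N - n_t) / (N - 1))                      # finite population correction
    se = f * sqrt(n_t) * sqrt((G / N) * (1 - G / N))   # SE of X_t
    return (x_t - n_t * G / N) / se

def pooled_z(x_t, n_t, x_c, n_c):
    """z-score in the z test for equality of two percentages."""
    phi = (x_t + x_c) / (n_t + n_c)                    # pooled sample percentage
    se_est = sqrt(1 / n_t + 1 / n_c) * sqrt(phi * (1 - phi))
    return (x_t / n_t - x_c / n_c) / se_est

# Hypothetical data: 62 ones among 500 treated, 38 ones among 500 controls.
x_t, n_t, x_c, n_c = 62, 500, 38, 500
N = n_t + n_c
z_fisher = fisher_normal_z(x_t, n_t, x_c, n_c)
z_pooled = pooled_z(x_t, n_t, x_c, n_c)
print(z_fisher, z_pooled, z_fisher / z_pooled, sqrt((N - 1) / N))
```

For N = 1000 the factor is about 0.9995, so the two tests reject for essentially the same data.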

It is rather surprising that tests derived under different assumptions behave so similarly. Generally, when the assumptions of a test are violated, the nominal significance level will be incorrect and the test should not be used. This is a rare exception.

Summary

Suppose two variables, C and T, are defined for a group of N individuals: c_i is the value of C for the ith individual, and t_i is the value of T for the ith individual, i=1, 2, …, N. Suppose each c_i and each t_i can equal either 0 or 1, so that

p_c = (c_1 + c_2 + … + c_N)/N

is the population percentage of the values of C, and

p_t = (t_1 + t_2 + … + t_N)/N

is the population percentage of the values of T. A simple random sample of size n_t will be taken from the population. The values of t_i are observed for the units in the sample; for the N−n_t units not in the sample, the values of c_i are observed instead. This is the randomization model for evaluating whether a treatment has an effect in an experiment in which a fixed set of N units are assigned at random either to treatment or to control. The response of individual i is t_i if he is treated and c_i if not. At issue is whether the treatment has an effect. The null hypothesis is that treatment does not matter at all: c_i=t_i, for every individual i. Let G be the sum of all the observations, the observed values of c_i plus the observed values of t_i. Let X_t be the sum of the observed values of t_i.

If the null hypothesis is true, the n_t observed values of t_i are like a random sample from a 0-1 box of N tickets of which G are labeled 1. Thus X_t has an hypergeometric distribution with parameters N, G, and n_t. Fisher’s exact test uses X_t as the test statistic, and this hypergeometric distribution to select the rejection region. If the alternative hypothesis is that p_t > p_c, then if the alternative hypothesis is true X_t would tend to be larger than it would be if the null hypothesis is true, so the hypothesis test should be of the form {Reject if X_t>x₀}, with x₀ chosen so that the test has the desired significance level. If the sample sizes are large, it can be difficult to calculate the rejection region for Fisher's exact test; then the normal approximation to the hypergeometric distribution can be used to construct a test with approximately the correct significance level. In the normal approximation to Fisher's exact test, the rejection region for approximate significance level α uses the threshold for rejection

x₀ = n_t×G/N + z_1−α × f × n_t^½ × (G/N × (1−G/N))^½,

where f is the finite population correction (N−n_t)^½/(N−1)^½ and z_1−α is the 1 − α quantile of the normal curve. The α quantile of the normal curve, z_α, is the number for which the area under the normal curve from minus infinity to z_α equals α. For example, z_0.05=−1.645, and z_0.95=1.645.

A Z-statistic is a test statistic whose probability histogram can be approximated well by a normal curve if the null hypothesis is true. The observed value of a Z-statistic is called the z-score. In the normal approximation to Fisher's exact test, the Z-statistic is

Z = (X_t − n_t×G/N)/(f × n_t^½ × (G/N × (1−G/N))^½).

Suppose one wants to test the null hypothesis that two population percentages are equal, p_t=p_c, on the basis of independent random samples with replacement from the two populations. This is the population model for comparing two population percentages. Let n_t denote the size of the random sample from the first population; let n_c be the size of the sample from the second population; and let N=n_t+n_c be the total sample size. Let X_t denote the sample sum of the first sample; let X_c denote the sample sum of the second sample; and let

G = X_t + X_c

denote the sum of the two samples. Conditional on the value of G, the probability distribution of X_t is hypergeometric with parameters N, G, and n_t, so Fisher's exact test can be used to test the null hypothesis. There is a different approximate approach based on the normal approximation to the probability distribution of the sample percentages: Let φ_t denote the sample percentage of the sample from the first population; let φ_c denote the sample percentage of the sample from the second population; and let φ denote the overall sample percentage of the two samples pooled together,

φ = (X_t + X_c)/N = (n_t×φ_t + n_c×φ_c)/N.

Then

s* = (φ×(1−φ))^½

is the pooled bootstrap estimate of the SD of the null box. Under the null hypothesis, for large sample sizes n_t and n_c, the probability histogram of

Z = (φ_t − φ_c)/((1/n_t + 1/n_c)^½ × (φ×(1−φ))^½)

can be approximated accurately by the normal curve, so Z is a Z-statistic. To test the null hypothesis against the one-sided alternative that p_t<p_c at approximate significance level α, use a one-sided test that rejects the null hypothesis when Z<z_α. To test the null hypothesis against the one-sided alternative that p_t>p_c at approximate significance level α, use a one-sided test that rejects the null hypothesis when Z>z_1−α. To test the null hypothesis against the two-sided alternative that p_t≠p_c at approximate significance level α, use a two-sided test that rejects the null hypothesis when |Z|≥z_1−α/2. The Z test for the equality of two percentages is essentially equivalent to the normal approximation to Fisher's exact test when the sample sizes are all large, even though the assumptions of the tests differ.

Testing Equality of Two Percentages