# Testing Equality of Two Percentages

Earlier chapters introduced a conceptual framework for statistical hypothesis testing and presented important statistical considerations for determining whether a treatment has an effect. Treatment is meant loosely: it could be a drug, an advertising campaign, a car wax, a test preparation course, a fertilizer, etc. The best way to determine whether a treatment has an effect is to use the method of comparison in an experiment in which subjects are assigned at random to the treatment group or the control group.

When the measurement of each subject can be represented by 0 or 1 (e.g., subject's condition improves or not, subject buys something or not, subject clicks a link or not, subject passes an exam or not), deciding whether the treatment has an effect is essentially testing the null hypothesis that two percentages are equal—which is the problem this chapter addresses.

Different ways of drawing samples lead to different tests. In one sampling design (the randomization model), the entire collection of subjects is allocated randomly between treatment and control, which makes the samples dependent. Conditioning on the total number of ones in the treatment and control groups leads to Fisher's exact test, which is based on the hypergeometric distribution of the number of ones in the treatment group if the null hypothesis is true. When the sample sizes are large, calculating the rejection region for Fisher's Exact Test is cumbersome, but the normal approximation to the hypergeometric distribution gives an approximate test—a test whose significance level is approximately what it claims to be.

In a second sampling design (the population model), the two samples are independent random samples with replacement from two populations; conditioning on the total number of ones in the two samples again leads to Fisher's exact test, which can be approximated as before.

There is another approximate approach to testing the null hypothesis in the population model: If the sample sizes are large (and the samples are drawn with replacement or are small compared to the two population sizes), the normal approximation to the distribution of the difference between the two sample percentages tends to be accurate. If the null hypothesis is true, the expected value of the difference between the sample percentages is zero, and the SE of the difference in sample percentages can be estimated by pooling the two samples. That allows one to transform the difference of sample percentages approximately into standard units, and to base an hypothesis test on the normal approximation to the probability distribution of the approximately standardized difference. Surprisingly, the resulting approximate test is essentially the normal approximation to Fisher's exact test, even though the assumptions of the two tests are different.

## Fisher's Exact Test for an Effect—Dependent Samples

Suppose we own a start-up company that offers e-tailers a service for targeting their Web advertising. Consumers register with our service by filling out a form indicating their likes and dislikes, gender, age, etc. We store cookies on each consumer’s computer to keep track of who he is. When a consumer with one of our cookies visits the Web site of any of our clients, we use the consumer's likes and dislikes to select (from a collection of the client's ads) the advertisement we think he is most likely to respond to. This is called targeted advertising. The targeting service is free to consumers; we charge the e-tailers. We can raise venture capital if we can show that targeting makes e-tailers' advertisements more effective.

We offer our service free to a large e-tailer. The e-tailer has a collection of advertisements that it usually uses in rotation: Each time a consumer arrives at the site, the server selects the next ad in the sequence to show to the consumer; the cycle starts over when all the ads have been shown.

To test whether targeting works, we implement a randomized, controlled, blind experiment by installing our software on the e-tailer's server to work as follows: Each time a consumer arrives at the site, with probability 50% the server shows the consumer the ad our targeting software selects, and with probability 50% the server shows the consumer the next ad in the rotation—the control ad. The decision of whether to show a consumer the targeted ad or the control ad is independent from consumer to consumer. For each consumer, the software records which strategy was used (target or rotation), and whether the consumer buys anything. The consumers who were shown the targeted ad comprise the treatment group; the other consumers comprise the control group. If a consumer visits the site more than once during the trial period, we ignore all of that consumer's visits but the first. Each subject (consumer) is assigned at random either to treatment or to control, and no subject knows which group he is in, so this is a controlled, randomized, blind experiment. There is no subjective element to determining whether a subject purchased something, so the lack of a double blind does not introduce bias.

We can think of the experiment in the following way: The ith consumer has a ticket with two numbers on it: The first number, ci, is 1 if the consumer would have bought something if shown the control ad, and 0 if not. The second number, ti, is 1 if the consumer would have bought something if shown the targeted ad, and 0 if not. There are N tickets in all. Under the null hypothesis that targeting has no effect, ti=ci for each i=1, 2,  … , N. That is, each consumer either will buy or will not buy, regardless of the ad he is shown: Whether he will buy is determined before he is assigned to treatment or control.

For the ith consumer, we observe either ci or ti, but not both. The percentage of consumers who would have purchased something if every consumer had been shown the control ads is

pc = ( c1 + c2 + … + cN )/N.

The percentage of consumers who would have bought something if all had been shown the targeted ads is

pt = ( t1 + t2 + … + tN )/N.

Let

μ = pt − pc

be the difference between the percentage of consumers who would have bought had all been shown the targeted ad, and the percentage of consumers who would have bought had all been shown the control ad. Under the null hypothesis that targeting does not make a difference, ti = ci for all i=1, 2, … , N. Thus if the null hypothesis is true, μ = 0, but the hypothesis that μ =  0 is weaker than the null hypothesis: If μ ≠ 0, the null hypothesis is false, but the null hypothesis can be false and yet still μ = 0. (That occurs if the number of consumers who would have bought something if all had been shown the targeted ads is equal to the number of consumers who would have bought something if all had been shown the control ads, but the purchases were made by a different subset of consumers.) The alternative hypothesis, that targeting helps, is that μ > 0. We would like to test the null hypothesis at significance level 5%.
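The distinction between the null hypothesis (ti = ci for every i) and the weaker hypothesis μ = 0 can be seen in a tiny made-up example. The six tickets below are hypothetical, chosen only so that two consumers respond differently to the two ads while the totals agree.

```python
# Hypothetical tickets (c_i, t_i) for N = 6 consumers; the values are
# invented for illustration and are not data from the experiment.
tickets = [(1, 0), (0, 1), (1, 1), (0, 0), (0, 0), (1, 1)]

N = len(tickets)
p_c = sum(c for c, t in tickets) / N  # fraction who would buy if shown the control ad
p_t = sum(t for c, t in tickets) / N  # fraction who would buy if shown the targeted ad
mu = p_t - p_c

# The null hypothesis requires t_i == c_i for every single consumer.
null_holds = all(c == t for c, t in tickets)

print(f"p_c = {p_c}, p_t = {p_t}, mu = {mu}")
print("null hypothesis holds:", null_holds)
```

Here μ = 0 even though the null hypothesis is false: the first consumer buys only under control and the second buys only under targeting.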

Let Xt be the number of sales to consumers in the treatment group, the sum of the observed values of ti. If the null hypothesis is true, the same G consumers would have bought whether they were assigned to treatment or to control, and the number of the consumers in the treatment group who bought something is the number of those G in a simple random sample of size nt from the population of N consumers. Thus, for any fixed values of N, G, and nt, Xt has an hypergeometric distribution with parameters N, G, and nt.
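As a sketch of what the hypergeometric distribution of Xt looks like, the following computes its probabilities directly from counting. The function name `hypergeom_pmf` and the numbers (20 consumers, 8 buyers, 10 assigned to treatment) are assumptions made for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, G, n):
    """Chance of exactly x ones in a simple random sample of size n
    from a box of N tickets of which G are labeled 1."""
    if x < max(0, n - (N - G)) or x > min(n, G):
        return 0.0  # impossible counts
    return comb(G, x) * comb(N - G, n - x) / comb(N, n)

# Hypothetical values of N, G, and nt.
N, G, n_t = 20, 8, 10
pmf = [hypergeom_pmf(x, N, G, n_t) for x in range(n_t + 1)]
mean = sum(x * p for x, p in enumerate(pmf))

for x, p in enumerate(pmf):
    print(f"P(Xt = {x}) = {p:.4f}")
print("sum of probabilities:", round(sum(pmf), 12))
print("expected value:", round(mean, 12))
```

The probabilities sum to 1, and the expected value agrees with nt×G/N.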

If the alternative hypothesis μ > 0 is true, Xt tends to be larger than it would if the null hypothesis is true, so we should design our test to reject the null hypothesis for large values of Xt. That is, our rejection region should contain all values exceeding some threshold value x0: We will reject the null hypothesis if Xt > x0. We need to pick x0 so that the test has the desired significance level, 5%.

We cannot calculate the threshold value x0 until we know N, nt, and G. Once we observe them, we can find the smallest value x0 so that the probability that Xt is larger than x0 if the null hypothesis is true is at most 5%, the chosen significance level. Our rule for testing the null hypothesis then is to reject the null hypothesis if Xt > x0, and not to reject the null hypothesis otherwise. This is called Fisher's exact test for the equality of two percentages (against a one-sided alternative). The test is called exact because its probability of a Type I error can be computed exactly.
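The search for the threshold x0 can be sketched in a few lines. The helper names `upper_tail` and `fisher_threshold` are hypothetical, and the values of N, G, and nt are invented; the logic simply raises x0 until the chance that Xt exceeds it, computed from the hypergeometric distribution, is at most the significance level.

```python
from math import comb

def upper_tail(x0, N, G, n):
    """P(X > x0) when X is hypergeometric with parameters N, G, n."""
    lo = max(x0 + 1, 0, n - (N - G))
    hi = min(n, G)
    ways = sum(comb(G, x) * comb(N - G, n - x) for x in range(lo, hi + 1))
    return ways / comb(N, n)

def fisher_threshold(N, G, n, alpha=0.05):
    """Smallest x0 for which P(X > x0) <= alpha under the null hypothesis."""
    x0 = max(0, n - (N - G)) - 1  # below every possible value, so the tail is 1
    while upper_tail(x0, N, G, n) > alpha:
        x0 += 1
    return x0

# Hypothetical experiment: 20 consumers, 8 purchases in all, 10 in treatment.
N, G, n_t = 20, 8, 10
x0 = fisher_threshold(N, G, n_t)
print("reject the null hypothesis if Xt >", x0)
print("chance of a Type I error:", upper_tail(x0, N, G, n_t))
```

Because Xt takes only whole-number values, the chance of a Type I error is typically below the nominal 5% rather than equal to it.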

## The Normal Approximation to Fisher's Exact Test

If N is large and nt is neither close to zero nor close to N, computing the hypergeometric probabilities will be difficult, but the normal approximation to the probability distribution of Xt should be accurate provided G is neither too close to zero nor too close to N. To calculate the normal approximation, we need to convert Xt to standard units, which requires that we know the expected value and SE of Xt. The expected value of Xt is

$$E(X_t) = n_t \times G/N,$$

and the SE of Xt is

$$SE(X_t) = f \times n_t^{1/2} \times SD,$$

where f is the finite population correction

$$f = \frac{(N - n_t)^{1/2}}{(N - 1)^{1/2}},$$

and SD is the standard deviation of a list of N values of which G equal 1 and (N−G) equal 0:

$$SD = \left( G/N \times (1 - G/N) \right)^{1/2}.$$

In standard units, Xt is

$$Z = \frac{X_t - E(X_t)}{SE(X_t)} = \frac{X_t - n_t \times G/N}{f \times n_t^{1/2} \times SD}.$$

Recall that we want to test the null hypothesis at significance level 5%. The area under the normal curve to the right of 1.645 standard units is 5%, which corresponds to the threshold value

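$$\begin{aligned}
x_0 &= E(X_t) + 1.645 \times SE(X_t) = n_t \times G/N + 1.645 \times f \times n_t^{1/2} \times SD \\
    &= n_t \times G/N + 1.645 \times f \times n_t^{1/2} \times \left( G/N \times (1 - G/N) \right)^{1/2}
\end{aligned}$$

in the original units. Thus if we reject the null hypothesis when

$$Z > 1.645,$$

or equivalently when

$$X_t > n_t \times G/N + 1.645 \times f \times n_t^{1/2} \times \left( G/N \times (1 - G/N) \right)^{1/2},$$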

the result is an (approximate) 5% significance level test of the null hypothesis that targeting has no effect, against the alternative hypothesis that targeting increases the fraction of consumers who buy. This is the normal approximation to Fisher's exact test; Z is called the Z-statistic, and the observed value of Z is called the z-score. Another chapter examines in more generality the normal approximation to test statistics transformed approximately to standard units. Later in this chapter we shall see another example, the z test for equality of two percentages from independent samples.
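The calculation can be sketched as follows. The function name `fisher_normal_approx` and the data (2,000 consumers, 1,000 in treatment, 220 purchases in all, 130 of them in treatment) are hypothetical; the 95% quantile of the normal curve comes from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def fisher_normal_approx(x_t, N, G, n_t, alpha=0.05):
    """Return (z-score, approximate threshold x0, reject?) for the
    one-sided test against the alternative that treatment helps."""
    expected = n_t * G / N                 # E(Xt) under the null hypothesis
    f = sqrt((N - n_t) / (N - 1))          # finite population correction
    sd = sqrt(G / N * (1 - G / N))         # SD of a 0-1 list with G ones among N
    se = f * sqrt(n_t) * sd                # SE(Xt)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    z = (x_t - expected) / se
    return z, expected + z_crit * se, z > z_crit

# Hypothetical data, invented for illustration.
z, x0, reject = fisher_normal_approx(x_t=130, N=2000, G=220, n_t=1000)
print(f"z-score = {z:.3f}, threshold x0 = {x0:.2f}, reject = {reject}")
```

Comparing Z with the quantile and comparing Xt with x0 are the same rule expressed in different units, so the two always lead to the same decision.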

To test at a significance level other than 5%, reject when Z exceeds a different threshold; choose the threshold so that the area under the normal curve to the right of the threshold is equal to the desired significance level. For example, to test at significance level 10%, reject if Z>1.282. Some new notation will make it easier to express the general strategy. Recall that the normal curve is positive, and the total area under the normal curve is 100%, so the normal curve is like a histogram. Quantiles of the normal curve are defined as follows: For any number α between 0 and 100%, the α quantile of the normal curve, zα, is the unique number such that the area under the normal curve to the left of zα is α. For example, z50% = 0, because the area under the normal curve to the left of zero is 50%. Two other commonly used quantiles, which give the thresholds used above, are z90% = 1.282 and z95% = 1.645.

Because the normal curve is symmetric about zero, $z_{100\%-\alpha} = -z_\alpha$. Note that the area under the normal curve between $z_\alpha$ and $z_{100\%-\alpha}$ is $100\% - 2\times\alpha$. Combining these two results shows that the area under the normal curve over the interval

$$[-z_{100\%-\alpha/2},\; z_{100\%-\alpha/2}]$$

is 100% − α, and thus the area under the normal curve outside the interval (the area under the normal curve over the complement of the interval) is α. This complement also can be written

$$\text{all values } z \text{ such that } |z| > z_{100\%-\alpha/2}.$$

With this notation for quantiles of the normal curve, it is easier to write down the rejection region of the normal approximation to Fisher's exact test for a general significance level: The significance level of the rule {Reject the null hypothesis if $Z > z_{100\%-\alpha}$} is approximately α.
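The quantile notation can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library, with α written as a fraction rather than a percentage; the helper name `z_quantile` is an assumption made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # the normal curve: mean 0, SD 1

def z_quantile(alpha):
    """The alpha quantile: area to its left under the normal curve is alpha."""
    return std_normal.inv_cdf(alpha)

for a in (0.05, 0.10, 0.50, 0.90, 0.95, 0.975):
    print(f"z_{a} = {z_quantile(a):+.3f}")

# Symmetry: the (1 - alpha) quantile is minus the alpha quantile.
alpha = 0.05
print(z_quantile(1 - alpha), -z_quantile(alpha))

# The area between -z_(1 - alpha/2) and z_(1 - alpha/2) is 1 - alpha.
half_width = z_quantile(1 - alpha / 2)
area = std_normal.cdf(half_width) - std_normal.cdf(-half_width)
print("area inside the interval:", round(area, 6))
```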


## Testing the Equality of Two Percentages Using Independent Samples

In the experiment to test the effectiveness of targeted advertising using the randomization model described previously in this chapter, the samples from the populations of control and treatment values are dependent: Individual i has two numbers, ci and ti, and if we observe ci we cannot observe ti, and vice versa. If individual i is in the treatment group, he or she is not in the control group, and vice versa. Under the null hypothesis, the purchasers would have bought whether they were assigned to treatment or to control, and the non-purchasers would not have bought whether they were assigned to treatment or control, so the total number of purchasers does not depend on which consumers were assigned to treatment. That constancy led to an hypergeometric distribution for the number of purchasers in the treatment group under the null hypothesis. In this section, we see that Fisher's exact test allows us to test a slightly weaker null hypothesis when the data are two independent random samples with replacement from separate populations, a control group and a treatment group. This is the population model for comparing two percentages.

The weaker hypothesis is that the population percentage of the treatment group is equal to the population percentage of the control group. We also develop an approximate test for the equality of two percentages based on the sample percentages of independent random samples with replacement from two populations. The approximate test is essentially equivalent to the normal approximation to Fisher's exact test when the sample sizes are large.

### Fisher's Exact Test Using Independent Samples

Suppose there are two populations of tickets labeled 0 and 1, a control group and a treatment group, with corresponding population percentages pc and pt. We want to test the null hypothesis that

pc = pt;

i.e., that

μ = pt − pc = 0.

We draw a random sample of size nc with replacement from the control group, and compute the sample sum Xc. Xc has a binomial distribution with parameters nc and pc. We draw another random sample of size nt with replacement from the treatment group, and compute the sample sum Xt. Xt has a binomial distribution with parameters nt and pt. We draw the two random samples independently of each other, so Xc and Xt are independent random variables. This scenario could correspond to an observational study, to a non-randomized experiment, or to a randomized experiment, depending upon how individuals came to be in the treatment group and the control group. The randomness in the problem at this point is in drawing the samples from the control group and the treatment group, not in assigning subjects to treatment or to control—that assignment occurred before we arrived on the scene. In this population model, we might be able to conclude from the data that the population percentages differ for the treatment and control groups (that pt ≠pc), but even then we should not conclude that treatment has an effect unless the assignment of subjects to treatment and control was randomized. Otherwise, any real difference between pt and pc could be the result of confounding, rather than the result of the treatment.

In contrast, in the randomization model described earlier in the chapter, we might be able to conclude that the treatment has an effect for the N subjects in the randomization, but even then we should not extrapolate from those N subjects to conclude that treatment has an effect in the larger population from which the subjects were drawn, because we do not know how they were drawn.

Let N = nc + nt. Let G = Xc+Xt be the sum of both the samples, that is, the total number of ones. Given the value of G, the distribution of Xt is hypergeometric with parameters N, G, and nt if the null hypothesis is true. This is proved in a footnote, but here is an explanation: If the null hypothesis is true, there is no difference between drawing with replacement from the treatment group and drawing with replacement from the control group, so every way of allocating the N observed values into a group of size nt and a group of size nc is equally likely. There are $\binom{N}{n_t}$ such ways (the number of ways of choosing nt of the N values), of which

$$\binom{G}{x} \times \binom{N-G}{n_t-x}$$

result in x ones and nt − x zeros among the nt values drawn from the treatment group, so the chance that Xt=x is

$$\binom{G}{x} \times \binom{N-G}{n_t-x} \Big/ \binom{N}{n_t}:$$

given the value of G, Xt has an hypergeometric distribution with parameters N, G, and nt if the null hypothesis is true. Thus, under the null hypothesis that pc=pt, given the total number G of ones in the sample, the test statistic Xt has the same distribution for this sampling design that the test statistic Xt did for a population of N subjects assigned randomly to treatment or control. Therefore, the same testing procedure, Fisher's exact test, can be used to test the null hypothesis that pc=pt using independent random samples from two populations. In the previous section, the null hypothesis and the sampling design were different: Each subject i had two values, ti and ci; the null hypothesis was that

ti = ci for all i=1, 2, … ,N;

and each of the N subjects was assigned at random either to treatment or to control.
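The claim that conditioning on G turns two independent binomial sample sums into a hypergeometric distribution can be checked numerically for small cases. In this sketch the sample sizes, the value of G, and the two trial values of the common percentage p are all invented; the conditional chances are computed directly from the binomial probabilities and compared with the hypergeometric formula.

```python
from math import comb

def binom_pmf(x, n, p):
    """Chance of x ones in n independent draws, each 1 with chance p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def conditional_pmf(x, g, n_t, n_c, p):
    """P(Xt = x given Xt + Xc = g) for independent binomial sample sums
    that share the same p (that is, under the null hypothesis)."""
    joint = binom_pmf(x, n_t, p) * binom_pmf(g - x, n_c, p)
    marginal = sum(binom_pmf(k, n_t, p) * binom_pmf(g - k, n_c, p)
                   for k in range(max(0, g - n_c), min(n_t, g) + 1))
    return joint / marginal

def hypergeom_pmf(x, N, G, n):
    return comb(G, x) * comb(N - G, n - x) / comb(N, n)

# Hypothetical sizes: nt = 6, nc = 9, and G = 5 ones in all.
n_t, n_c, g = 6, 9, 5
max_diff = 0.0
for p in (0.2, 0.7):                 # the common value of p drops out
    for x in range(0, g + 1):
        diff = abs(conditional_pmf(x, g, n_t, n_c, p)
                   - hypergeom_pmf(x, n_t + n_c, g, n_t))
        max_diff = max(max_diff, diff)
print("largest discrepancy:", max_diff)
```

The discrepancy is at the level of rounding error for both values of p, which is why the test does not require knowing p.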

As noted previously, it is hard to perform the calculations needed to find the rejection region for this test when N is large; the normal approximation to Fisher's exact test described in the previous section is a computationally tractable way to construct the rejection region. The approximation is accurate under the same assumptions.

### The Z Test for the Equality of Two Percentages using Independent Samples

In this section, we develop another approximate test of the null hypothesis that pt=pc in the population model; it turns out that this test is essentially the same as the normal approximation to Fisher's exact test, although it is motivated quite differently.

Let φc be the sample percentage of the random sample from the control group, and let φt be the sample percentage of the random sample from the treatment group. Suppose that the two sample sizes nc and nt are large (say, over 100 each). Then the normal approximations to the two sample percentages should be accurate (provided neither pc nor pt is too close to 0 or to 1). The expected value of the sample percentage of a random sample with replacement is the population percentage, so the expected value of φc is pc, and the expected value of φt is pt. The SE of φc is

$$SE(\phi_c) = \frac{\left( p_c \times (1 - p_c) \right)^{1/2}}{n_c^{1/2}},$$

and the SE of φt is

$$SE(\phi_t) = \frac{\left( p_t \times (1 - p_t) \right)^{1/2}}{n_t^{1/2}}.$$

Consider the difference of the two sample percentages

$$\phi_{t-c} = \phi_t - \phi_c.$$

The difference φt−c is a random variable. The expected value of φt−c is

$$\mu = p_t - p_c.$$

Because the samples from the treatment and control groups are independent of each other, φt and φc are independent, so the SE of φt−c is

$$SE(\phi_{t-c}) = \left( SE^2(\phi_t) + SE^2(\phi_c) \right)^{1/2}.$$

If the null hypothesis is true, the two population percentages are equal—pt=pc=p—and the two samples are like one larger sample from a single 0-1 box with a percentage p of tickets labeled "1." Let us call that box the null box. If the null hypothesis is true, the expected value of φt−c, E(φt−c), is zero, and

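$$\begin{aligned}
SE(\phi_{t-c}) &= \left( p \times (1-p)/n_t + p \times (1-p)/n_c \right)^{1/2} \\
               &= \left( 1/n_t + 1/n_c \right)^{1/2} \times \left( p \times (1-p) \right)^{1/2} \\
               &= \left( N/(n_t \times n_c) \right)^{1/2} \times \left( p \times (1-p) \right)^{1/2}.
\end{aligned}$$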

The first factor depends only on the sample sizes nt and nc, which we know. The second factor is the SD of the labels on the tickets in the null box. That factor depends only on p, the percentage of tickets labeled "1" in the null box. We do not know p, so we do not know the SD of the null box. However, we can use the bootstrap estimate of the SD of the null box because the sample size is large: let φ be the pooled sample percentage

φ = (total number of "1"s in both samples)/(total sample size)

= (nc×φc + nt×φt)/N.

The pooled bootstrap estimate of the SD of the null box is the estimate we get by pretending that the percentage of ones in the null box is equal to the percentage of ones in the pooled sample:

$$s^* = (\text{pooled bootstrap estimate of SD of the null box}) = \left( \phi \times (1 - \phi) \right)^{1/2}.$$

If the sample sizes are large and the null hypothesis is true, this will tend to be close to the true SD of the null box, and

$$SE^*(\phi_{t-c}) = \left( N/(n_t \times n_c) \right)^{1/2} \times s^*$$

will tend to be quite close to SE(φt−c). The normal approximation to the probability distribution of φt−c tells us that the chance that φt−c is in a given range is approximately equal to the area under the normal curve for the same range, converted to standard units. Under the null hypothesis, the expected value of φt−c is zero, and SE(φt−c) is approximately SE*(φt−c), so

Z = φt−c/SE*(φt−c)

is approximately φt−c in standard units: The chance that Z is in the range of values [a, b] is approximately the area under the normal curve between a and b.

Under the alternative hypothesis that pt>pc, Z will tend to be larger than it would under the null hypothesis. We can test the null hypothesis that pt=pc against the one-sided alternative hypothesis that pt>pc using

Z = φt−c/SE*(φt−c)

as the test statistic. To test at approximate significance level α, reject the null hypothesis if Z > z1−α.

This is called the (one-sided) z test for equality of two percentages using independent samples. The random variable Z is called the Z-statistic, and the observed value of Z is called the z-score. To test the null hypothesis against the other one-sided alternative hypothesis that pt<pc at approximate significance level α, reject the null hypothesis if Z<zα. To test the null hypothesis against the two-sided alternative hypothesis that pt≠pc at approximate significance level α, reject when |Z| > z1−α/2.
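The three versions of the z test can be sketched in one function. The name `z_test`, its `alternative` argument, and the sample counts are assumptions made for illustration; α is written as a fraction.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_t, n_t, x_c, n_c, alpha=0.05, alternative="greater"):
    """z test for equality of two percentages from independent samples.
    x_t and x_c are the sample sums; returns (z-score, reject?)."""
    phi_t, phi_c = x_t / n_t, x_c / n_c
    phi = (x_t + x_c) / (n_t + n_c)            # pooled sample percentage
    s_star = sqrt(phi * (1 - phi))             # pooled bootstrap estimate of the SD
    se_star = sqrt(1 / n_t + 1 / n_c) * s_star # estimated SE of the difference
    z = (phi_t - phi_c) / se_star
    q = NormalDist().inv_cdf                   # quantiles of the normal curve
    if alternative == "greater":               # alternative: pt > pc
        reject = z > q(1 - alpha)
    elif alternative == "less":                # alternative: pt < pc
        reject = z < q(alpha)
    elif alternative == "two-sided":           # alternative: pt != pc
        reject = abs(z) > q(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, reject

# Hypothetical samples: 130 ones among 1,000 treated, 90 among 1,000 controls.
for alt in ("greater", "less", "two-sided"):
    z, reject = z_test(130, 1000, 90, 1000, alternative=alt)
    print(f"{alt:>9}: z-score = {z:.3f}, reject = {reject}")
```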

This test is based on transforming the difference of sample percentages (the test statistic) approximately to standard units, under the assumption that the null hypothesis is true. Because the null hypothesis specifies that the two population percentages are equal, the expected value of the difference between the sample percentages is zero: the expected values of both sample percentages are p. However, the null hypothesis does not specify the value of p, and the SE of the difference of sample percentages depends on p, so we cannot calculate the SE of the test statistic under the null hypothesis; we have to estimate SE(φt−c) from the data. When the combined sample size N=nt+nc is sufficiently large, the pooled bootstrap estimate of the SD of the null box is likely to be quite accurate, and the estimated SE is likely to be very close to the true SE. When the individual sample sizes are large, the probability histogram of the difference of sample percentages can be approximated well by the normal curve.

When the sample sizes are not large, the estimated SE will tend to differ from the true SE, and the normal approximation to the distribution of the difference of sample percentages will not be accurate. Then, the actual significance level of the z test can be very different from its nominal significance level, and we need to be more circumspect.
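For small samples, the actual significance level of the one-sided z test can be computed exactly by adding up the binomial chances of every pair of sample sums for which the test rejects. The following is a sketch under stated assumptions: the function name `actual_level` and the sample sizes and values of p are invented, and pairs for which the pooled SD estimate is zero (so Z is undefined) are counted as not rejecting.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def actual_level(n_t, n_c, p, alpha=0.05):
    """Exact chance that the one-sided z test rejects when pt = pc = p."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    level = 0.0
    for x_t in range(n_t + 1):
        for x_c in range(n_c + 1):
            phi = (x_t + x_c) / (n_t + n_c)
            s_star = sqrt(phi * (1 - phi))
            if s_star == 0:
                continue  # Z undefined; treated as "do not reject" (an assumption)
            z = (x_t / n_t - x_c / n_c) / (s_star * sqrt(1 / n_t + 1 / n_c))
            if z > z_crit:
                level += binom_pmf(x_t, n_t, p) * binom_pmf(x_c, n_c, p)
    return level

# Hypothetical small-sample settings; the nominal level is 5% in each.
for n_t, n_c, p in [(5, 5, 0.5), (10, 10, 0.1), (30, 30, 0.1)]:
    print(f"nt = {n_t}, nc = {n_c}, p = {p}: "
          f"actual level = {actual_level(n_t, n_c, p):.4f}")
```

Comparing the printed values with the nominal 5% gives a sense of how far off the normal approximation can be for a particular design.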


## The Normal Approximation to Fisher's Exact Test and the z Test for Equality of Two Percentages

We derived the z test for equality of two percentages using the assumption that the two samples are independent and that their sizes are fixed in advance. We derived Fisher's exact test by conditioning on the total number of tickets in the sample labeled "1," and found that the test could be used in two quite different situations: to test the hypothesis that treatment has no effect when a fixed collection of individuals is randomized into treatment and control groups, so the treatment and control samples are dependent; and to test the hypothesis that two population percentages are equal from independent samples from the two populations.

Somewhat surprisingly, the normal approximation to Fisher's exact test is essentially the z test when the sample sizes are all large. (The difference is just the −1 in the denominator of the finite population correction, which is negligible if the samples are large.) That is, the z-score in the normal approximation to Fisher's exact test is almost exactly equal to the z-score in the z test for equality of two percentages using independent samples: The two tests reject for essentially the same observed data values.
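The near-equality of the two z-scores can be checked numerically. In this sketch the function names and the three data sets are hypothetical. Working through the algebra with N = nt + nc suggests that the Fisher z-score equals the z test z-score multiplied by ((N−1)/N)½, which is what the last column displays.

```python
from math import sqrt

def z_fisher_normal(x_t, n_t, x_c, n_c):
    """z-score in the normal approximation to Fisher's exact test."""
    N, G = n_t + n_c, x_t + x_c
    f = sqrt((N - n_t) / (N - 1))              # finite population correction
    sd = sqrt(G / N * (1 - G / N))
    return (x_t - n_t * G / N) / (f * sqrt(n_t) * sd)

def z_two_percentages(x_t, n_t, x_c, n_c):
    """z-score in the z test for equality of two percentages."""
    phi = (x_t + x_c) / (n_t + n_c)            # pooled sample percentage
    s_star = sqrt(phi * (1 - phi))
    return (x_t / n_t - x_c / n_c) / (s_star * sqrt(1 / n_t + 1 / n_c))

# Hypothetical data sets: (Xt, nt, Xc, nc).
examples = [(13, 100, 9, 100), (130, 1000, 90, 1000), (450, 3000, 260, 2000)]
for x_t, n_t, x_c, n_c in examples:
    N = n_t + n_c
    zf = z_fisher_normal(x_t, n_t, x_c, n_c)
    zz = z_two_percentages(x_t, n_t, x_c, n_c)
    print(f"N = {N}: Fisher z = {zf:.5f}, z test z = {zz:.5f}, "
          f"z test z x sqrt((N-1)/N) = {zz * sqrt((N - 1) / N):.5f}")
```

The correction factor is within 0.3% of 1 once N reaches a couple of hundred, so the two tests can disagree only when the z-score is very close to the threshold.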


It is rather surprising that tests derived under different assumptions behave so similarly. Generally, when the assumptions of a test are violated, the nominal significance level will be incorrect and the test should not be used. This is a rare exception.

## Summary

Suppose two variables, C and T, are defined for a group of N individuals: ci is the value of C for the ith individual, and ti is the value of T for the ith individual, i=1, 2, …, N. Suppose each ci and each ti can equal either 0 or 1, so that

pc=(c1 + c2 + … + cN)/N

is the population percentage of the values of C, and

pt=(t1 + t2 + … + tN)/N

is the population percentage of the values of T. A simple random sample of size nt will be taken from the population. The values of ti are observed for the units in the sample; for the N−nt units not in the sample, the values of ci are observed instead. This is the randomization model for evaluating whether a treatment has an effect in an experiment in which a fixed set of N units are assigned at random either to treatment or to control. The response of individual i is ti if he is treated and ci if not. At issue is whether the treatment has an effect. The null hypothesis is that treatment does not matter at all: ci=ti, for every individual i. Let G be the sum of all the observations, the observed values of ci plus the observed values of ti. Let Xt be the sum of the observed values of ti.

If the null hypothesis is true, the nt observed values of ti are like a random sample from a 0-1 box of N tickets of which G are labeled 1. Thus Xt has an hypergeometric distribution with parameters N, G, and nt. Fisher's exact test uses Xt as the test statistic, and this hypergeometric distribution to select the rejection region. If the alternative hypothesis is that pt > pc, then if the alternative hypothesis is true Xt would tend to be larger than it would be if the null hypothesis is true, so the hypothesis test should be of the form {Reject if Xt>x0}, with x0 chosen so that the test has the desired significance level. If the sample sizes are large, it can be difficult to calculate the rejection region for Fisher's exact test; then the normal approximation to the hypergeometric distribution can be used to construct a test with approximately the correct significance level. In the normal approximation to Fisher's exact test, the rejection region for approximate significance level α uses the threshold for rejection

$$x_0 = n_t \times G/N + z_{1-\alpha} \times f \times n_t^{1/2} \times \left( G/N \times (1 - G/N) \right)^{1/2},$$

where f is the finite population correction $(N - n_t)^{1/2}/(N - 1)^{1/2}$ and $z_{1-\alpha}$ is the 1 − α quantile of the normal curve. The α quantile of the normal curve, zα, is the number for which the area under the normal curve from minus infinity to zα equals α. For example, z0.05=−1.645, and z0.95=1.645.

A Z-statistic is a test statistic whose probability histogram can be approximated well by a normal curve if the null hypothesis is true. The observed value of a Z-statistic is called the z-score. In the normal approximation to Fisher's exact test,

$$Z = \frac{X_t - n_t \times G/N}{f \times n_t^{1/2} \times \left( G/N \times (1 - G/N) \right)^{1/2}}$$

is a Z statistic.

Suppose one wants to test the null hypothesis that two population percentages are equal, pt=pc, on the basis of independent random samples with replacement from the two populations. This is the population model for comparing two population percentages. Let nt denote the size of the random sample from the first population; let nc be the size of the sample from the second population; and let N=nt+nc be the total sample size. Let Xt denote the sample sum of the first sample; let Xc denote the sample sum of the second sample; and let

G=Xt+Xc

denote the sum of the two samples. Conditional on the value of G, the probability distribution of Xt is hypergeometric with parameters N, G, and nt, so Fisher's exact test can be used to test the null hypothesis. There is a different approximate approach based on the normal approximation to the probability distribution of the sample percentages: Let φt denote the sample percentage of the sample from the first population; let φc denote the sample percentage of the sample from the second population; and let φ denote the overall sample percentage of the two samples pooled together,

φ=(total number of "1"s in the two samples)/(total sample size) = G/N.

Then, if the null hypothesis is true,

E(φt−φc)=0.

If in addition nt and nc are large, SE(φt−φc) is approximately

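$$s^* \times \left( 1/n_t + 1/n_c \right)^{1/2},$$

where

$$s^* = \left( \phi \times (1 - \phi) \right)^{1/2}$$

is the pooled bootstrap estimate of the SD of the null box. Under the null hypothesis, for large sample sizes nt and nc, the probability histogram of

$$Z = \frac{\phi_t - \phi_c}{s^* \times \left( 1/n_t + 1/n_c \right)^{1/2}}$$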

can be approximated accurately by the normal curve, so Z is a Z-statistic. To test the null hypothesis against the one-sided alternative that pt<pc at approximate significance level α, use a one-sided test that rejects the null hypothesis when Z<zα. To test the null hypothesis against the one-sided alternative that pt>pc at approximate significance level α, use a one-sided test that rejects the null hypothesis when Z>z1−α. To test the null hypothesis against the two-sided alternative that pt≠pc at approximate significance level α, use a two-sided test that rejects the null hypothesis when |Z|≥z1−α/2. The Z test for the equality of two percentages is essentially equivalent to the normal approximation to Fisher's exact test when the sample sizes are all large, even though the assumptions of the tests differ.

## Key Terms

• 0-1 box
• alternative hypothesis
• binomial distribution
• bootstrap estimate
• complement
• control group
• dependent
• expected value
• experiment
• finite population correction
• Fisher’s exact test
• histogram
• hypergeometric distribution
• hypothesis testing
• independent
• independent random sample
• normal approximation
• normal curve
• null hypothesis
• one-sided
• parameter
• pooled bootstrap estimate of the SD
• population model
• population percentage
• probability
• probability distribution
• probability histogram
• quantile of the normal curve
• random sample
• random variable
• randomization model
• rejection region
• sample percentage
• sample size
• significance level
• simple random sample
• standard deviation
• standard error
• standard unit
• symmetric
• test statistic
• treatment
• treatment group
• two-sided
• Z statistic
• z-score
• z test