The best way to determine whether a treatment has an effect is to use the
method of comparison in an
experiment in which
subjects
are assigned at random to the treatment group
or the control group.

When the measurement of each subject can be represented by 0 or 1 (*e.g.*,
subject's condition improves or not, subject buys something or not, subject clicks a
link or not, subject passes an exam or not), deciding whether the treatment has an effect
is essentially testing the null hypothesis
that two percentages are equal—which is the problem this chapter addresses.

Different ways of drawing samples lead to different tests.
In one sampling design (the *randomization model*), the entire
collection of subjects is allocated randomly between treatment and control,
which makes the samples dependent.
Conditioning
on the total number of ones in the treatment and control groups leads to
*Fisher's exact test*, which is based on the
hypergeometric distribution
of the number of ones in the treatment group if the null hypothesis is true.
When the sample sizes are large, calculating the rejection region for
Fisher's exact test is cumbersome, but the
normal approximation
to the hypergeometric distribution gives an *approximate test*—a test
whose significance level
is approximately what it claims to be.

In a second sampling design (the *population model*), the two samples are
independent random samples with replacement
from two populations; conditioning on the total number of ones in the two samples
again leads to Fisher's exact test, which can be approximated as before.

There is another approximate approach to testing the null hypothesis in the population model: If the sample sizes are large (but the samples are drawn with replacement or are small compared to the two population sizes), the normal approximation to the distribution of the difference between the two sample percentages tends to be accurate. If the null hypothesis is true, the expected value of the difference between the sample percentages is zero, and the SE of the difference in sample percentages can be estimated by pooling the two samples. That allows one to transform the difference of sample percentages approximately into standard units, and to base an hypothesis test on the normal approximation to the probability distribution of the approximately standardized difference. Surprisingly, the resulting approximate test is essentially the normal approximation to Fisher's exact test, even though the assumptions of the two tests are different.

Suppose we own a start-up company that offers e-tailers a service for targeting their
Web advertising.
Consumers register with our service by filling out a form indicating their likes
and dislikes, gender, age, etc. We store cookies on each consumer’s computer to keep
track of who he is.
When a consumer with one of our cookies visits the Web site of any of our clients,
we use the consumer's likes and dislikes to select (from a collection of the client's ads)
the advertisement we think he is most likely to respond to.
This is called *targeted advertising*.
The targeting service is free to consumers; we charge the e-tailers.
We can raise venture capital if we can show that targeting makes e-tailers'
advertisements more effective.

We offer our service free to a large e-tailer. The e-tailer has a collection of advertisements that it usually uses in rotation: Each time a consumer arrives at the site, the server selects the next ad in the sequence to show to the consumer; the cycle starts over when all the ads have been shown.

To test whether targeting works, we implement a randomized, controlled, blind experiment by installing our software on the e-tailer's server to work as follows: Each time a consumer arrives at the site, with probability 50% the server shows the consumer the ad our targeting software selects, and with probability 50% the server shows the consumer the next ad in the rotation—the control ad. The decision of whether to show a consumer the targeted ad or the control ad is independent from consumer to consumer. For each consumer, the software records which strategy was used (target or rotation), and whether the consumer buys anything. The consumers who were shown the targeted ad comprise the treatment group; the other consumers comprise the control group. If a consumer visits the site more than once during the trial period, we ignore all of that consumer's visits but the first. Each subject (consumer) is assigned at random either to treatment or to control, and no subject knows which group he is in, so this is a controlled, randomized, blind experiment. There is no subjective element to determining whether a subject purchased something, so the lack of a double blind does not introduce bias.

Suppose that N consumers visit the site during the trial, that
n_{t} of
them are assigned to the treatment group, that n_{c}
of them are assigned
to the control group, and that G of the consumers buy something.
(The mnemonic is that c stands for control,
t for treatment,
and G for the number of *good*
customers—customers who buy something.)
We want to know whether the targeting affects whether subjects buy anything.
Only some of the consumers see the targeted ad, and only some see the
control ad, so answering this question involves hypothetical
counterfactual situations—what would have happened had all the
consumers been shown a targeted ad, and what would have happened
had all the consumers been shown a control ad.
We treat the N consumers as a fixed group, without regard for how
they were drawn from the more general population of people who shop online.
Any conclusions we draw about the consumers who visited the site might not
hold for the general population: We should be wary of extrapolating the
results to consumers who were not in the sample unless we know that the
randomized group is itself a random sample from the larger population.
This set-up, in which the N subjects are a fixed group and the only
random element is in allocating some of the subjects to the treatment
group and the rest to the control group, is called the *randomization model*.
Later in this chapter we consider a *population model*, in which the
treatment group and the control group are random samples from a much larger
population.
In the population model, the null hypothesis will be slightly different,
but we shall be able to extrapolate the results from the samples to the
populations from which they were drawn, because they were drawn at random.

We can think of the experiment in the following way: The ith
consumer has a ticket with two numbers on it:
The first number,
c_{i}, is 1 if the consumer would have bought something if shown the
control ad, and 0 if not.
The second number, t_{i}, is 1 if the consumer would have bought something
if shown the targeted ad, and 0 if not.
There are N tickets in all.
Under the null hypothesis that targeting has no effect,
t_{i}=c_{i} for each
i=1, 2, … , N.
That is, each consumer either will buy or will not buy,
regardless of the ad he is shown: Whether he will buy is determined
before he is assigned to treatment or control.

For the ith consumer, we observe either c_{i}
or t_{i}, but not both.
The percentage of consumers who would have purchased something
if every consumer had been shown the control ads is

p_{c} = ( c_{1} + c_{2} + … +
c_{N} )/N.

The percentage of consumers who would have bought something if all had been shown the targeted ads is

p_{t} = ( t_{1} + t_{2}
+ … + t_{N} )/N.

Let

μ = p_{t} − p_{c}

be the difference between the percentage of consumers who would have
bought had all been shown the targeted ad, and the percentage of
consumers who would have bought had all been shown the control ad.
Under the null hypothesis that targeting does not make a difference,
t_{i} = c_{i} for all
i=1, 2, … , N.
Thus if the null hypothesis is true, μ = 0, but the
hypothesis that μ = 0 is weaker than the null
hypothesis: If μ ≠ 0, the null hypothesis is
false, but the null hypothesis can be false and yet still μ = 0.
(That occurs if the number of consumers who would have bought something
if all had been shown the targeted ads is equal to the number of consumers
who would have bought something if all had been shown the control ads,
but the purchases were made by a different subset of consumers.)
The alternative hypothesis, that targeting helps, is that μ > 0.
We would like to test the null hypothesis at significance level 5%.

Let X_{t} be the number of sales to consumers
in the treatment group,
the sum of the observed values of t_{i}.
If the null hypothesis is true, the same G consumers would have bought
whether they were assigned to treatment or to control, and the number
of the consumers in the treatment group who bought something is the
number of those G in a simple random sample of size
n_{t} from the population of N consumers.
Thus, for any fixed values of N, G, and
n_{t}, X_{t} has an
hypergeometric distribution with parameters N, G,
and n_{t}.
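These hypergeometric probabilities can be computed directly from the counting formula using only Python's standard library. A minimal sketch; the sizes (N = 10 consumers, G = 4 buyers, n_t = 5 treated) are hypothetical, chosen only for illustration:

```python
from math import comb

def hypergeom_pmf(x, N, G, n_t):
    """Chance that a simple random sample of size n_t from a population of
    N tickets, G of which are labeled 1, contains exactly x ones."""
    return comb(G, x) * comb(N - G, n_t - x) / comb(N, n_t)

# Hypothetical small trial: N = 10 consumers, G = 4 buyers, n_t = 5 treated.
# X_t can be at most min(G, n_t) = 4.
probs = [hypergeom_pmf(x, 10, 4, 5) for x in range(0, 5)]
```

The probabilities sum to 1, as they must for a probability distribution.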

If the alternative hypothesis μ > 0 is true,
X_{t} tends to
be larger than it would be if the null hypothesis were true.
That is, our rejection region should contain all values exceeding
some threshold value x_{0}: We will reject the null hypothesis if
X_{t} > x_{0}.
We need to pick x_{0} so that the test has the desired significance level, 5%.

We cannot calculate the threshold value x_{0} until we know
N, n_{t}, and G.
Once we observe them, we can find the smallest value x_{0} so that the
probability that X_{t} is larger than
x_{0} if the
null hypothesis is true
is at most 5%, the chosen significance level.
Our rule for testing the null hypothesis then is to reject the null
hypothesis if X_{t} > x_{0},
and not to reject the null hypothesis otherwise.
This is called *Fisher's exact test* for the equality of two
percentages (against a one-sided alternative).
The test is called *exact* because its probability of a
Type I error can be computed exactly.
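The threshold x_{0} can be found by accumulating hypergeometric upper-tail probabilities until adding another term would push the tail above the significance level. A sketch under that approach (the numbers in the usage comment are hypothetical):

```python
from math import comb

def fisher_threshold(N, G, n_t, alpha=0.05):
    """Smallest x0 such that P(X_t > x0) <= alpha when X_t is
    hypergeometric with parameters N, G, and n_t."""
    def pmf(x):
        return comb(G, x) * comb(N - G, n_t - x) / comb(N, n_t)
    x0 = min(G, n_t)     # largest possible value of X_t: empty rejection region
    tail = 0.0           # P(X_t > x0) under the null hypothesis
    while x0 > 0 and tail + pmf(x0) <= alpha:
        tail += pmf(x0)  # lower x0 one step, enlarging the rejection region
        x0 -= 1
    return x0

# Hypothetical: N = 10 consumers, G = 4 buyers, n_t = 5 treated.
# Reject the null hypothesis when X_t > fisher_threshold(10, 4, 5).
```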

If N is large and n_{t}
is neither close to zero nor close to N,
computing the hypergeometric probabilities will be difficult,
but the normal approximation
to the probability distribution of X_{t}
should be accurate provided G
is neither too close to zero nor too close to N.
To calculate the normal approximation, we need to convert X_{t}
to standard units, which requires that we know the
expected value and SE
of X_{t}.
The expected value of X_{t} is

E(X_{t}) = n_{t}×G/N,

and the SE of X_{t} is

SE(X_{t}) = f
×n_{t}^{½}×SD,

where f is the finite population correction

f = (N − n_{t})^{½}/(N−1)^{½},

and SD is the standard deviation of a list of N values of which G equal 1 and (N−G) equal 0:

SD = ( G/N × (1 − G/N) )^{½}.

In standard units, X_{t} is

Z = (X_{t} −
E(X_{t}))/SE(X_{t}) =
(X_{t} − n_{t}×G/N
)/(f×n_{t}^{½}×SD).

Recall that we want to test the null hypothesis at significance level 5%.
The area under the normal curve
to the right of 1.645 standard units is 5%, so

x_{0} = E(X_{t}) +
1.645×SE(X_{t}) = n_{t}×G/N +
1.645×f×n_{t}^{½}×SD

= n_{t}×G/N +
1.645×f×n_{t}^{½}×
( G/N × (1 − G/N) )^{½}

in the original units. Thus if we reject the null hypothesis when

Z>1.645,

or equivalently when

X_{t} > n_{t}×G/N +
1.645×f×n_{t}^{½}×(
G/N × (1 − G/N) )^{½},

the result is an (approximate) 5% significance level test of the null
hypothesis that targeting has no effect, against the alternative
hypothesis that targeting increases the fraction of consumers who buy.
This is the *normal approximation to Fisher's exact test*; Z is called the
*Z-statistic*, and the observed value of Z is called the *z-score*.
Later in this chapter we shall see that this test is essentially equivalent to the
*z* test for equality of two percentages from independent samples.
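The whole calculation can be sketched in Python using only the standard library; the counts below (N = 1000 visitors, n_t = 500 treated, G = 100 buyers, X_t = 60) are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_fisher(N, G, n_t, x_t, alpha=0.05):
    """Normal approximation to Fisher's exact test (one-sided).
    Returns the z-score and whether to reject at level alpha."""
    ev = n_t * G / N                               # E(X_t)
    f = sqrt((N - n_t) / (N - 1))                  # finite population correction
    sd = sqrt((G / N) * (1 - G / N))               # SD of the 0-1 list
    se = f * sqrt(n_t) * sd                        # SE(X_t)
    z = (x_t - ev) / se
    return z, z > NormalDist().inv_cdf(1 - alpha)  # threshold is 1.645 at 5%

z, reject = normal_approx_fisher(N=1000, G=100, n_t=500, x_t=60)
```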

To test at a significance level other than 5%, reject when Z exceeds a
different threshold; choose the threshold so that the area under
the normal curve to the right of the threshold is equal to the desired
significance level.
For example, to test at significance level 0.1, reject if Z>1.282.
Some new notation will make it easier to express the general strategy.
Recall that the normal curve is positive, and the total area under the
normal curve is 100%, so the normal curve is like a histogram.
Quantiles of the normal curve are defined as follows: For any number α between 0 and 100%,
the α quantile of the normal curve, z_{α},
is the unique number such
that the area under the normal curve to the left of z_{α}
is α.
For example, z_{50%} = 0, because the area under the normal curve
to the left of zero is 50%.

Because the normal curve is symmetric about zero,
z_{100%−α} = −z_{α}.
Note that the area under the normal curve between z_{α} and
z_{100%−α} is
100% − 2×α.
Combining these two results shows that the area under the normal
curve over the interval

[−z_{100%−α/2}, z_{100%−α/2}]

is 100% − α, and thus the area under the normal curve outside the interval (the area under the normal curve over the complement of the interval) is α. This complement also can be written

all values z such that |z| > z_{100%−α/2}.

With this notation for quantiles of the normal curve, it is easier to write down the
rejection region of the normal approximation to Fisher's exact test for a general
significance level: The significance level of the rule
{Reject the null hypothesis if Z>z_{100%−α}}
is approximately α.
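Quantiles of the normal curve are available in Python's standard library via `statistics.NormalDist`; a quick sketch of the values used in this section:

```python
from statistics import NormalDist

def z_quantile(p):
    """The p quantile of the normal curve: the area to its left is p."""
    return NormalDist().inv_cdf(p)

z_quantile(0.50)   # 0.0: half the area lies to the left of zero
z_quantile(0.95)   # about 1.645, the threshold for a 5% one-sided test
z_quantile(0.90)   # about 1.282, the threshold for a 10% one-sided test
# Symmetry about zero: z_quantile(0.05) equals -z_quantile(0.95).
```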

The following exercise checks your ability to use the normal approximation to Fisher's exact test. The exercise is dynamic: The data will tend to change when you reload the page, so you can practice as much as you wish.

In the experiment to test the effectiveness of targeted advertising using the
randomization model described previously in this chapter,
the samples from the populations of control and treatment values are dependent:
Individual i has two numbers, c_{i} and
t_{i}, and if we observe c_{i}
we cannot observe t_{i}, and vice versa.
If individual i is in the treatment group, he or she is not
in the control group, and vice versa. Under the null hypothesis,
the purchasers would have bought whether they were assigned
to treatment or to control, and the non-purchasers would not have bought
whether they were assigned to treatment or control, so the total number
of purchasers does not depend on which consumers were assigned to treatment.
That constancy led to an hypergeometric distribution for the number of
purchasers in the treatment group under the null hypothesis.
In this section, we see that Fisher's exact test allows us to test a
slightly weaker null hypothesis when the data are two independent random
samples with replacement from separate populations, a control group and a
treatment group.
This is the *population model* for comparing two percentages.

The weaker hypothesis is that the population percentage of the treatment group is equal to the population percentage of the control group. We also develop an approximate test for the equality of two percentages based on the sample percentages of independent random samples with replacement from two populations. The approximate test is essentially equivalent to the normal approximation to Fisher's exact test when the sample sizes are large.

Suppose there are two populations of tickets labeled 0 and 1, a control
group and a treatment group, with corresponding population percentages
p_{c} and p_{t}.
We want to test the null hypothesis that

p_{c} = p_{t};

*i.e.*, that

μ = p_{t} − p_{c} = 0.

We draw a random sample of size n_{c} with replacement from the
control group, and compute
the sample sum X_{c}.
X_{c} has a binomial distribution with parameters
n_{c} and p_{c}.
We draw another random sample of size n_{t} with
replacement from the treatment group,
and compute the sample sum X_{t}.
X_{t} has a binomial distribution with parameters
n_{t} and p_{t}.
We draw the two random samples independently of each other, so
X_{c} and X_{t} are
independent random variables.
This scenario could correspond to an observational study, to a
non-randomized experiment, or to a randomized experiment, depending upon how individuals
came to be in the treatment group and the control group.
The randomness in the problem at this point is in drawing the samples
from the control group and the treatment group, not in assigning subjects
to treatment or to control—that assignment occurred before we
arrived on the scene.
In this population model, we might be able to conclude from the data
that the population percentages differ for the treatment and control
groups (that p_{t} ≠p_{c}),
but even then we should not conclude that treatment has an effect unless the
assignment of subjects to treatment and control was randomized.
Otherwise, any real difference between p_{t}
and p_{c} could be the result of confounding,
rather than the result of the treatment.

In contrast, in the randomization model described earlier in the
chapter, we might be able to conclude that the treatment has an
effect for the N subjects in the randomization, but even then we
should not extrapolate from those N subjects to conclude that
treatment has an effect in the larger population from which the
subjects were drawn, because we did not know *how* they were drawn.

Let N = n_{c} + n_{t}.
Let G = X_{c}+X_{t}
be the sum of both the samples—the total number of ones.
Given the value of G, the distribution of X_{t}
is hypergeometric with parameters
N, G, and n_{t} if the null
hypothesis is true.
This can be shown by counting: If the null hypothesis is true,
every way of dividing the N observations into a group of size
n_{t} and a group of size
n_{c} is equally likely.
There are _{N}C_{nt}
such ways, of which

_{G}C_{x}×
_{N−G}C_{nt−x}

result in x ones and n_{t} − x zeros among the
n_{t} values drawn from the treatment group,
so the chance that X_{t}=x is

_{G}C_{x} ×
_{N−G}C_{nt−x}/_{N}C_{nt}:

given the value of G, X_{t} has an hypergeometric
distribution with parameters N, G, and n_{t}
if the null hypothesis is true.
Thus, under the null hypothesis that
p_{c}=p_{t},
given the total number G of ones in the sample,
the test statistic
X_{t} has the same distribution for this
sampling design that the test statistic X_{t}
did for a population of N subjects assigned randomly to
treatment or control.
Therefore, the same testing procedure, Fisher's exact test, can be
used to test the null hypothesis that
p_{c}=p_{t}
using independent random samples from two populations.
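This conditioning argument can be checked numerically: the conditional distribution of X_{t} given G, computed from two independent binomials with a common p, matches the hypergeometric distribution exactly and does not depend on p. A sketch with hypothetical small sample sizes:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n independent draws."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_t, n_c, G = 6, 8, 5      # hypothetical sample sizes and total number of ones
N = n_t + n_c
p = 0.3                     # hypothetical common null percentage; the result
                            # does not depend on its value

# P(X_t = x | X_t + X_c = G) under the null hypothesis...
p_G = sum(binom_pmf(x, n_t, p) * binom_pmf(G - x, n_c, p) for x in range(G + 1))
conditional = [binom_pmf(x, n_t, p) * binom_pmf(G - x, n_c, p) / p_G
               for x in range(G + 1)]

# ...equals the hypergeometric distribution with parameters N, G, and n_t:
hypergeom = [comb(G, x) * comb(N - G, n_t - x) / comb(N, n_t)
             for x in range(G + 1)]
```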
In the previous section, the null hypothesis and the sampling design
were different: Each subject i had two values,
t_{i}
and c_{i}; the null
hypothesis was that

t_{i} = c_{i} for all
i=1, 2, … ,N;

and each of the N subjects was assigned at random either to treatment or to control.

As noted previously, it is hard to perform the calculations needed to find the rejection region for this test when N is large; the normal approximation to Fisher's exact test described in the previous section is a computationally tractable way to construct the rejection region. The approximation is accurate under the same assumptions.

In this section, we develop another approximate test of the null hypothesis that
p_{t}=p_{c}
in the population model; it turns out that this test is essentially the same
as the normal approximation to Fisher's exact test, although it is motivated
quite differently.

Let φ_{c} be the sample percentage of the random
sample from the control group, and let φ_{t} be the
sample percentage of the random sample from the treatment group.
Suppose that the two sample sizes n_{c} and
n_{t} are large (say, over 100 each).
Then the normal approximations to the two sample percentages should be
accurate (provided neither p_{c} nor
p_{t} is too close to 0 or to 1).
The expected value of the sample percentage of a random sample with
replacement is the population percentage, so the expected value of
φ_{c} is p_{c},
and the expected value of φ_{t} is
p_{t}.
The SE of φ_{c} is

SE(φ_{c}) = (
p_{c}×(1−p_{c})
)^{½}/n_{c}^{½},

and the SE of φ_{t} is

SE(φ_{t}) = (
p_{t}×(1−p_{t})
)^{½}/n_{t}^{½}.

Consider the difference of the two sample percentages

φ_{t−c} =
φ_{t} − φ_{c}.

The difference φ_{t−c} is a random variable.
The expected value of φ_{t−c} is

μ = p_{t} − p_{c}.

Because the samples from the treatment and control groups are independent of
each other, φ_{t} and φ_{c}
are independent, so the SE of φ_{t−c} is

SE(φ_{t−c}) = ( SE^{2}(φ_{t}) +
SE^{2}(φ_{c}) )^{½}.

If the null hypothesis is true, the two population percentages are
equal—p_{t}=p_{c}=p—and
the two samples are like one larger sample from a single 0-1 box with a percentage p
of tickets labeled "1."
Let us call that box the *null box*.
If the null hypothesis is true, the expected value of φ_{t−c},
E(φ_{t−c}), is zero, and

SE(φ_{t−c}) =
(p×(1−p)/n_{t} +
p×(1−p)/n_{c} )^{½}

= ( 1/n_{t} + 1/n_{c} )^{½}
× (p×(1−p))^{½}

= ( N/(n_{t}×n_{c}) )^{½}
× (p×(1−p))^{½}.

The first factor depends only on the sample sizes n_{t} and
n_{c}, which we know.
The second factor is the SD of the labels on the tickets in the null box.
That factor depends only on p, the percentage of tickets labeled "1"
in the null box.
We do not know p, so we do not know the SD of the null box.
However, we can use the bootstrap estimate of the SD of the
null box because the sample size is large: let φ be
the pooled sample percentage

φ = (total number of "1"s in both samples)/(total sample size)

= (n_{c}×φ_{c} +
n_{t}×φ_{t})/N.

The pooled bootstrap estimate of the SD of the null box is the estimate we get by pretending that the percentage of ones in the null box is equal to the percentage of ones in the pooled sample:

s^{*} = (pooled bootstrap estimate of SD of the null box) =
( φ×(1−φ) )^{½}.

If the sample sizes are large and the null hypothesis is true, this will tend to be close to the true SD of the null box, and

SE*(φ_{t−c}) =
( N/(n_{t}×n_{c})
)^{½}×s^{*}

will tend to be quite close to SE(φ_{t−c}).
The normal approximation to the probability distribution of
φ_{t−c} tells us that the chance
that φ_{t−c} is in a given range is
approximately equal to the area under the normal curve for the same range,
converted to standard units. Under the null hypothesis, the expected
value of φ_{t−c} is zero,
and SE(φ_{t−c}) is approximately
SE*(φ_{t−c}), so

Z = φ_{t−c}/SE*(φ_{t−c})

is approximately φ_{t−c} in standard units:
The chance that Z is in the range of values
[a, b] is approximately the
area under the normal curve between a and b.

Under the alternative hypothesis that
p_{t}>p_{c},
Z will tend to be larger than it would under the null hypothesis.
We can test the null hypothesis that
p_{t}=p_{c}
against the one-sided alternative hypothesis that
p_{t}>p_{c}
using

Z =
φ_{t−c}/SE*(φ_{t−c})

as the test statistic.
To test at approximate significance level α, reject the null hypothesis if
Z > z_{1−α}.

This is called the (one-sided) *z test for equality of two percentages using
independent samples*.
The random variable Z is called the *Z-statistic* for the test, and its
observed value is called the *z-score*.

This test is based on transforming the difference of sample percentages
(the test statistic) approximately to standard units, under the assumption
that the null hypothesis is true.
Because the null hypothesis specifies that the two population percentages
are equal, the expected value of the difference between the sample percentages
is zero—the expected value of each sample percentage is p.
However, the null hypothesis does not specify the value of p,
and the SE of the difference of sample percentages depends on p,
so we cannot calculate the SE of the test statistic under the null
hypothesis—we have to estimate
SE(φ_{t − c})
from the data.
When the combined sample size
N=n_{t}+n_{c} is
sufficiently large,
the pooled bootstrap estimate of the SD of the null box is likely
to be quite accurate, and the estimated SE is likely to be very close to the
true SE.
When the individual sample sizes are large, the probability histogram of
the difference of sample percentages can be approximated well by the normal curve.
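Putting the pieces together, the one-sided z test can be sketched as follows; the counts in the usage line (60 of 500 treated consumers bought, versus 40 of 500 controls) are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(x_t, n_t, x_c, n_c, alpha=0.05):
    """One-sided z test of p_t = p_c against p_t > p_c, using the
    pooled bootstrap estimate of the SD of the null box."""
    phi_t, phi_c = x_t / n_t, x_c / n_c         # sample percentages
    phi = (x_t + x_c) / (n_t + n_c)             # pooled sample percentage
    s_star = sqrt(phi * (1 - phi))              # pooled bootstrap SD estimate
    se_star = s_star * sqrt(1 / n_t + 1 / n_c)  # estimated SE of phi_t - phi_c
    z = (phi_t - phi_c) / se_star
    return z, z > NormalDist().inv_cdf(1 - alpha)

# Hypothetical data: 60 of 500 treated consumers bought; 40 of 500 controls did.
z, reject = two_sample_z_test(60, 500, 40, 500)
```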

When the samples are not large, the estimated SE will tend to differ from the
true SE, and the normal approximation to the distribution of the difference
of sample percentages will not be accurate.
Then, the actual significance level of the *z* test can be very different from its
nominal significance level, and we need to be more circumspect.
The following exercise checks your ability to calculate the *z* test for
equality of two percentages from independent samples.

We derived the *z* test for equality of two percentages using the assumption
that the two samples are independent and that their sizes are fixed in advance.
We derived Fisher's exact test by conditioning on the total number of tickets
in the sample labeled "1," and found that the test could be used in
two quite different situations: to test the hypothesis that treatment has no
effect when a fixed collection of individuals are randomized into treatment
and control groups, so the treatment and control samples are dependent;
and to test the hypothesis that two population percentages are equal from
independent samples from the two populations.

Somewhat surprisingly, the normal approximation to Fisher's exact test is
essentially the *z* test when the sample sizes are all large.
(The difference is just the −1 in the denominator of the finite
population correction, which is negligible if the samples are large.)
That is, the *z* score in the normal approximation to Fisher's exact
test is almost exactly equal to the *z* score in the *z*-test for
equality of two percentages using independent samples:
The two tests reject for essentially the same observed data values.
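The near-equality is easy to verify numerically. With hypothetical counts (60 of 500 in the treatment sample and 40 of 500 in the control sample labeled "1"), the two z-scores agree to about three decimal places, and their ratio is exactly the factor ((N−1)/N)^{½} contributed by the −1 in the finite population correction:

```python
from math import sqrt

x_t, n_t, x_c, n_c = 60, 500, 40, 500    # hypothetical counts
N, G = n_t + n_c, x_t + x_c

# z-score in the normal approximation to Fisher's exact test:
f = sqrt((N - n_t) / (N - 1))            # finite population correction
sd = sqrt(G / N * (1 - G / N))           # SD of the 0-1 list
z_fisher = (x_t - n_t * G / N) / (f * sqrt(n_t) * sd)

# z-score in the z test for equality of two percentages:
s_star = sqrt(G / N * (1 - G / N))       # pooled bootstrap SD estimate
se_star = s_star * sqrt(1 / n_t + 1 / n_c)
z_test = (x_t / n_t - x_c / n_c) / se_star
```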

The following example illustrates the approximate equivalence between the *z* test
and the normal approximation to Fisher's exact test.
The example is dynamic:
The data will tend to change when you reload the page, to provide more
examples of the computations involved.

It is rather surprising that tests derived under different assumptions behave so similarly. Generally, when the assumptions of a test are violated, the nominal significance level will be incorrect and the test should not be used. This is a rare exception.

Suppose two variables, C and T,
are defined for a group of N
individuals: c_{i}
is the value of C for the ith
individual, and t_{i} is the
value of T for the ith individual,
i=1, 2, …, N.
Suppose each c_{i} and each t_{i}
can equal either 0 or 1, so that

p_{c}=(c_{1} + c_{2} +
… + c_{N})/N

is the population percentage of the values of C, and

p_{t}=(t_{1} + t_{2} +
… + t_{N})/N

is the population percentage of the values of T.
A simple random sample of size n_{t} will be taken from the population.
The values of t_{i} are observed for the units in the sample;
for the N−n_{t} units not in the sample,
the values of c_{i} are observed instead.
This is the *randomization model* for evaluating whether a treatment has
an effect in an experiment in which a fixed set of N units are
assigned at random either to treatment or to control.
The response of individual i is
t_{i} if he is treated
and c_{i} if not.
At issue is whether the treatment has an effect.
The null hypothesis is that treatment does not matter at all:
c_{i}=t_{i}, for every individual i.
Let G be the sum of all the observations, the observed values of
c_{i} plus the observed values of
t_{i}.
Let X_{t} be the sum of the observed values
of t_{i}.

If the null hypothesis is true, the n_{t} observed values of
t_{i} are like
a random sample from a 0-1 box of N tickets of which
G are labeled 1.
Thus X_{t} has an hypergeometric distribution with parameters
N, G, and n_{t}.
Fisher’s exact test uses X_{t} as the test statistic, and this
hypergeometric distribution to select the rejection region.
If the alternative hypothesis is that
p_{t} > p_{c}, then if the alternative
hypothesis is true X_{t} would tend to be larger than it would be if the
null hypothesis is true, so the hypothesis test should be of the form
{Reject if X_{t}>x_{0}},
with x_{0}
chosen so that the test has the
desired significance level.
If the sample sizes are large, it can be difficult to calculate the
rejection region for Fisher's exact test; then the normal approximation
to the hypergeometric distribution can be used to construct a test
with approximately the correct significance level.
In the normal approximation to Fisher's exact test, the rejection
region for approximate significance level α uses the threshold for
rejection

x_{0}=n_{t}×G/N +
z_{1 − α}×f×n_{t}^{½}×(G/N
×(1 − G/N))^{½},

where f is the finite population correction
(N−n_{t})^{½}/(N−1)^{½}
and z_{1−α} is the
1 − α
quantile of the normal curve.
The α quantile of the normal curve,
z_{α}, is the number for which
the area under the normal curve from minus infinity to
z_{α} equals α.
For example, z_{0.05}=−1.645, and
z_{0.95}=1.645.

A *Z*-statistic is a test statistic whose
probability histogram
can be approximated well by a normal curve if the null hypothesis is true.
The observed value of a *Z*-statistic is called the *z*-score.
In Fisher's exact test,

Z =
(X_{t}−n_{t} ×
G/N)/(f×n_{t}^{½}×(G/N ×(1−G/N))^{½})

is a *Z*-statistic.

Suppose one wants to test the null hypothesis that two population percentages
are equal, p_{t}=p_{c}, on
the basis of independent random samples with replacement
from the two populations.
This is the population model for comparing two population percentages.
Let n_{t} denote the size of the random sample from the first population;
let n_{c} be the size of the sample from the second population; and
let N=n_{t}+n_{c} be the
total sample size.
Let X_{t} denote the sample sum of the first sample;
let X_{c} denote the sample
sum of the second sample; and let

G=X_{t}+X_{c}

denote the sum of the two samples.
Conditional on the value of G, the probability distribution of
X_{t} is
hypergeometric with parameters N,
G, and n_{t},
so Fisher's exact test can be used to test the null hypothesis.
There is a different approximate approach based on the normal approximation
to the probability distribution of the sample percentages:
Let φ_{t} denote the sample percentage of the sample from the
first population;
let φ_{c} denote the sample percentage of the sample from the
second population; and let φ denote the overall sample percentage of the two
samples pooled together,

φ=(total number of "1"s in the two samples)/(total sample size) = G/N.

Then, if the null hypothesis is true,

E(φ_{t}−φ_{c})=0.

If in addition n_{t} and
n_{c}
are large,
SE(φ_{t}−φ_{c}) is approximately

s^{*}×(1/n_{t} +
1/n_{c})^{½},

where

s^{*}=(φ×(1−φ))^{½}

is the *pooled bootstrap estimate* of the SD of the null box.
Under the null hypothesis, for large sample sizes n_{t}
and n_{c}, the probability histogram of

Z =
(φ_{t}−φ_{c})/(s^{*} ×
(1/n_{t}
+ 1/n_{c})^{½})

can be approximated accurately by the normal curve, so Z is a
Z-statistic.
To test the null hypothesis against the one-sided alternative that
p_{t}<p_{c}
at approximate significance level α, use a one-sided test that rejects
the null hypothesis when
Z<z_{α}.
To test the null hypothesis against the one-sided alternative that
p_{t}>p_{c} at approximate
significance level α, use a one-sided test
that rejects the null hypothesis when Z>z_{1−α}.
To test the null hypothesis against the two-sided alternative that
p_{t}≠p_{c} at
approximate significance level α, use a
two-sided test that rejects the null hypothesis when
|Z|≥z_{1−α/2}.
The *Z* test for the equality of two percentages is essentially equivalent
to the normal approximation to Fisher's exact test when the sample sizes
are all large, even though the assumptions of the tests differ.
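The three rejection rules can be collected into one sketch; `z_test_decision` is a hypothetical helper, not a function from the text:

```python
from statistics import NormalDist

def z_test_decision(z, alpha, alternative):
    """Rejection rules for the z test at approximate significance level alpha.
    alternative is 'less' (p_t < p_c), 'greater' (p_t > p_c), or 'two-sided'."""
    q = NormalDist().inv_cdf
    if alternative == "less":
        return z < q(alpha)               # reject when Z < z_alpha
    if alternative == "greater":
        return z > q(1 - alpha)           # reject when Z > z_(1 - alpha)
    return abs(z) >= q(1 - alpha / 2)     # reject when |Z| >= z_(1 - alpha/2)
```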

- 0-1 box
- alternative hypothesis
- binomial distribution
- bootstrap estimate
- complement
- control group
- dependent
- expected value
- experiment
- finite population correction
- Fisher’s exact test
- histogram
- hypergeometric distribution
- hypothesis testing
- independent
- independent random sample
- normal approximation
- normal curve
- null hypothesis
- one-sided
- parameter
- pooled bootstrap estimate of the SD
- population model
- population percentage
- probability
- probability distribution
- probability histogram
- quantile of the normal curve
- random sample
- random variable
- randomization model
- rejection region
- sample percentage
- sample size
- significance level
- simple random sample
- standard deviation
- standard error
- standard unit
- symmetric
- test statistic
- treatment
- treatment group
- two-sided
- *Z*-statistic
- *z*-score
- *z* test