• The randomization model for comparing two treatments (including a treatment against control), for quantitative responses. Alternative hypotheses: shift, dispersion, omnibus. The ticket model: two fixed numbers per individual. Strong and weak null hypotheses.
• Permutation test based on the sample sum of the responses of the treatment group. Approximating P-values by simulation; connection to bootstrap tests.
• The 2-sample t-test in the randomization model. The permutation t-test.
• Fisher's Exact Test and its normal approximation; the Lady Tasting Tea experiment

References: Lehmann, E.L., 1998. Nonparametrics: Statistical Methods Based on Ranks. Upper Saddle River, N.J.: Prentice Hall; SticiGui Chapter 19.

## Comparing two treatments (e.g., treatment and control) in the randomization model

There are N subjects. The subjects are given; they are not necessarily a sample from some larger population. We assign a simple random sample of size n of the N subjects to treatment, and the remaining m=Nn subjects to control. For each subject, we observe a quantitative response. There is no assumption about the values of that quantitative response; they need not follow any particular distribution. The null hypothesis is that treatment does not matter. Several alternatives are interesting. The most common are the shift alternative, the dispersion alternative, and the omnibus alternative. The shift alternative is that treatment changes the mean response. (There are left-sided, right-sided and two-sided versions of the shift alternative.) The dispersion alternative is that treatment changes the scatter of the responses. The omnibus alternative is that treatment changes the response in some way—any way whatsoever.

Because of the deliberate randomization, whether treatment affects the response within the group of N subjects can be addressed rigorously. Up to sampling error—which can be quantified—differences between the responses of the treatment and control groups must be due to the effect of treatment; the randomization tends to balance other factors that affect the responses, and that otherwise would lead to confounding. However, conclusions about the effect of treatment among the N subjects cannot be extrapolated to any other population, because we do not know where the subjects came from (how they came to be part of the experiment).

We model the experiment as follows: Each of the N subjects is represented by a ticket with two numbers on it, a left and a right number. The left number is the response the subject would have if assigned to the control group; the right number is the response the subject would have if assigned to the treatment group. These numbers are written on the tickets before the experiment starts. Assigning the subject to treatment or control only determines whether we observe the left or the right number for that subject. Let xj be the left number on the jth ticket and let yj be the right number on the jth ticket. The strong null hypothesis is

xj = yj,    j = 1, 2, … , N.

That is, the strong null hypothesis is that the left and right numbers on each ticket are equal. Subject by subject, treatment makes no difference at all. The weak null hypothesis is that the average of the left numbers equals the average of the right numbers:

(x1 + x2 + … + xN)/N  =  (y1 + y2 + … + yN)/N.

In the weak null hypothesis, treatment makes no difference on average: treatment might increase the responses for some individuals, provided it decreases the responses of other individuals by a balancing amount.

In this ticket model, if the strong null hypothesis is true, the assignment to treatment or control is an arbitrary random labeling of the subjects. The N responses are the same no matter which subjects are assigned to treatment. Let {Y1, … , Yn} denote the responses of the n treated subjects. Under the strong null hypothesis, xj = yj, j = 1, … , N, where xj is the observed response of the jth subject. Now

Yj = xkj,    j = 1, … , n,

where {kj: j=1, 2, … , n} is a subset of size n of the integers {1, 2, … , N}. Consider the sum of the treatment responses, the statistic

Y = Y1 + Y2 + … + Yn.

Under the strong null hypothesis, the expected value of Y is n times the average of all N responses. (For a review of the probability distribution of the sample sum of random draws from a finite population of numbers, see Chapters 12-15 of SticiGui.) If treatment tends to increase the response, we would expect Y to be larger than that; if treatment tends to decrease the response, we would expect Y to be smaller than that. We can use Y as the test statistic to test the strong null hypothesis. For the alternative hypothesis that treatment increases the response, the test would be of the form

Reject the strong null hypothesis if Yc,

where c is chosen so that the test has the desired significance level α, i.e., so that if the strong null hypothesis is true,

P(Y > c) ≤ α.

This is a permutation test based on Y, the sum of the treatment responses. Note that the null distribution of Y (and thus the critical value c) depends on the observed responses. Similarly, to test against the alternative that treatment decreases the response, the test would be of the form

Reject the strong null hypothesis if Yc,

with c chosen to attain the desired significance level. To test against the alternative that treatment increases or decreases the response, we could use a test of the form

Reject the strong null hypothesis if Y < c1 or Y > c2.

There is some latitude in choosing c1 and c2; two standard choices are to pick them symmetrically about the expected value of Y under the null hypothesis, or to pick them so that under the null hypothesis

P(Y < c1) is as close as possible to P(Y > c2),

subject to the constraint that the significance level is attained. (Because the permutation distribution of Y is discrete, it can happen that no c, or (c1, c2) yields exactly the desired significance level. Then one either can choose the largest attainable significance level that does not exceed the nominal significance level, or one can use a randomized test to attain the desired significance level exactly. We won't worry about this much.)

For example, suppose that N=5, n=2, m=3, and the responses are as follows:

The two treated subjects have response values 3 and 4, and the three controls have response values 1, 2, and 4. So the observed value of Y is 7. Under the strong null hypothesis, the possible values of Y are 3, 4, 5, 6, 7, and 8, because (if the strong null hypothesis is true) the treatment responses are a simple random sample of size 2 from the set {1, 2, 3, 4, 4}. There are 5C2 = 10 ways of assigning two subjects to treatment and the others to control. Each has probability 1/10 by design. The null distribution of T is given in

Under the strong null, the probability that Y is 7 or greater is 0.3; this is the P-value for a one-sided permutation test against the alternative that treatment increases the response, based on Y, the sum of the responses of the treatment group.

For larger data sets, working out the null distribution of Y analytically can be difficult. For example, if we have 100 subjects of whom 50 are to be assigned to treatment, there are about 1029 possible assignments of subjects to treatment or control. Each assignment requires on the order of 50 floating point operations to compute Y. Using a computer that can calculate at the rate 1GFlop/s (109 floating point operations per second), it would take about 1.6×1014 years to compute all the values of Y. The universe is about 1.5×1010 years old. For N=50, n=25, there are 50C25 possible ways to assign the subjects to treatment, which could give as many as 1014 different values for Y; it would take a couple of months to compute all the values of Y at that rate. In such situations, calculating the null distribution exactly is impossible, but we still can approximate the null probability distribution of Y with a variety of numerical simulations.

The most straightforward conceptually is to generate simple random samples of size n from the N (pooled control and treatment) responses repeatedly, calculate Y for each, and find the empirical distribution of those "observed" values. This approximates the null distribution of Y for the permutation test. Alternatively, we might sample n of the responses with replacement; this leads to a bootstrap test. When N is large and n is small compared to N, it does not make much difference whether sampling is with or without replacement: the permutation and bootstrap tests should give similar P-values. When N is small and when n is an appreciable fraction of N, the tests differ. (See Romano, J.P., 1989. Bootstrap and randomization tests of some nonparametric hypotheses, Ann. Stat., 17, 141–159. More on this later.)

is an applet that allows you to simulate the sampling distribution of the sum of n draws with or without replacement from a population of N values. That lets you simulate the null distribution of the permutation test and the bootstrap test based on the sum of the treatment responses.

You need Java to see this.

Each time you press the Take Sample button at the top of the applet, the computer draws a pseudorandom sample from the box on the right, calculates the sample sum, and adds its contribution to a histogram of observed values in the middle. A tic box at the top controls whether the sampling is with or without replacement (sampling without replacement corresponds to the permutation test; sampling with replacement corresponds to the bootstrap test). Press the Take Sample button a few times to get the feel, then change the Take ______ Samples box at the bottom of the plot from 1 to 1000, and press the Take Sample button until you have drawn 10,000 samples of size 2. The observed relative frequencies of the values of the sample sum should be quite close to the probabilities in table

The applet also lets you simulate the null distribution of Y in situations in which working out the null distribution analytically would be impractical. How accurate is the simulation? Ignoring issues with the pseudorandom number generator (not all algorithms for simulating random numbers behave well), we can get a handle on the probable accuracy of the simulation as follows: The standard error of the empirical probability in k independent trials with the same true probability p of success is

(p(1−p)/k)1/2 ≤ 1/(2k1/2).

For k=10,000 trials, this bound on the standard error is half a percent.

## Pseudo-Random Number Generators

Most computers cannot generate truly random numbers, although there is special equipment that can (usually, these rely on a physical source of "noise," such as a resistor or a radiation detector). Most so-called random numbers generated by computers are really "pseudo-random" numbers, sequences generated by a software algorithm from a starting point, called a seed. Pseudo-random numbers behave much like random numbers for many purposes.

The seed of a pseudo-random number generator can be thought of as the state of the algorithm. Each time the algorithm produces a number, the it alters its state—deterministically. If you start a given algorithm from the same seed, you will get the same sequence of pseudo-random numbers. Each pseudo-random number generator has only finitely many states. Eventually—after the period of the generator, the generator gets back to its initial state and the sequence repeats.

Better generators have more states and longer periods, but that comes at a price: speed. There is a tradeoff between the computational efficiency of a pseudo-random number generator and the difficulty of telling that its output is not really random (measured, for example, by the number of bits one must examine). See http://csrc.nist.gov/rng/ for a suite of tests of pseudo-random number generators. Tests can be based on statistics such as the number of zero and one bits in a block or sequence, the number of runs in sequences of differing lengths, the length of the longest run, spectral properties, compressibility (the less random a sequence is, the easier it is to compress), and so on.

No pseudo-random number generator is best for all purposes. You should check which algorithm is used by any software package you rely on for simulations. The Linear Congruential Generator, which used to be quite common, is best avoided. (It tends to have a short period, and the sequences it generates have underlying regularity that can spoil its performance for many purposes.) For statistical simulations, a particularly good pseudo-random number generator is the Mersenne Twister. For cryptography, a higher level of randomness is needed than for most statistical simulations.

Here is a Matlab function to estimate by simulation the P-value for a permutation test based on the sum of the treatment responses. The alternative is one-sided: that treatment increases the response.

```        function p = simPermuTest(x, y, iter)
% function p = simPermuTest(x, y, iter)
%
% P.B. Stark, www.stat.berkeley.edu/~stark 9/12/05
% simulated P-value for a one-sided permutation test based on the sum of
% the treatment responses for the strong null hypothesis that treatment
% has no effect whatsoever
%
% x is the vector of control observations
% y is the vector of treatment observations
% the lengths of x and y determine the sizes of the control and treatment
%  groups
% iter is the number of replications.
%
ts = sum(y);                      % test statistic
z = [x y];                        % pooled responses
dist = zeros(1, iter);            % holds results of simulation
for i=1:iter                      % loop over permutations of responses
zp = z(randperm(length(z)));  % random permutation of responses
dist(j) = sum(zp(1:length(y))); % value of the test statistic here
end;
p = sum(dist >= ts)/iter;         % simulated P-value, one-sided test
return;
```

Here is a terse R version of the same algorithm:

```        simPermuTest <- function(x, y, iter) {
ts <- sum(y)                 # test statistic
z <- c(x, y)                 # pooled responses
sum(replicate(iter, (sum(sample(z)[1:length(y)]) >= ts)))/iter
}
```

We have been working with the sum Y of the treatment responses. We could just have well worked with the difference between the mean of the treatment responses and the mean of the control responses:

D = (Y1 + Y2 + … + Yn)/n(X1 + X2 + … + Xm)/m,

where the X's are the control responses. This leads to a permutation test based on the difference in mean responses, instead of a permutation test based on the sum of the treatment responses. The two tests are equivalent—they reject for exactly the same data sets—because there is a monotonic, 1:1 mapping between values of Y and values of D: let x=x1 + x2 + … + xN be the total of the responses of all N individuals. Then

D = Y/nX/m

= Y/n − (xY)/m

= (1/n + 1/m) Yx/m.

Thus D can be computed from Y and the total response, and increases monotonically with Y. Specifying which n of the {xj} are treatment responses determines the sum of the treatment responses, and also determines the difference of mean responses. Increasing the sum of the trestment responses increases the mean of the treatment responses, and that increase is at the expense of the mean of the control responses.

### Power of the permutation test based on Y

Generally, we have to make pretty strong assumptions to find the power of the permutation test, and it is easier to do in the population model we study in the next chapter. However, consider the power against the shift alternative that treatment increases every subject's response by the same amount d. This alternative is rather far-fetched, but it is strong enough to specify the distribution of the test statistic Y: Under this alternative, the right number on each ticket is the left number on the ticket, plus d. Because we know either the left or right number for each ticket, we actually know both numbers on all the tickets: The right numbers on the tickets for the treatment group were observed; the left numbers on those tickets are the right numbers minus d. The left numbers on the tickets for the control group were observed; the right numbers on those tickets are the left numbers plus d. Under random assignment to control or treatment, we can find the probability distribution of the sum of the right numbers on n of the tickets. The difficulty in translating this sampling distribution into the power is that the critical value of Y for the test depends on the observations: The critical value is the 1−a quantile of the sampling distribution of Y on the assumption that the strong null hypothesis is true—conditional on the observed values—but that assumption leads to a different critical value depending on which subjects are assigned to treatment. The simulation thus needs to include the step of estimating the critical value, for each random assignment of subjects to treatment and control.

Here is a Matlab script to approximate the power of a permutation test based on Y against the shift alternative using nested simulations.

```        function p = simPermuSumPower(x, y, d, alpha, iter)
% function p = simPermuSumPower(x, y, d, alpha, iter)
%
% P.B. Stark, www.stat.berkeley.edu/~stark 9/11/05
% finds the approximate power of a one-sided level alpha permutation
% test based on the sample sum for the alternative hypothesis
% that treatment shifts the response by exactly d for all subjects.
%
% x is the vector of control observations
% y is the vector of treatment observations
% the lengths of x and y determine the sizes of the control and treatment
%  groups
% alpha is the significance level
% iter is the number of replications of each simulation--note: there are
%  two nested simulations.
p = 0;                            % simulated power
z = [x y-d];                      % control responses, under shift alternative
for i=1:iter                      % loop over sampling from responses
dist = zeros(1,iter);         % storage for empirical distribution
zp = z(randperm(length(z)));  % randomize
xp = zp(1:length(x));         % control responses in this iteration
yp = zp(length(x)+1:length(x)+length(y)) + d;  % treatment responses
for j = 1:iter                % loop to simulate critical value
zp = [xp yp];             % population of responses in this simulation
zp = zp(randperm(length(zp)));   % randomize
dist(j) = sum(zp(1:length(yp))); % value of the test statistic here
end;                          % end loop to simulate critical value
if (sum(dist >= sum(yp))/iter <= alpha)  % reject this time?
p = p+1;                  % yes; add a rejection
end;
end;
p = p/iter;                       % estimate of P(reject| shift alternative)
return;
```

Here is an R version of the same algorithm:

```        simPermuSumPower <- function(x, y, d, alpha, iter) {
# P.B. Stark, www.stat.berkeley.edu/~stark 10/27/05
# finds the approximate power of a one-sided level alpha permutation
# test based on the sample sum for the alternative hypothesis
# that treatment shifts the response by exactly d for all subjects.
z <- c(x, y-d)               # pooled responses, treatment effect removed
test <- function(z, n, d, alpha, iter) {
z[1:n] <- z[1:n] + d;    # restore treatment effect
ts <- sum(z[1:n]);       # test statistic
sum(replicate(iter, ( sum(sample(z)[1:n]) >= ts) )/iter) < alpha
}
sum(replicate(iter, test(sample(z), length(y), d, alpha, iter) ))/iter
}
```

We can simulate the distribution of Y under the shift alternative using the sampling applet as follows: Add d to each control response. Pool those m numbers and the n treatment responses; these are the right numbers on all N=n+m tickets. Put these N numbers into the "box" on the right side of the applet. The distribution of the sample sum of n numbers drawn at random from this population of numbers is the distribution of Y under the shift alternative. Note that this alternative hypothesis—that treatment adds exactly the same amount to the response for each subject—is absurdly restrictive. In the population model, discussed later, we can find the power under more realistic assumptions.

## Comparison with the 2-sample t-test

It is common to use the t-test to compare two groups. In the randomization model, the nominal significance level of the t-test derived from Student's t distribution can be off by a lot, especially when the control and treatment groups are small. The null hypothesis for the t-test is that the control and treatment responses are an iid random sample from a normal distribution with unknown mean and variance. In the randomization model, there is no reason to think that the subjects are a random sample from any population, much less one in which responses have a normal distribution. And in the randomization model, responses are not independent, because putting a subject into the treatment group excludes that subject from the control group, and vice versa.

By simulation, we can approximate the significance level of a 2-sample t-test in the randomization model. Here is R code that does it.

```        simPermuTSig <- function(x, y, sig, iter) {
# P.B. Stark, www.stat.berkeley.edu/~stark 11/9/05
# simulated true significance level of a nominal level-sig 1-sided
# 2-sample t-test under the randomization model.
z <- c(x, y)                 # pooled responses
n <- length(y)
N <- length(z)
sum(replicate(iter, {zp <- sample(z);
tst <- t.test(zp[1:n], zp[(n+1):N],
var.equal=TRUE,
alternative="greater");
as.integer(tst\$p.value <= sig)
}))/iter
}
```

When the control and treatment groups are large, provided the responses are not highly skewed, the significance level of the 2-sample t-test as a test of the strong null hypothesis in the randomization model is often quite close to the nominal level of the test, even though the null hypothesis of the t-test has little connection to the strong null hypothesis in the randomization model. However, the actual and nominal significance levels need not be close. For example, for the data we have been using—treatment responses 3 and 4 and control responses 1, 2, and 4—the significance of a nominal level 0.1 2-sample t-test is very close to 0.1. But the real significance level of a nominal level 0.05 2-sample t-test is also very close to 0.1 (not to 0.05). There are only 5C2=10 possible samples, and some of those are indistinguishable because the value 4 occurs twice among the responses. Thus the null distribution of the t-statistic under the strong null hypothesis in the randomization model must be quite chunky, while the Student t-distribution is continuous. The discreteness of the null distribution limits the number of significance levels that are attainable. Table illustrates the problem.

Of course, we could base a test on the sampling distribution of the 2-sample t-statistic under the randomization model, rather than on Student's t distribution. The resulting test is called a permutation t-test. It differs from permutation tests based on the sample sum or the difference of sample means, but often not by much.

Here is R code to find the simulated P-value of a 1-sided 2-sample permutation t-test.

```        simPermuTTest <- function(x, y, iter) {
tStat <- function(z, n) {
N <- length(z);
m <- N-n;
sPool <- sqrt(((m-1)*var(z[1:m])+(n-1)*var(z[(m+1):N]))/(N-2));
(mean(z[(m+1):N])-mean(z[1:m]))/(sPool*sqrt(1/n + 1/m))
}
n <- length(y);              # number of treated subjects
z <- c(x, y)                 # pooled responses
ts <- tStat(z,n)             # test statistic
sum(replicate(iter, (tStat(sample(z),n)>=ts)))/iter
}
```

## Fisher's Exact Test and its Normal Approximation

A special case of the permutation test based on the sample sum occurs when the only possible responses are 0 and 1. The distribution of the test statistic under the strong null hypothesis is then hypergeometric, which leads to Fisher's Exact test. The following material is adapted from SticiGui.

Suppose we own a start-up company that offers e-tailers a service for targeting their web advertising. Consumers register with our service by filling out a form indicating their likes and dislikes, gender, age, etc. We put "cookies" on their computers to keep track of who they are. When they get to the website of any of our clients, we use their likes and dislikes to select (from a collection of the client's ads) the ad we think they are most likely to respond to. The service is free to consumers; we charge the e-tailers.

We can raise venture capital if we can show that targeting makes e-tailers' advertisements more effective. To measure the effectiveness, we offer our service for free to a large e-tailer. The e-tailer has a collection of web ads that it usually uses in rotation: each time a consumer arrives at the site, the e-tailer's server selects the next ad in the sequence to show to the consumer, then starts over when it runs out of ads.

We install our software on the e-tailer's server, in the following way: each time a consumer arrives at the site, with probability 50% the server shows the consumer the ad our targeting software suggests, and with probability 50% the server shows the consumer the next ad in the rotation, the way the e-tailer used to choose which ad to show. For each consumer, the software records which strategy was used (target or rotation), and whether the consumer buys anything. We call the consumers who were shown the targeted ad the treatment group; we call the other consumers the control group. If a consumer visits the site more than once during the trial period, we ignore all of that consumer's visits but the first.

Suppose that N consumers visit the site during the trial, that n of them are assigned to the treatment group, that m of them are assigned to the control group, and that NS of the consumers buy something. In essence, we want to know whether there would have been more purchases if everyone had been shown the targeted ad than if everyone had been shown the control ad. Only some of the consumers saw the targeted ad, and only some saw the control ad, so answering this question involves extrapolating from the data to an hypothetical counterfactual situation. Of course, we really want to extrapolate further, to people who have not yet visited the site, to decide whether more product would be sold if those people are shown the targeted ad.

We can think of the experiment in the following way: the ith consumer has a ticket with two numbers on it: the first number (xi) is 1 if the consumer would have bought something if shown the control ad, and 0 if not. The second number (yi) is 1 if the consumer would have bought something if shown the targeted ad, and 0 if not. There are N tickets in all.

For each consumer i who visits the site, we observe either xi or yi, but not both. The percentage of consumers who would have made purchases if every consumer had been shown the control ads is

pc = ( x1 + x2 +  …  + xN )/N.

Similarly, the percentage of consumers who would have made purchases if every consumer had been shown the targeted ads is

pt = ( y1 + y2 +  …  + yN )/N.

Let

µ = ptpc

be the difference between the rate at which consumers would have bought had all of them been shown the targeted ad, and the rate at which consumers would have bought had all of them been in the control group. The null hypothesis, that targeting does not make a difference, is that µ = 0. (The strong null hypothesis is that xi = yi, for i=1, 2, … , N.) The alternative hypothesis, that targeting helps, is that µ > 0. We would like to test the null hypothesis at significance level 5%.

Let m be the number of consumers in the control group, and let n be the number of consumers in the treatment group, so

N = m + n.

Let NS be the total number of sales to the treatment and control groups. Let Y be the number of sales to consumers in the treatment group. Under the strong null hypothesis (which implies that µ = 0), for any fixed value of NS, Y has an hypergeometric distribution with parameters N, NS, and n (we consider N to be fixed):

P(Y=y) = NSCy × NNSCny/NCn.

This is a situation with many tied data, because there are only two possible responses for each subject: 0 if the subject does not buy anything, and 1 if the subject buys something. If the alternative hypothesis is true, Y will tend to be larger than it would if the null hypothesis be true, so we should design our test to reject the null hypothesis for large values of Y. That is, our rejection region should contain all values above some threshold value y; we will reject the null hypothesis if Y > y.

We cannot calculate the critical value y until we know N, n, and NS. Once we observe them, we can find the smallest value y so that the probability that Y is larger than y if the null hypothesis be true is at most 5%, the significance level we chose for the test. Our rule for testing the null hypothesis then would be to reject the null hypothesis if Y > y, and not to reject the null hypothesis otherwise. This is called Fisher's exact test for the equality of two percentages (against the one-sided alternative that treatment increases the response). It is a permutation test, and it is also essentially a (mid-) rank test (discussed in the next chapter), because there are only two possible values for each response.

### The Normal Approximation to Fisher's Exact Test

If N is large and n is neither close to zero nor close to N, computing the hypergeometric probabilities will be difficult, but the normal approximation to the hypergeometric distribution should be accurate provided NS is neither too close to zero nor too close to n. To use the normal approximation, we need to convert to standard units, which requires that we know the expected value and standard error of Y. The expected value of Y is

E(Y) = n × NS/N,

and the standard error of Y is

SE(Y) = f × n½ × SD,

where f is the finite population correction

f = ( (Nn)/(N − 1) )½,

and SD is the standard deviation of a list of N values of which NS equal one and (NNS) equal zero:

SD = ( NS/N × (1 − NS/N) )½.

In standard units, Y is

Z = (YE(Y))/SE(Y)

= ( Yn × NS/N )/(f × n½ × SD).

The area under the normal curve to the right of 1.645 standard units is 5%, which corresponds to the threshold value

y = E(Y) + 1.645 × SE(Y)

= n × NS/N + 1.645 × f × n½ × SD

in the original units, so if we reject the null hypothesis when

Z > 1.645

or, equivalently,

Y > n × NS/N + 1.645 × f × n½ × SD,

we have an (approximate) 5% significance level test of the null hypothesis that ad targeting and ad rotation are equally effective. This is the normal approximation to Fisher's exact test; Z is called the Z statistic, and the observed value of Z is called the z score.

## Fisher's Lady Tasting Tea Experiment

In his 1935 book, The Design of Experiments (London, Oliver and Boyd, 260pp.), Sir R.A. Fisher writes:

A LADY declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. …

Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of what the test will consist, namely that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order, that is in an order not determined arbitrarily by human choice, but by the actual manipulation of the physical apparatus used in games of chance, cards, dice, roulettes, etc., … Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.

There are 8C4 = 70 ways to distribute the four "milk-first" cups among the 8. Under the null hypothesis that the lady cannot taste any difference, her labeling of the 8 cups—4 milk-first and 4 tea-infusion-first—can be thought of as fixed in advance. The probability that her labeling exactly matches the truth is thus 1/70. A test that rejects the null hypothesis only when she matches all 8 cups has significance level 1/70 = 1.4%. If she misses one cup, she must in fact miss at least two, because she will have mislabeled a milk-first as tea-first, and vice versa. The possible numbers of "hits" are 0, 2, 4, 6, and 8. To get 6 hits, she must label as milk-first three of the four true milk-first cups, and must mislabel as milk-first one of the four tea-first cups. The number of arrangements that give exactly 6 hits is thus

4C3×4C1 = 16.

Thus if we reject the null hypothesis when she correctly identifies 6 of the 8 cups or all 8 of the 8 cups, the significance level of the test is (16+1)/70 = 24.3%. Such good performance is pretty likely to occur by chance—about as likely as getting two heads in two tosses of a fair coin—even if the lady and the tea are in different parlors.

There are other experiments we might construct to test this hypothesis. Lindley (1984, A Bayesian Lady Tasting Tea, in Statistics, an Appraisal, H.A. David and H.T. David, eds., Iowa State Univ. Press) lists two, which he attributes to Neyman:

1. Present the lady with n pairs of cups of tea with milk, where one cup in each pair (determined randomly) has milk added first and one has tea added first. Tell the lady that each pair has one of each kind of cup; ask her to identify which member of each pair had the milk added first. Count the number of pairs she categorizes correctly. (This approach is sometimes called two-alternative forced choice in the study of perception in the psychometric literature: each trial has two alternative responses, the subject is forced to pick one.)
2. Present the lady with n cups of tea with milk, each of which has probability 1/2 of having the milk added first and probability 1/2 of having the tea added first. Do not tell the lady how many of each sort of cup there are. Ask her to identify for each cup whether milk or tea was added first. Count the number of cups she categorizes correctly.

These experiments lead to different tests. A permutation test for the first of them is straightforward; to base a permutation test on the second requires conditioning on the number of cups she categorizes each way and on the number of cups that had the milk added first.

## Strong and weak null hypotheses

We are in the randomization model: each subject has a ticket with two numbers on it. Those 2×N numbers are the parameters of the model. If the subject is assigned to control, the number on the right is revealed. Assignment to treatment reveals the number on the left. The assignment is by simple random sample.

In the randomization model, the strong null hypothesis is that the two numbers on each ticket are equal: treatment makes no difference whatsoever, subject by subject.

This is not a very realistic hypothesis, but it leads to simple theory. If the strong null is true, when you see one number on each ticket, you know both numbers on each ticket—all 2N parameters in the model—because the two numbers on each ticket are equal. If the strong null is true, once the data have been collected, you know everything there is to know. Since tickets are put into treatment by simple random sampling, that completely specifies the probability distribution of every test statistic.

The strong null implies a variety of weaker null hypotheses—but those hypotheses do not completely specify the probability distribution of every test statistic. For example, if the strong null is true, so is the weaker null that the mean of all N subjects' responses to treatment is equal to the mean of all N subjects' responses to control: the mean of all the left numbers is equal to the mean of all the right numbers. That's a natural "weak null" when the alternative is that treatment tends to increase (or to decrease) the response. That weak null does not uniquely determine the probability distribution of the statistics we have considered, because it does not determine all the parameters in the model: under that weak null, we have N+1 constraints (we observe one number on each ticket, and we know that the two population means are the same)—but we need to have 2N constraints to determine all 2N parameters in the model.

Because the strong null plus the data determine all the parameters in the randomization model, under the strong null, it's easy to calculate the distribution of the test statistic, set critical values and find P-values, etc. If the strong null is true, so is the weak null—but the strong null can be false while the weak null is true. There are parameter values that satisfy the weak null but not the strong null. So, if we test the weak null using a test designed for the strong null, we may well reject too often—the significance level could be larger than we claim. For parameters that satisfy the weak null but not the strong null, the chance that the test statistic exceeds the nominal level-alpha critical value could be much larger than for any parameters that satisfy the strong null.

The "weakening" that is interesting depends partly on the alternative. The idea is that the weak null is implied by the strong null (hence, "weaker") but is false if the alternative is true (hence, would make a reasonable null in contrast to the alternative of interest).

If the alternative is that treatment tends to increase the response, a natural weakening is that the two population means are equal. (If the strong null holds, so does this weakening. If the alternative holds, this weak null is false.) If the alternative is that treatment tends to change the variability, a natural weakening is that the two population variances are equal. (Again, if the strong null is true, so is this weakening; but if the alternative holds, this weakening is false.) If the alternative is that treatment has any effect whatsoever on the distribution of responses, a natural weakening is that the cdfs of the two populations are equal. And so on.