The population model

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

So far, we have been considering the subjects to be a given collection of N individuals, assigned at random to treatment and control. This leads to simple probability calculations under the strong null hypothesis, because the only source of randomness is in the assignment, which behaves like a simple random sample of size n from a fixed population of N subjects. However, conclusions about the effect of treatment apply only to the N subjects studied. If we wish to generalize beyond these N to a larger population (which is often the goal), we need to know how the subjects in the study are related to that larger population.

In this chapter we consider what happens if the N subjects are themselves a simple random sample from that larger "parent" population. We shall assume that the parent population is much larger than N. Of the N subjects drawn from the parent population, a random sample of size n is assigned to treatment, and the other m = N−n are assigned to control. As before, let {Y1, … , Yn} denote the responses of the subjects assigned to treatment and let {X1, … , Xm} denote the responses of the subjects assigned to control. Because the subjects are drawn at random from the parent population, the treatment responses all have the same probability distribution, and the control responses all have the same probability distribution.

The responses are not strictly independent because the subjects are drawn without replacement from the larger population, but if the larger population is large enough compared to N, it is reasonable to treat the responses as if they were independent. Under this assumption, we have the population model for comparing two treatments.

Let FX denote the cumulative distribution function (cdf) of the control responses and let FY denote the cdf of the treatment responses. The strong null hypothesis in this context is that FX=FY. A weaker null hypothesis is that EX=EY, that is, that the expected values of the two distributions are the same. If FX and FY are continuous distributions then the probability of ties among the data is zero.

Let N=n+m. Suppose {X1, … , Xm, Y1, … , Yn} are independent, identically distributed continuous random variables, and let {S1, … , Sn} be the ranks of the Y's among the N variables. Then for any subset {s1, … , sn} of size n of the integers {1, 2, … , N},

P(S1 = s1, S2 = s2, … , Sn = sn) = 1/(NCn).

For a proof, see Lehmann (1998, p. 58). The crucial idea is that under the strong null hypothesis, the joint distribution of all N variables is symmetric (exchangeable), so every set of n of the N ranks is equally likely to be the set of ranks of the Y's.
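
To make the combinatorial claim concrete, here is a small simulation sketch in Python (using numpy; the sample sizes and the choice of a normal parent distribution are arbitrary illustrations, not part of the argument). It draws N iid observations, takes the last n to play the role of the treatment responses, and checks that each of the NCn possible sets of treatment ranks occurs with relative frequency close to 1/(NCn).

    # Simulation sketch: uniformity of the treatment ranks under the strong null.
    # Assumptions: numpy available; m, n, and the normal parent are illustrative.
    from itertools import combinations
    from math import comb
    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 3, 2
    N = m + n
    counts = {s: 0 for s in combinations(range(1, N + 1), n)}
    reps = 100_000
    for _ in range(reps):
        z = rng.standard_normal(N)            # any continuous F will do under the null
        ranks = z.argsort().argsort() + 1     # ranks 1..N of the pooled sample
        counts[tuple(sorted(int(r) for r in ranks[m:]))] += 1   # ranks of the "treated" responses

    for s, count in counts.items():
        print(s, round(count / reps, 4), "   theory:", round(1 / comb(N, n), 4))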

Thus all the results about the null distribution of rank-based statistics (such as the Wilcoxon rank-sum) derived for the randomization model also apply to the population model under the assumption of random sampling and random assignment, provided the parent population is sufficiently large and the distribution of the data is continuous. The resulting tests remain nonparametric: the significance levels do not depend on the form of F=FX=FY.

When the distribution of the data is not continuous, and the probability of ties among the data is greater than zero, the null distribution of the midranks indeed depends on F=FX=FY, because F determines the null probability of tied observations. If we condition on the list of midranks, however, the conditional distribution of the allocation of midranks to treatment and control is the same as it is for the randomization model, so we can construct tests at conditional significance levels using the same computations as before. The unconditional significance level of such tests is a weighted average of the conditional significance levels, because if {Aj: j=1, … , K} is a partition of the whole outcome space, then for any event A (and in particular the event that we commit a type I error),

P(A) = P(A1)×P(A|A1) + P(A2)×P(A|A2) + … + P(AK)×P(A|AK).

The unconditional significance level is thus between the smallest and largest conditional significance levels, but it depends on unknown properties of the population distribution F, because the configuration of ties determines the conditional significance levels that are attainable. If there are few observations, a large proportion of which are tied, only large (conditional) significance levels are attainable. The set of attainable conditional significance levels tends to get denser as the sample sizes m and n increase, making it possible to get more nearly unconditional tests. The conditional significance levels are asymptotically unconditional as m and n both increase.

The population model can be used to test the hypothesis that FX=FY even when the assignment to "treatment" is not at random, but is just a labeling of the subjects. However, even if we reject the null hypothesis, we cannot infer why the two distributions differ. For example, suppose we want to test the hypothesis that, in some population, getting more than 50% of one's calories from fat is associated with obesity, as defined by body mass index, BMI. We could take a random sample of people from that population, monitor their diets to determine whether their dietary fat intake exceeds 50% of their daily caloric intake, and measure their BMI. We can think of this as taking random samples from the subpopulation of people with low dietary fat intake and the subpopulation of people with high dietary fat intake, and measuring a binary response (obesity) for each subject—the two sample sizes are random, but they sum to N. We could use Fisher's exact test to assess (conditionally on the overall incidence of obesity and on n and m) whether the incidence of obesity differs between the two subpopulations. Suppose we conclude that the incidence of obesity does differ. We still should not conclude that high dietary fat intake causes obesity, because many other factors are likely to differ between the two groups, and the "assignment" of subjects to low or high dietary fat intake was not at random—the subjects chose their own diets.
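
For concreteness, here is a sketch of Fisher's exact test for such a 2×2 table using scipy; the counts are invented purely for illustration and are not data from any study.

    # Sketch of Fisher's exact test for the diet example; the table entries are
    # hypothetical counts, invented only to show the mechanics.
    from scipy.stats import fisher_exact

    table = [[18, 32],   # high-fat diet: 18 obese, 32 not obese (invented)
             [ 9, 41]]   # low-fat diet:   9 obese, 41 not obese (invented)
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print("odds ratio:", round(odds_ratio, 2), "  one-sided P-value:", round(p_value, 4))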

Lehmann (1998, pp. 64–65) identifies five models for five problems, all of which lead to the same (conditional) null distribution of ranks:

  1. Randomization model for comparing two treatments. N subjects are given and fixed; n are assigned at random to treatment and m=N−n to control.
  2. Population model for comparing two treatments. N subjects are drawn as a simple random sample from a much larger population; n are assigned at random to treatment and m=N−n to control. Condition on the configuration of ties if there are ties among the data.
  3. Comparing two sub-populations using a sample from each. A simple random sample of n subjects is drawn from one population with many more than n members, and a simple random sample of m subjects is drawn from another population with many more than m members. Condition on the configuration of ties if there are ties among the data.
  4. Comparing two sub-populations using a sample from the pooled population. A simple random sample of N subjects is drawn from the pooled population, giving random samples from the two populations but with random sample sizes. Condition on the sample sizes and, if there are ties, on the configuration of the ties.
  5. Comparing two sets of measurements. Independent sets of n and m measurements come from two sources. Condition on the configuration of ties if there are ties among the data.

Power of the Wilcoxon rank-sum test

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

To find the power of the Wilcoxon rank sum test in the randomization model, we needed to assume very specifically how treatment affected each subject, for example, that treatment adds the same amount to each subject's response. The resulting power calculation depended on the actual observed response values. In contrast, in the population model, we can compute the power against some alternatives without knowing the responses, because the distributions FX and FY tell us what to expect the data to be like.

To compute the power, it is most convenient to work with the Mann-Whitney form of the Wilcoxon rank-sum test, which uses the statistic WXY. Recall that

WXY = #{ (i, j) :   1 ≤ i ≤ m, 1 ≤ j ≤ n, and Xi < Yj}

is the number of (control, treatment) pairs with the control response less than the treatment response. Let c denote the critical value for the one-sided test based on the Mann-Whitney statistic WXY, so we reject the null hypothesis if

WXY ≥ c.

Suppose that the significance level of this test is α, so that under the strong null hypothesis,

P(WXY ≥ c) = α.
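
The following Python sketch (numpy; invented data; sample sizes and α chosen arbitrarily) computes WXY by counting pairs and approximates the critical value c by Monte Carlo, re-labeling the pooled data at random, which is justified by the rank argument above. Because the null distribution is discrete, only certain significance levels are attainable, so the attained level is at most α.

    # Sketch: the Mann-Whitney statistic and an approximate critical value.
    # Assumptions: numpy available; the data are invented for illustration.
    import numpy as np

    def w_xy(x, y):
        """Number of (control, treatment) pairs with X_i < Y_j."""
        x, y = np.asarray(x), np.asarray(y)
        return int((x[:, None] < y[None, :]).sum())

    rng = np.random.default_rng(1)
    x = rng.normal(size=8)      # hypothetical control responses
    y = rng.normal(size=6)      # hypothetical treatment responses
    m, n = len(x), len(y)
    pooled = np.concatenate([x, y])

    # Approximate the null distribution of W_XY by randomly re-labeling the
    # pooled data; by the rank argument, it depends only on m and n.
    null = np.array([w_xy(p[:m], p[m:])
                     for p in (rng.permutation(pooled) for _ in range(20_000))])

    alpha = 0.05
    tail = np.array([(null >= c).mean() for c in range(m * n + 1)])
    c = int(np.flatnonzero(tail <= alpha)[0])   # smallest c with estimated P(W_XY >= c) <= alpha
    print("observed W_XY =", w_xy(x, y), "  approximate critical value c =", c)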

Consider the shift alternative, in which treatment increases every subject's response by the same amount d>0. It follows that

FY(y) = FX(y−d).

Under this alternative, if we know FX and d, we know FY. Consider the power of the Wilcoxon rank-sum test against this alternative, as a function of d, keeping FX fixed:

ΦFX(d) = P(WXY ≥ c | FY(y) = FX(y−d) ).

This is well defined even when d ≤ 0. We hope that ΦFX(d), the power of the test against the shift alternative, increases as d increases—that the test has a better chance of finding a large effect than a small effect—and this is indeed true.

Theorem. (Lehmann, 1998, p. 67) ΦFX(d) is monotonically nondecreasing in d.

Proof. Suppose d1 < d2. Let X1, … , Xm be iid FX and let Y1, … , Yn be iid with cdf FX(y−d1). Let Zj = Yj + (d2−d1), j = 1, … , n. Note that

FZ(y) = P(Zj ≤ y) = P(Xi ≤ y − (d1 + (d2−d1)) ) = P(Xi ≤ y − d2 ) = FX(y−d2).

Thus

ΦFX(d2) = P(WXZ ≥ c).

But point by point, Zj > Yj, so WXZ ≥ WXY, and thus ΦFX(d2) ≥ ΦFX(d1).

It follows from the theorem that the Wilcoxon rank-sum test is unbiased against shift alternatives with positive shift d: the power against all such alternatives is at least α, the significance level of the test. (A test is unbiased if the chance of correctly rejecting the null hypothesis when the alternative is true is at least as large as the chance of a type I error—erroneously rejecting the null hypothesis when the null hypothesis is true.) The result also shows that the power against shift alternatives with negative shift is no larger than the significance level; i.e., that the one-sided Wilcoxon test is actually a significance level α test of the null hypothesis H: d ≤ 0 against the alternative H': d > 0.
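
A Monte Carlo sketch of the power function ΦFX(d) for one illustrative choice of FX (standard normal) follows; the sample sizes and the critical value c are hypothetical, and the monotonicity the theorem guarantees should be visible in the output.

    # Monte Carlo sketch of power versus shift d for a normal F_X.
    # Assumptions: numpy available; m, n, and c are hypothetical illustrations,
    # not values prescribed by the text.
    import numpy as np

    rng = np.random.default_rng(2)
    m, n, reps = 8, 6, 5_000
    c = 37            # hypothetical critical value for a level-alpha test with these m, n

    def w_xy(x, y):
        return int((x[:, None] < y[None, :]).sum())

    for d in (-0.5, 0.0, 0.5, 1.0, 2.0):
        hits = sum(w_xy(rng.normal(size=m), rng.normal(loc=d, size=n)) >= c
                   for _ in range(reps))
        print(f"d = {d:+.1f}   estimated power ≈ {hits / reps:.3f}")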

Essentially the same proof shows that if the treatment effect depends on the response but is nonnegative, the power of the Wilcoxon test is at least α. That is, suppose that the response of a treated subject is distributed like X + d(X), where X has cdf FX and

d(x) ≥ 0 for all x.

Then P(WXY ≥ c) ≥ α. Similarly, if d(x) ≤ 0 for all x, then P(WXY ≥ c) ≤ α, so the Wilcoxon test is a significance level α test of the null hypothesis H: d(x) ≤ 0 for all x against the alternative hypothesis H': d(x) > 0 for all x.

A similar result holds whenever the response to treatment is stochastically larger than the response to control, i.e., whenever FY(x) ≤ FX(x) for all x.

Asymptotic Power and Comparison with Student's t Test

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

Define p1 = P(X < Y), where X and Y are independent random variables distributed as FX and FY respectively. When FX and FY are continuous distributions, the probability distribution of the standardized Mann-Whitney statistic is asymptotically a standard normal distribution provided p1 is strictly between 0 and 1. The expected value of WXY is mnp1; WXY/(mn) is an unbiased estimate of p1.

Define p2 = P( X < Y and X < Y') and p3 = P( X < Y and X' < Y), where X, X', Y and Y' are independent, X and X' have distribution FX, and Y and Y' have distribution FY. Then

Var(WXY) = mnp1(1−p1) + mn(n−1)(p2 − p1²) + mn(m−1)(p3 − p1²).
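
As a check on these formulas, the following sketch simulates the mean and variance of WXY for one hypothetical case (FX standard normal, FY a shift of it by d=1), computing p1 in closed form and estimating p2 and p3 by simulation.

    # Simulation check of E(W_XY) = mn p1 and the variance formula.
    # Assumptions: numpy and scipy available; the normal shift model with d = 1
    # is an illustrative choice, not part of the text's argument.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    m, n, d = 10, 12, 1.0

    p1 = norm.cdf(d / np.sqrt(2))                 # P(X < Y) for the normal shift model
    z = rng.standard_normal((200_000, 4))
    x1, x2, y1, y2 = z[:, 0], z[:, 1], z[:, 2] + d, z[:, 3] + d
    p2 = np.mean((x1 < y1) & (x1 < y2))           # estimate of P(X < Y and X < Y')
    p3 = np.mean((x1 < y1) & (x2 < y1))           # estimate of P(X < Y and X' < Y)

    mean_formula = m * n * p1
    var_formula = (m * n * p1 * (1 - p1)
                   + m * n * (n - 1) * (p2 - p1 ** 2)
                   + m * n * (m - 1) * (p3 - p1 ** 2))

    w = np.array([int((rng.standard_normal(m)[:, None]
                       < (rng.standard_normal(n) + d)[None, :]).sum())
                  for _ in range(20_000)])
    print("mean:", round(mean_formula, 2), "vs simulated", round(w.mean(), 2))
    print("variance:", round(var_formula, 2), "vs simulated", round(w.var(), 2))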

Both the expected value and variance of the Mann-Whitney statistic depend on properties of the data distributions that are in principle unknown. With additional assumptions, more can be said; there are also approximations. For example, if the critical value of a right-tailed test using the Mann-Whitney statistic is c, then the normal approximation (with continuity correction) to the power of the test against a fixed alternative G ≠ F (where F is the cdf of the control responses and G is the cdf of the treatment responses) is

Φ(F, G) ≈ 1 − z( (c − ½ − mnp1)/SE(WXY) ),

where z(x) is the standard normal cdf. In the shift model, for small shift d,

ΦFX(d) ≈ z( (12mn/(N+1))½ f*(0) d − z1−α ),

where z is the standard normal cdf, z1−α is the 1−α quantile of the standard normal distribution, and f* is the density of the difference between two independent random variables, each with distribution FX.
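
For a concrete (and hypothetical) illustration of the small-shift approximation: if FX is standard normal, then f* is the N(0, 2) density, so f*(0) = 1/(2√π). The sketch below evaluates the approximation for a few values of d.

    # Sketch: evaluating the small-shift power approximation for a standard
    # normal F_X. Assumptions: scipy available; m, n, alpha, and d are
    # illustrative values.
    import numpy as np
    from scipy.stats import norm

    m = n = 25
    N = m + n
    alpha = 0.05
    f_star_0 = 1 / (2 * np.sqrt(np.pi))   # density of X - X' at 0 when F_X is standard normal

    for d in (0.2, 0.5, 1.0):
        power = norm.cdf(np.sqrt(12 * m * n / (N + 1)) * f_star_0 * d - norm.ppf(1 - alpha))
        print(f"d = {d}: approximate power ≈ {power:.3f}")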

Student's t-test is as follows: reject the hypothesis that treatment has no effect when

(Ȳ − X̄)/( S(1/m + 1/n)½ ) ≥ c,

where X̄ is the sample mean of the control observations, Ȳ is the sample mean of the treatment observations, and S is the square root of the pooled sample variance

S² = ( (X1 − X̄)² + (X2 − X̄)² + … + (Xm − X̄)² + (Y1 − Ȳ)² + (Y2 − Ȳ)² + … + (Yn − Ȳ)² )/(m + n − 2).

The critical value c is the 1−α quantile of Student's t-distribution with m+n−2 degrees of freedom. Under the assumptions that the control and treatment observations are independent and normally distributed with a common variance, that the control observations have a common mean, and that the treatment observations have a common mean (which differs from the mean of the control observations by d), Student's t-test has level α and is the uniformly most powerful unbiased level α test of the hypothesis d=0 against the alternative d>0.
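
The sketch below computes the pooled two-sample t statistic just described for invented data and checks it against scipy.stats.ttest_ind with equal_var=True; the data, sample sizes, and level are arbitrary.

    # Sketch: pooled two-sample t statistic, checked against scipy.
    # Assumptions: numpy and scipy available; the data are invented.
    import numpy as np
    from scipy.stats import ttest_ind, t as t_dist

    rng = np.random.default_rng(4)
    x = rng.normal(size=10)            # hypothetical control observations
    y = rng.normal(loc=0.8, size=12)   # hypothetical treatment observations

    m, n = len(x), len(y)
    s2 = (np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)) / (m + n - 2)
    t_stat = (y.mean() - x.mean()) / np.sqrt(s2 * (1 / m + 1 / n))
    c = t_dist.ppf(1 - 0.05, df=m + n - 2)    # 1 - alpha quantile, m + n - 2 degrees of freedom

    print("t statistic:", round(t_stat, 4), "  scipy:", round(ttest_ind(y, x, equal_var=True).statistic, 4))
    print("reject at level 0.05?", t_stat >= c)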

Student's t-test is asymptotically distribution-free: even when FX is not normal, under the null hypothesis that FX = FY, provided the observations are iid and the variance of FX is finite, the distribution of the t-test statistic is asymptotically normal with mean zero and variance one.

Suppose that FX is given and consider the shift alternative with shift d. Let n=m and fix the significance level α. Let Φ denote the power of the Wilcoxon test, and consider the sample size n'=m' that would be required for Student's t-test to have power Φ for the same level α, the same shift d, and the same distribution FX for the control observations. The ratio e=n'/n is called the efficiency of the Wilcoxon test relative to Student's t-test. Low efficiency means many more observations are required for the Wilcoxon test to have the same performance (significance level and power) as Student's t-test.

Perhaps surprisingly, when FX is normal, the efficiency of the Wilcoxon test is pretty large: Table 2.2 of Lehmann (1998) shows that for n=m=25, α=0.01, and d between about 0.14 and 1.3, the approximate efficiency is between 0.85 and 0.96. For n=m=5 and α=4/126 and d between 0.5 and 4, the efficiency is between about 0.96 and 0.98.

Thus (for this range of parameters) if the true distribution of the control and treatment observations really is normal, little is lost in using the nonparametric Wilcoxon test, while if the true distribution is not normal, the Wilcoxon test has the advantage of maintaining level α, whereas Student's t-test can have a radically different true level.

When the distribution FX is not normal, the Wilcoxon test can have better performance than Student's t-test. Asymptotically in n and m, the efficiency does not depend on the target power or the significance level, provided the significance level is less than the power (the limit is called the Pitman efficiency or the asymptotic relative efficiency). The Pitman efficiency of the Wilcoxon test relative to Student's t-test is about 1.1 for the logistic distribution, 1.5 for the double exponential distribution, 1 for the uniform distribution, and 3 for the exponential distribution (Lehmann, 1998, p. 80). The performance of the Wilcoxon test is remarkably good (compared with that of Student's t-test) for long-tailed distributions. Moreover, the performance of the Wilcoxon test is never much worse than that of Student's t-test: provided the variance of FX is finite, the asymptotic relative efficiency is at least 0.864 (see Lehmann, 1998, p. 377).

The normal scores test

Reference: Lehmann (1998, pp. 96–97).

The idea of the normal scores test is to replace each observation by the expected value of the normal order statistic corresponding to the observation's rank, then to compute Student's t-statistic from those substituted values. The critical values for the resulting test are not the critical values of Student's t-distribution, but they have been tabulated. The remarkable result is that this test, which depends only on the ranks of the observations, has asymptotic (Pitman) efficiency of at least one relative to Student's t-test, whatever FX may be. The asymptotic relative efficiency is one when FX is normal.

Estimating the shift d

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

Assume that the effect of treatment is just a shift d. Then the distribution of {Xj: j = 1, … , m} is the same as that of {(Yj − d): j = 1, … , n}. Consider a two-sided test of H: d=0. Suppose that the null distribution of WXY is symmetric about some value μ, so we would reject H if WXY ≥ μ+c or if WXY ≤ μ−c. Then a reasonable estimate of d is the value d' for which the statistic WX,Y−d' is closest to μ. This value d' turns out to be the median of the mn differences Dij = Yj − Xi, if we define the median of an even number of observations to be the mean of the two central observations in the sorted list.
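
A minimal sketch of this estimator (the Hodges-Lehmann estimator) for invented data: form all mn differences Yj − Xi and take their median.

    # Sketch: the shift estimator d' as the median of all pairwise differences.
    # Assumptions: numpy available; the data and the true shift are invented.
    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.normal(size=9)               # hypothetical control responses
    y = rng.normal(loc=1.5, size=11)     # hypothetical treatment responses, true shift 1.5

    differences = (y[None, :] - x[:, None]).ravel()   # all D_ij = Y_j - X_i
    d_hat = np.median(differences)                    # numpy averages the two central values when mn is even
    print("estimated shift d' ≈", round(float(d_hat), 3))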

Lehmann (1998) shows that the distribution of the error of this estimator, d'−d, does not depend on d:

d'−d = med(Yj − Xi) − d

= med(Yj − d − Xi)

= med((Yj − d) − Xi),

but the distributions of (Yj − d) and Xi don't depend on d.

The estimator d' is unbiased and median unbiased (at least) if the distribution FX is symmetric or if n=m, because then the distribution of d' is symmetric about d.

Confidence intervals for the shift d

The approach here is based on the duality between tests and confidence sets. Suppose we have a family of level α tests of the null hypotheses Hd': d=d', one for each real d'. Let S(y) denote the set of values d' for which the test of the corresponding hypothesis d=d' would not reject if the data were y (here y is generic; in the problems we have been considering, y specifies all the treatment responses and all the control responses). Then S(Y) is a 1−α confidence set for d.

Proof. If f is the true value of d, then the probability (under f) that the corresponding test rejects the hypothesis d=f is at most α. The probability that the test does not reject is therefore at least 1−α, so with probability at least 1−α under f, the value f is in S(Y). That is,

Pf ( S(Y) contains f ) ≥ 1−α.

The shape and size of the confidence set are tied to the family of hypothesis tests that are being "inverted." No matter what, the procedure gives a 1−α confidence set, but it can be large or bizarre (e.g., consist of disjoint pieces) according to the nature of the tests. The duality between tests and confidence sets can be exploited to produce confidence sets with desirable properties; see, for example, Benjamini and Stark (1996. Non-equivariant simultaneous confidence intervals less likely to contain zero, J. Amer. Stat. Assoc., 91, 329–337); Benjamini, Hochberg and Stark (1998. Confidence Intervals with more Power to determine the Sign: Two Ends constrain the Means, J. Amer. Stat. Assoc., 93, 309–317); or Evans, Hansen and Stark (2005. Minimax Expected Measure Confidence Sets for Restricted Location Parameters, Bernoulli, 11, 571–590).

To use this result to find a confidence set for d, we need a family of level α tests of the hypotheses d=d'.

Consider the Mann-Whitney statistic WXY. Under the strong null hypothesis, the expected value of WXY is mn/2. Suppose that the critical values for the two-sided test of the hypothesis that d=0 are mn/2−c and mn/2+c. Note that under the hypothesis that d=d', the distribution of the Mann-Whitney statistic for the data

{X1, … , Xm, Y1−d', … , Yn−d'},

which we denote WX,Y−d', is the same as the distribution of WXY under the strong null hypothesis. A two-sided level-α test of the hypothesis d=d' thus would reject if WX,Y−d' ≤ mn/2−c or if WX,Y−d' ≥ mn/2+c. This is a family of tests on which we can base the confidence set.

We can find a confidence interval for d using these tests as follows: Let d' be the estimated treatment effect, the median of the mn differences Dij. Subtract d' from the treatment observations; compute the ranks of this new data set; perform a two-sided Wilcoxon rank sum test of the null hypothesis of no treatment effect using these new data. If the test does not reject, include d' in the confidence set. Increase d' systematically until the test rejects, to get the upper endpoint of the confidence interval; decrease d' systematically until the test rejects, to get the lower endpoint of the confidence interval.

It is clear that the only values of d' that could be endpoints of the confidence interval are those that cause the ranks of the treatment observations to change—only those values change the value of WXY. We can exploit this to find a confidence interval for d with less trial-and-error searching.

To change the value of WX,Y−d', we need to change d' by enough to change the number of pairs (i, j) such that Xi < Yj − d'. The values of d' that change the ranks are precisely the differences

Dij =  Yj − Xi.

Let D(1), … , D(mn) denote the order statistics of the mn differences Dij, and define D(0) = −∞ and D(mn+1)= ∞. Then

D(k) ≤ d < D(k+1) if and only if WX,Y−d = mn − k.

To see this, note that WX,Y−d counts the pairs (i, j) with Xi < Yj − d, that is, the differences Dij that exceed d. Thus WX,Y−d = mn − k exactly when k of the differences are at most d and mn − k of them exceed d, that is, when D(k) ≤ d < D(k+1). The distribution of WX,Y−d when d is the true shift does not depend on d—it is the same as the distribution of WXY under the strong null hypothesis, as asserted previously. Using the symmetry of the null distribution of WXY, we see that for any d,

Pd( D(k) ≤ d < D(k+1) ) = P0(WXY = mn − k) = P0(WXY = k).

Thus the random differences Dij carve the real line into random intervals with known probabilities of containing the true value of d (if the shift alternative is true!). To find a confidence interval for d, we can start with (one of) the intervals [D(k), D(k+1)) with the largest chance of containing d, then append intervals in decreasing order of their probability of containing d until the sum of their probabilities is at least 1−α. Because of the unimodality of the distribution (recall the power calculations), these intervals will be contiguous; let [D(l), D(u)) denote their union. Alternatively, we could choose the intervals to try to have symmetric non-coverage probabilities; that is, so that

Pd(D(l) > d ) is as close as possible to Pd(D(u) < d ).

(I don't know whether these two approaches differ; I think the former corresponds to inverting the two-sided Wilcoxon test.) All the probability calculations are implicit in calculating the distribution of WXY under the strong null hypothesis. In fact, one need not even calculate the entire distribution; see Lehmann (1998, pp. 92ff) for shortcuts.
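
Here is a sketch of the whole construction for invented data: it computes the exact null distribution of WXY by enumerating the NCn equally likely rank configurations (feasible for small m and n), then collects the intervals [D(k), D(k+1)) in decreasing order of probability until their total probability is at least 1−α. The data, sample sizes, and α are arbitrary, and the contiguity of the selected block is checked by an assertion rather than assumed.

    # Sketch: confidence interval for the shift d from the ordered pairwise
    # differences, using the exact null distribution of W_XY.
    # Assumptions: numpy available; data, m, n, and alpha are illustrative.
    from itertools import combinations
    from math import comb
    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=7)                   # hypothetical control responses
    y = rng.normal(loc=1.0, size=6)          # hypothetical treatment responses
    m, n = len(x), len(y)
    N, mn = m + n, m * n
    alpha = 0.05

    # Exact null pmf of W_XY: if the treatment ranks are s_1 < ... < s_n, then
    # W_XY = sum_j (s_j - j); each of the C(N, n) rank sets is equally likely.
    pmf = np.zeros(mn + 1)
    for s in combinations(range(1, N + 1), n):
        pmf[sum(r - j for j, r in enumerate(s, start=1))] += 1
    pmf /= comb(N, n)

    # P_d(D_(k) <= d < D_(k+1)) = P_0(W_XY = k): keep the most probable k's
    # until their total probability is at least 1 - alpha.
    order = np.argsort(pmf)[::-1]
    chosen, total = [], 0.0
    for k in order:
        chosen.append(k)
        total += pmf[k]
        if total >= 1 - alpha:
            break
    assert max(chosen) - min(chosen) + 1 == len(chosen)   # contiguous block, by unimodality
    lo, hi = min(chosen), max(chosen) + 1                  # interval [D_(lo), D_(hi))

    D = np.concatenate([[-np.inf], np.sort((y[None, :] - x[:, None]).ravel()), [np.inf]])
    print("confidence level ≈", round(total, 3))
    print("interval for d: [", round(D[lo], 3), ",", round(D[hi], 3), ")")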

Confidence intervals for quantiles from iid observations

The approach illustrated above for finding a confidence interval for the shift d is very similar to a nonparametric approach to finding confidence intervals for a quantile of a probability distribution. Let {Xj: 1 ≤ j ≤ n} be iid with continuous cdf F. For 0<q<1, let xq = F−1(q) denote the qth quantile of the common distribution of the data. We shall find a confidence interval for xq that does not depend on any other assumptions about F.

If xq=x, the probability that each datum Xj will be at most x is q. The data are independent, so the number N(x) of data that are less than or equal to x has a binomial distribution with parameters n and q. Let

pn,q(k) = nCk × q^k × (1−q)^(n−k) = Pxq=x( N(x) = k ),   k = 0, 1, … , n.

Note that Pxq=x( N(x) = k ) does not depend on x, nor on any other feature of F. The probability distribution of N(x) is unimodal; we could build a level α test of the hypothesis that xq=x by assigning to the rejection region the values of k for which pn,q(k) is smallest, subject to the constraint that the sum of their probabilities is at most α.

To find a confidence interval for xq, we could start with the kth order statistic X(k) as a trial value for x, where k is as close as possible to q×n, then work "outward" as before, increasing and decreasing x until the test of the hypothesis xq=x rejects.

We can streamline the approach in the same way as we did to find a confidence interval for the treatment effect d. The values of x at which N(x) changes are the order statistics X(j), so it suffices to consider them when searching for the endpoints of the confidence interval. Now

N(x) = k   iff   X(k) ≤ x < X(k+1),   k = 0, 1, … , n,

where X(0) = −∞ and X(n+1) = ∞. Thus

Pxq( X(k) ≤ xq < X(k+1) ) = nCk × q^k × (1−q)^(n−k).

The order statistics carve the real line into intervals with known chances of containing the true value of xq. Those chances are determined by the binomial distribution with parameters n and q. We can find a confidence interval for xq by starting with the interval [X(k), X(k+1)) with the largest chance of containing xq, then appending intervals in decreasing order of their probability of containing xq until the sum of their probabilities is at least 1−α.

For the median, q=0.5, and the probability distribution of the number of data that are less than or equal to xq is symmetric about n/2. Consider a confidence interval for the median on the basis of n=20 iid observations. For the binomial distribution with parameters n=20 and p=0.5, the probability of between 8 and 12 successes is 73.7%. Eight "successes" corresponds to the interval [X(8), X(9)) containing the true median; twelve successes corresponds to the interval [X(12), X(13)) containing the median, so the interval [X(8), X(13)), from the eighth order statistic to the thirteenth order statistic, is a 73.7% confidence interval for the median of the parent distribution on the basis of 20 iid observations. Note that this confidence interval trims off the 7 smallest and 7 largest observations.
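
The sketch below implements this construction for a general quantile q and invented data; numpy and scipy are assumed. With n=20, q=0.5, and a requested level of 70%, it reproduces the interval [X(8), X(13)) with exact coverage 73.7% described above.

    # Sketch: nonparametric confidence interval for the q-th quantile from the
    # order statistics, via the binomial distribution.
    # Assumptions: data are iid with a continuous cdf; the sample is invented.
    import numpy as np
    from scipy.stats import binom

    def quantile_ci(data, q=0.5, level=0.95):
        """Return a (lower, upper) order-statistic interval for x_q and its exact coverage."""
        x = np.sort(np.asarray(data, dtype=float))
        n = len(x)
        pmf = binom.pmf(np.arange(n + 1), n, q)     # P(N(x_q) = k), k = 0..n
        order = np.argsort(pmf)[::-1]               # most probable k first
        chosen, total = [], 0.0
        for k in order:
            chosen.append(k)
            total += pmf[k]
            if total >= level:
                break
        lo, hi = min(chosen), max(chosen) + 1       # interval [X_(lo), X_(hi))
        padded = np.concatenate([[-np.inf], x, [np.inf]])   # X_(0) = -inf, X_(n+1) = +inf
        return (padded[lo], padded[hi]), total

    rng = np.random.default_rng(7)
    sample = rng.normal(size=20)
    (lo, hi), coverage = quantile_ci(sample, q=0.5, level=0.7)
    print("interval:", (round(lo, 3), round(hi, 3)), "  exact coverage ≈", round(coverage, 3))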


Exercise. Generalize this approach to finding confidence intervals for quantiles to find conservative confidence intervals for quantiles when the cdf F is not necessarily continuous.