The population model

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

So far, we have been considering the subjects to be a given collection of N individuals, assigned at random to treatment and control. This leads to simple probability calculations under the strong null hypothesis, because the only source of randomness is in the assignment, which behaves like a simple random sample of size n from a fixed population of N subjects. However, conclusions about the effect of treatment apply only to the N subjects studied. If we wish to generalize beyond these N to a larger population (which is often the goal), we need to know how the subjects in the study are related to that larger population.

In this chapter we consider what happens if the N subjects are themselves a simple random sample from that larger "parent" population. We shall assume that the parent population is much larger than N. Of the N subjects drawn from the parent population, a random sample of size n is assigned to treatment, and the other m = N−n are assigned to control. As before, let {Y1, … , Yn} denote the responses of the subjects assigned to treatment and let {X1, … , Xm} denote the responses of the subjects assigned to control. Because the subjects are drawn at random from the parent population, the treatment responses all have the same probability distribution, and the control responses all have the same probability distribution.

The responses are not strictly independent because the subjects are drawn without replacement from the larger population, but if the larger population is large enough compared to N, it is reasonable to treat the responses as if they were independent. Under this assumption, we have the population model for comparing two treatments.

Let FX denote the cumulative distribution function (cdf) of the control responses and let FY denote the cdf of the treatment responses. The strong null hypothesis in this context is that FX=FY. A weaker null hypothesis is that EX=EY, that is, that the expected values of the two distributions are the same. If FX and FY are continuous distributions then the probability of ties among the data is zero.

Let N=n+m. Suppose {X1, … , Xm, Y1, … , Yn} are independent, identically distributed continuous random variables, and let {S1, … , Sn} be the ranks of the Y's among the N variables. Then for any subset {s1, … , sn} of size n of the integers {1, 2, … , N},

P(S1 = s1, S2 = s2, … , Sn = sn) = 1/(NCn).

For a proof, see Lehmann (1998, p. 58). The crucial idea is that under the strong null hypothesis, the joint distribution of all N variables is symmetric (exchangeable), so every set of n of the N ranks is equally likely to be the set of ranks of the Y's.
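
To make the combinatorial claim concrete, here is a small simulation sketch in Python (using numpy; the sample sizes and the choice of a normal parent distribution are arbitrary illustrations, not part of the argument). It draws N iid observations, takes the last n to play the role of the treatment responses, and checks that each of the NCn possible sets of treatment ranks occurs with relative frequency close to 1/(NCn).

    # Simulation sketch: uniformity of the treatment ranks under the strong null.
    # Assumptions: numpy available; m, n, and the normal parent are illustrative.
    from itertools import combinations
    from math import comb
    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 3, 2
    N = m + n
    counts = {s: 0 for s in combinations(range(1, N + 1), n)}
    reps = 100_000
    for _ in range(reps):
        z = rng.standard_normal(N)            # any continuous F will do under the null
        ranks = z.argsort().argsort() + 1     # ranks 1..N of the pooled sample
        counts[tuple(sorted(int(r) for r in ranks[m:]))] += 1   # ranks of the "treated" responses

    for s, count in counts.items():
        print(s, round(count / reps, 4), "   theory:", round(1 / comb(N, n), 4))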

Thus all the results about the null distribution of rank-based statistics (such as the Wilcoxon rank-sum) derived for the randomization model also apply to the population model under the assumption of random sampling and random assignment, provided the parent population is sufficiently large and the distribution of the data is continuous. The resulting tests remain nonparametric: the significance levels do not depend on the form of F=FX=FY.

When the distribution of the data is not continuous, and the probability of ties among the data is greater than zero, the null distribution of the midranks indeed depends on F=FX=FY, because F determines the null probability of tied observations. If we condition on the list of midranks, however, the conditional distribution of the allocation of midranks to treatment and control is the same as it is for the randomization model, so we can construct tests at conditional significance levels using the same computations as before. The unconditional significance level of such tests is a weighted average of the conditional significance levels, because if {Aj: j=1, … , K} is a partition of the whole outcome space, then for any event A (and in particular the event that we commit a type I error),

P(A) = P(A1)×P(A|A1) + P(A2)×P(A|A2) + … + P(AK)×P(A|AK).

The unconditional significance level is thus between the smallest and largest conditional significance levels, but it depends on unknown properties of the population distribution F, because the configuration of ties determines the conditional significance levels that are attainable. If there are few observations, a large proportion of which are tied, only large (conditional) significance levels are attainable. The set of attainable conditional significance levels tends to get denser as the sample sizes m and n increase, making it possible to get more nearly unconditional tests. The conditional significance levels are asymptotically unconditional as m and n both increase.

The population model can be used to test the hypothesis that FX=FY even when the assignment to "treatment" is not at random, but is just a labeling of the subjects. However, even if we reject the null hypothesis, we cannot infer why the two distributions differ. For example, suppose we want to test the hypothesis that, in some population, getting more than 50% of one's calories from fat is associated with obesity, as defined by body mass index, BMI. We could take a random sample of people from that population, monitor their diets to determine whether their dietary fat intake exceeds 50% of their daily caloric intake, and measure their BMI. We can think of this as taking random samples from the subpopulation of people with low dietary fat intake and the subpopulation of people with high dietary fat intake, and measuring a binary response (obesity) for each subject—the two sample sizes are random, but they sum to N. We could use Fisher's exact test to assess (conditionally on the overall incidence of obesity and on n and m) whether the incidence of obesity differs between the two subpopulations. Suppose we conclude that the incidence of obesity does differ. We still should not conclude that high dietary fat intake causes obesity, because many other factors are likely to differ between the two groups, and the "assignment" of subjects to low or high dietary fat intake was not at random—the subjects chose their own diets.
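
For concreteness, here is a sketch of Fisher's exact test for such a 2×2 table using scipy; the counts are invented purely for illustration and are not data from any study.

    # Sketch of Fisher's exact test for the diet example; the table entries are
    # hypothetical counts, invented only to show the mechanics.
    from scipy.stats import fisher_exact

    table = [[18, 32],   # high-fat diet: 18 obese, 32 not obese (invented)
             [ 9, 41]]   # low-fat diet:   9 obese, 41 not obese (invented)
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print("odds ratio:", round(odds_ratio, 2), "  one-sided P-value:", round(p_value, 4))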

Lehmann (1998, pp. 64–65) identifies five models for five problems, all of which lead to the same (conditional) null distribution of ranks:

  1. Randomization model for comparing two treatments. N subjects are given and fixed; n are assigned at random to treatment and m=N−n to control.
  2. Population model for comparing two treatments. N subjects are drawn as a simple random sample from a much larger population; n are assigned at random to treatment and m=N−n to control. Condition on the configuration of ties if there are ties among the data.
  3. Comparing two sub-populations using a sample from each. A simple random sample of n subjects is drawn from one population with many more than n members, and a simple random sample of m subjects is drawn from another population with many more than m members. Condition on the configuration of ties if there are ties among the data.
  4. Comparing two sub-populations using a sample from the pooled population. A simple random sample of N subjects is drawn from the pooled population, giving random samples from the two populations but with random sample sizes. Condition on the sample sizes and, if there are ties, on the configuration of the ties.
  5. Comparing two sets of measurements. Independent sets of n and m measurements come from two sources. Condition on the configuration of ties if there are ties among the data.

Power of the Wilcoxon rank-sum test

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

To find the power of the Wilcoxon rank sum test in the randomization model, we needed to assume very specifically how treatment affected each subject, for example, that treatment adds the same amount to each subject's response. The resulting power calculation depended on the actual observed response values. In contrast, in the population model, we can compute the power against some alternatives without knowing the responses, because the distributions FX and FY tell us what to expect the data to be like.

To compute the power, it is most convenient to work with the Mann-Whitney form of the Wilcoxon rank-sum test, which uses the statistic WXY. Recall that

WXY = #{ (i, j) :   1 ≤ i ≤ m, 1 ≤ j ≤ n, and Xi < Yj}

is the number of (control, treatment) pairs with the control response less than the treatment response. Let c denote the critical value for the one-sided test based on the Mann-Whitney statistic WXY, so we reject the null hypothesis if

WXY ≥ c.

Suppose that the significance level of this test is α, so that under the strong null hypothesis,

P(WXY ≥ c) = α.
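
The following Python sketch (numpy; invented data; sample sizes and α chosen arbitrarily) computes WXY by counting pairs and approximates the critical value c by Monte Carlo, re-labeling the pooled data at random, which is justified by the rank argument above. Because the null distribution is discrete, only certain significance levels are attainable, so the attained level is at most α.

    # Sketch: the Mann-Whitney statistic and an approximate critical value.
    # Assumptions: numpy available; the data are invented for illustration.
    import numpy as np

    def w_xy(x, y):
        """Number of (control, treatment) pairs with X_i < Y_j."""
        x, y = np.asarray(x), np.asarray(y)
        return int((x[:, None] < y[None, :]).sum())

    rng = np.random.default_rng(1)
    x = rng.normal(size=8)      # hypothetical control responses
    y = rng.normal(size=6)      # hypothetical treatment responses
    m, n = len(x), len(y)
    pooled = np.concatenate([x, y])

    # Approximate the null distribution of W_XY by randomly re-labeling the
    # pooled data; by the rank argument, it depends only on m and n.
    null = np.array([w_xy(p[:m], p[m:])
                     for p in (rng.permutation(pooled) for _ in range(20_000))])

    alpha = 0.05
    tail = np.array([(null >= c).mean() for c in range(m * n + 1)])
    c = int(np.flatnonzero(tail <= alpha)[0])   # smallest c with estimated P(W_XY >= c) <= alpha
    print("observed W_XY =", w_xy(x, y), "  approximate critical value c =", c)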

Consider the shift alternative, in which treatment increases every subject's response by the same amount d>0. It follows that

FY(y) = FX(y−d).

Under this alternative, if we know FX and d, we know FY. Consider the power of the Wilcoxon rank-sum test against this alternative, as a function of d, keeping FX fixed:

ΦFX(d) = P(WXY ≥ c | FY(y) = FX(y−d) ).

This is well defined even when d ≤ 0. We hope that ΦFX(d), the power of the test against the shift alternative, increases as d increases—that the test has a better chance of finding a large effect than a small effect—and this is indeed true.

Theorem. (Lehmann, 1998, p. 67) ΦFX(d) is monotonically nondecreasing in d.

Proof. Suppose d1 < d2. Let X1, … , Xm be iid FX and let Y1, … , Yn be iid with cdf FX(y−d1). Let Zj = Yj + (d2−d1), j = 1, … , n. Note that

FZ(y) = P(Zj ≤ y) = P(Xi ≤ y − (d1 + (d2−d1)) ) = P(Xi ≤ y − d2 ) = FX(y−d2).

Thus

ΦFX(d2) = P(WXZ ≥ c).

But point by point, Zj > Yj, so WXZ ≥ WXY, and thus ΦFX(d2) ≥ ΦFX(d1).

It follows from the theorem that the Wilcoxon rank-sum test is unbiased against shift alternatives with positive shift d: the power against all such alternatives is at least α, the significance level of the test. (A test is unbiased if the chance of correctly rejecting the null hypothesis when the alternative is true is at least as large as the chance of a type I error—erroneously rejecting the null hypothesis when the null hypothesis is true.) The result also shows that the power against shift alternatives with negative shift is no larger than the significance level; i.e., that the one-sided Wilcoxon test is actually a significance level α test of the null hypothesis H: d ≤ 0 against the alternative H': d > 0.
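
A Monte Carlo sketch of the power function ΦFX(d) for one illustrative choice of FX (standard normal) follows; the sample sizes and the critical value c are hypothetical, and the monotonicity the theorem guarantees should be visible in the output.

    # Monte Carlo sketch of power versus shift d for a normal F_X.
    # Assumptions: numpy available; m, n, and c are hypothetical illustrations,
    # not values prescribed by the text.
    import numpy as np

    rng = np.random.default_rng(2)
    m, n, reps = 8, 6, 5_000
    c = 37            # hypothetical critical value for a level-alpha test with these m, n

    def w_xy(x, y):
        return int((x[:, None] < y[None, :]).sum())

    for d in (-0.5, 0.0, 0.5, 1.0, 2.0):
        hits = sum(w_xy(rng.normal(size=m), rng.normal(loc=d, size=n)) >= c
                   for _ in range(reps))
        print(f"d = {d:+.1f}   estimated power ≈ {hits / reps:.3f}")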

Essentially the same proof shows that if the treatment effect depends on the response but is nonnegative, the power of the Wilcoxon test is at least α. That is, suppose that the response of a treated subject is distributed like X + d(X), where X has cdf FX and

d(x) ≥ 0 for all x.

Then P(WXY ≥ c) ≥ α. Similarly, if d(x) ≤ 0 for all x, then P(WXY ≥ c) ≤ α, so the Wilcoxon test is a significance level α test of the null hypothesis H: d(x) ≤ 0 for all x against the alternative hypothesis H': d(x) > 0 for all x.

A similar result holds whenever the response to treatment is stochastically larger than the response to control, i.e., whenever FY(x) ≤ FX(x) for all x.

Asymptotic Power and Comparison with Student's t Test

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

Define p1 = P(X < Y), where X and Y are independent random variables distributed as FX and FY respectively. When FX and FY are continuous distributions, the probability distribution of the standardized Mann-Whitney statistic is asymptotically a standard normal distribution provided p1 is strictly between 0 and 1. The expected value of WXY is mnp1; WXY/(mn) is an unbiased estimate of p1.

Define p2 = P( X < Y and X < Y') and p3 = P( X < Y and X' < Y), where X, X', Y and Y' are independent, X and X' have distribution FX, and Y and Y' have distribution FY. Then

Var(WXY) = mnp1(1−p1) + mn(n−1)(p2 − p1²) + mn(m−1)(p3 − p1²).
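
As a check on these formulas, the following sketch simulates the mean and variance of WXY for one hypothetical case (FX standard normal, FY a shift of it by d=1), computing p1 in closed form and estimating p2 and p3 by simulation.

    # Simulation check of E(W_XY) = mn p1 and the variance formula.
    # Assumptions: numpy and scipy available; the normal shift model with d = 1
    # is an illustrative choice, not part of the text's argument.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    m, n, d = 10, 12, 1.0

    p1 = norm.cdf(d / np.sqrt(2))                 # P(X < Y) for the normal shift model
    z = rng.standard_normal((200_000, 4))
    x1, x2, y1, y2 = z[:, 0], z[:, 1], z[:, 2] + d, z[:, 3] + d
    p2 = np.mean((x1 < y1) & (x1 < y2))           # estimate of P(X < Y and X < Y')
    p3 = np.mean((x1 < y1) & (x2 < y1))           # estimate of P(X < Y and X' < Y)

    mean_formula = m * n * p1
    var_formula = (m * n * p1 * (1 - p1)
                   + m * n * (n - 1) * (p2 - p1 ** 2)
                   + m * n * (m - 1) * (p3 - p1 ** 2))

    w = np.array([int((rng.standard_normal(m)[:, None]
                       < (rng.standard_normal(n) + d)[None, :]).sum())
                  for _ in range(20_000)])
    print("mean:", round(mean_formula, 2), "vs simulated", round(w.mean(), 2))
    print("variance:", round(var_formula, 2), "vs simulated", round(w.var(), 2))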

Both the expected value and variance of the Mann-Whitney statistic depend on properties of the data distributions that are in principle unknown. With additional assumptions, more can be said; there are also approximations. For example, if the critical value of a right-tailed test using the Mann-Whitney statistic is c, then the normal approximation (with continuity correction) to the power of the test against a fixed alternative G ≠ F (where F is the cdf of the control responses and G is the cdf of the treatment responses) is

Φ(F, G) ≈ 1 − z( (c − ½ − mnp1)/SE(WXY) ),

where z(x) is the standard normal cdf. In the shift model, for small shift d,

ΦFX(d) ≈ z( (12mn/(N+1))½ f*(0) d − z1−α ),

where z is the standard normal cdf, z1−α is the 1−α quantile of the standard normal distribution, and f* is the density of the difference between two independent random variables, each with distribution FX.
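
For a concrete (and hypothetical) illustration of the small-shift approximation: if FX is standard normal, then f* is the N(0, 2) density, so f*(0) = 1/(2√π). The sketch below evaluates the approximation for a few values of d.

    # Sketch: evaluating the small-shift power approximation for a standard
    # normal F_X. Assumptions: scipy available; m, n, alpha, and d are
    # illustrative values.
    import numpy as np
    from scipy.stats import norm

    m = n = 25
    N = m + n
    alpha = 0.05
    f_star_0 = 1 / (2 * np.sqrt(np.pi))   # density of X - X' at 0 when F_X is standard normal

    for d in (0.2, 0.5, 1.0):
        power = norm.cdf(np.sqrt(12 * m * n / (N + 1)) * f_star_0 * d - norm.ppf(1 - alpha))
        print(f"d = {d}: approximate power ≈ {power:.3f}")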

Student's t-test is as follows: reject the hypothesis that treatment has no effect when

(Ȳ − X̄)/( S(1/m + 1/n)½ ) ≥ c,

where X̄ is the sample mean of the control observations, Ȳ is the sample mean of the treatment observations, and S is the square root of the pooled sample variance

S² = ( (X1 − X̄)² + (X2 − X̄)² + … + (Xm − X̄)² + (Y1 − Ȳ)² + (Y2 − Ȳ)² + … + (Yn − Ȳ)² )/(m + n − 2).

The critical value c is the 1−α quantile of Student's t-distribution with m+n−2 degrees of freedom. Under the assumptions that the control and treatment observations are independent and normally distributed with a common variance, that the control observations have a common mean, and that the treatment observations have a common mean (which differs from the mean of the control observations by d), Student's t-test has level α and is the uniformly most powerful unbiased level α test of the hypothesis d=0 against the alternative d>0.
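
The sketch below computes the pooled two-sample t statistic just described for invented data and checks it against scipy.stats.ttest_ind with equal_var=True; the data, sample sizes, and level are arbitrary.

    # Sketch: pooled two-sample t statistic, checked against scipy.
    # Assumptions: numpy and scipy available; the data are invented.
    import numpy as np
    from scipy.stats import ttest_ind, t as t_dist

    rng = np.random.default_rng(4)
    x = rng.normal(size=10)            # hypothetical control observations
    y = rng.normal(loc=0.8, size=12)   # hypothetical treatment observations

    m, n = len(x), len(y)
    s2 = (np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)) / (m + n - 2)
    t_stat = (y.mean() - x.mean()) / np.sqrt(s2 * (1 / m + 1 / n))
    c = t_dist.ppf(1 - 0.05, df=m + n - 2)    # 1 - alpha quantile, m + n - 2 degrees of freedom

    print("t statistic:", round(t_stat, 4), "  scipy:", round(ttest_ind(y, x, equal_var=True).statistic, 4))
    print("reject at level 0.05?", t_stat >= c)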

Student's t-test is asymptotically distribution-free: even when FX is not normal, under the null hypothesis that FX = FY, provided the observations are iid and the variance of FX is finite, the distribution of the t-test statistic is asymptotically normal with mean zero and variance one.

Suppose that FX is given and consider the shift alternative with shift d. Let n=m and fix the significance level α. Let Φ denote the power of the Wilcoxon test, and consider the sample size n'=m' that would be required for Student's t-test to have power Φ for the same level α, the same shift d, and the same distribution FX for the control observations. The ratio e=n'/n is called the efficiency of the Wilcoxon test relative to Student's t-test. Low efficiency means many more observations are required for the Wilcoxon test to have the same performance (significance level and power) as Student's t-test.

Perhaps surprisingly, when FX is normal, the efficiency of the Wilcoxon test is pretty large: Table 2.2 of Lehmann (1998) shows that for n=m=25, α=0.01, and d between about 0.14 and 1.3, the approximate efficiency is between 0.85 and 0.96. For n=m=5 and α=4/126 and d between 0.5 and 4, the efficiency is between about 0.96 and 0.98.

Thus (for this range of parameters) if the true distribution of the control and treatment observations really is normal, little is lost in using the nonparametric Wilcoxon test, while if the true distribution is not normal, the Wilcoxon test has the advantage of maintaining level α, whereas Student's t-test can have a radically different true level.

When the distribution FX is not normal, the Wilcoxon test can have better performance than Student's t-test. Asymptotically in n and m, the efficiency does not depend on the target power or the significance level, provided the significance level is less than the power (the limit is called the Pitman efficiency or the asymptotic relative efficiency). The Pitman efficiency of the Wilcoxon test relative to Student's t-test is about 1.1 for the logistic distribution, 1.5 for the double exponential distribution, 1 for the uniform distribution, and 3 for the exponential distribution (Lehmann, 1998, p. 80). The performance of the Wilcoxon test is remarkably good (compared with that of Student's t-test) for long-tailed distributions. Moreover, the performance of the Wilcoxon test is never much worse than that of Student's t-test: provided the variance of FX is finite, the asymptotic relative efficiency is at least 0.864 (see Lehmann, 1998, p. 377).

The normal scores test

Reference: Lehmann (1998, pp. 96–97).

The idea of the normal scores test is to replace each observation by the expected value of the normal order statistic corresponding to the observation's rank, then to compute Student's t-statistic from those substituted values. The critical values for the resulting test are not the critical values of Student's t-distribution, but they have been tabulated. The remarkable result is that this test, which depends only on the ranks of the observations, has asymptotic (Pitman) efficiency of at least one relative to Student's t-test, whatever FX may be. The asymptotic relative efficiency is one when FX is normal.

Estimating the shift d

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

Assume that the effect of treatment is just a shift d. Then the distribution of {Xj: j = 1, … , m} is the same as that of {(Yj − d): j = 1, … , n}. Consider a two-sided test of H: d=0. Suppose that the null distribution of WXY is symmetric about some value μ, so we would reject H if WXY ≥ μ+c or if WXY ≤ μ−c. Then a reasonable estimate of d is the value d' for which the statistic WX,Y−d' is closest to μ. This value d' turns out to be the median of the mn differences Dij = Yj − Xi, if we define the median of an even number of observations to be the mean of the two central observations in the sorted list.
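
A minimal sketch of this estimator (the Hodges-Lehmann estimator) for invented data: form all mn differences Yj − Xi and take their median.

    # Sketch: the shift estimator d' as the median of all pairwise differences.
    # Assumptions: numpy available; the data and the true shift are invented.
    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.normal(size=9)               # hypothetical control responses
    y = rng.normal(loc=1.5, size=11)     # hypothetical treatment responses, true shift 1.5

    differences = (y[None, :] - x[:, None]).ravel()   # all D_ij = Y_j - X_i
    d_hat = np.median(differences)                    # numpy averages the two central values when mn is even
    print("estimated shift d' ≈", round(float(d_hat), 3))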

Lehmann (1998) shows that the distribution of the error of this estimator, d'−d, does not depend on d:

d'−d = med(Yj − Xi) − d

= med(Yj − d − Xi)

= med((Yj − d) − Xi),

but the distributions of (Yj − d) and Xi don't depend on d.

The estimator d' is unbiased and median unbiased (at least) if the distribution FX is symmetric or if n=m, because then the distribution of d' is symmetric about d.

Confidence intervals for the shift d

The approach here is based on the duality between tests and confidence sets. Suppose we have a family of level α tests of the null hypotheses Hd': d=d', one for each real d'. Let S(y) denote the set of values d' for which the test of the corresponding hypothesis d=d' would not reject if the data were y (here y is generic; in the problems we have been considering, y specifies all the treatment responses and all the control responses). Then S(Y) is a 1−α confidence set for d.

Proof. If f is the true value of d, then the probability (under f) that the corresponding test rejects the hypothesis d=f is at most α. The probability that the test does not reject is therefore at least 1−α, so with probability at least 1−α under f, the value f is in S(Y). That is,

Pf ( S(Y) contains f ) ≥ 1−α.

The shape and size of the confidence set are tied to the family of hypothesis tests that are being "inverted." No matter what, the procedure gives a 1−α confidence set, but it can be large or bizarre (e.g., consist of disjoint pieces) according to the nature of the tests. The duality between tests and confidence sets can be exploited to produce confidence sets with desirable properties; see, for example, Benjamini and Stark (1996. Non-equivariant simultaneous confidence intervals less likely to contain zero, J. Amer. Stat. Assoc., 91, 329–337); Benjamini, Hochberg and Stark (1998. Confidence Intervals with more Power to determine the Sign: Two Ends constrain the Means, J. Amer. Stat. Assoc., 93, 309–317); or Evans, Hansen and Stark (2005. Minimax Expected Measure Confidence Sets for Restricted Location Parameters, Bernoulli, 11, 571–590).

To use this result to find a confidence set for d, we need a family of level α tests of the hypotheses d=d'.

Consider the Mann-Whitney statistic WXY. Under the strong null hypothesis, the expected value of WXY is mn/2. Suppose that the critical values for the two-sided test of the hypothesis that d=0 are mn/2−c and mn/2+c. Note that under the hypothesis that d=d', the distribution of the Mann-Whitney statistic for the data

{X1, … , Xm, Y1−d', … , Yn−d'},

which we denote WX,Y−d', is the same as the distribution of WXY under the strong null hypothesis. A two-sided level-α test of the hypothesis d=d' thus would reject if WX,Y−d' ≤ mn/2−c or if WX,Y−d' ≥ mn/2+c. This is a family of tests on which we can base the confidence set.

We can find a confidence interval for d using these tests as follows: Let d' be the estimated treatment effect, the median of the mn differences Dij. Subtract d' from the treatment observations; compute the ranks of this new data set; perform a two-sided Wilcoxon rank sum test of the null hypothesis of no treatment effect using these new data. If the test does not reject, include d' in the confidence set. Increase d' systematically until the test rejects, to get the upper endpoint of the confidence interval; decrease d' systematically until the test rejects, to get the lower endpoint of the confidence interval.

It is clear that the only values of d' that could be endpoints of the confidence interval are those that cause the ranks of the treatment observations to change—only those values change the value of WXY. We can exploit this to find a confidence interval for d with less trial-and-error searching.

To change the value of WX,Y−d', we need to change d' by enough to change the number of pairs (i, j) such that Xi < Yj − d'. The values of d' that change the ranks are precisely the differences

Dij =  Yj − Xi.

Let D(1), … , D(mn) denote the order statistics of the mn differences Dij, and define D(0) = −∞ and D(mn+1)= ∞. Then

D(k) ≤ d < D(k+1) if and only if WX,Y−d = mn − k.

To see this, note that WX,Y−d counts the pairs (i, j) with Xi < Yj − d, that is, the differences Dij that exceed d. Thus WX,Y−d = mn − k exactly when k of the differences are at most d and mn − k of them exceed d, that is, when D(k) ≤ d < D(k+1). The distribution of WX,Y−d when d is the true shift does not depend on d—it is the same as the distribution of WXY under the strong null hypothesis, as asserted previously. Using the symmetry of the null distribution of WXY, we see that for any d,

Pd( D(k) ≤ d < D(k+1) ) = P0(WXY = mn − k) = P0(WXY = k).

Thus the random differences Dij carve the real line into random intervals with known probabilities of containing the true value of d (if the shift alternative is true!). To find a confidence interval for d, we can start with (one of) the intervals [D(k), D(k+1)) with the largest chance of containing d, then append intervals in decreasing order of their probability of containing d until the sum of their probabilities is at least 1−α. Because of the unimodality of the distribution (recall the power calculations), these intervals will be contiguous; let [D(l), D(u)) denote their union. Alternatively, we could choose the intervals to try to have symmetric non-coverage probabilities; that is, so that

Pd(D(l) > d ) is as close as possible to Pd(D(u) < d ).

(I don't know whether these two approaches differ; I think the former corresponds to inverting the two-sided Wilcoxon test.) All the probability calculations are implicit in calculating the distribution of WXY under the strong null hypothesis. In fact, one need not even calculate the entire distribution; see Lehmann (1998, pp. 92ff) for shortcuts.
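
Here is a sketch of the whole construction for invented data: it computes the exact null distribution of WXY by enumerating the NCn equally likely rank configurations (feasible for small m and n), then collects the intervals [D(k), D(k+1)) in decreasing order of probability until their total probability is at least 1−α. The data, sample sizes, and α are arbitrary, and the contiguity of the selected block is checked by an assertion rather than assumed.

    # Sketch: confidence interval for the shift d from the ordered pairwise
    # differences, using the exact null distribution of W_XY.
    # Assumptions: numpy available; data, m, n, and alpha are illustrative.
    from itertools import combinations
    from math import comb
    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=7)                   # hypothetical control responses
    y = rng.normal(loc=1.0, size=6)          # hypothetical treatment responses
    m, n = len(x), len(y)
    N, mn = m + n, m * n
    alpha = 0.05

    # Exact null pmf of W_XY: if the treatment ranks are s_1 < ... < s_n, then
    # W_XY = sum_j (s_j - j); each of the C(N, n) rank sets is equally likely.
    pmf = np.zeros(mn + 1)
    for s in combinations(range(1, N + 1), n):
        pmf[sum(r - j for j, r in enumerate(s, start=1))] += 1
    pmf /= comb(N, n)

    # P_d(D_(k) <= d < D_(k+1)) = P_0(W_XY = k): keep the most probable k's
    # until their total probability is at least 1 - alpha.
    order = np.argsort(pmf)[::-1]
    chosen, total = [], 0.0
    for k in order:
        chosen.append(k)
        total += pmf[k]
        if total >= 1 - alpha:
            break
    assert max(chosen) - min(chosen) + 1 == len(chosen)   # contiguous block, by unimodality
    lo, hi = min(chosen), max(chosen) + 1                  # interval [D_(lo), D_(hi))

    D = np.concatenate([[-np.inf], np.sort((y[None, :] - x[:, None]).ravel()), [np.inf]])
    print("confidence level ≈", round(total, 3))
    print("interval for d: [", round(D[lo], 3), ",", round(D[hi], 3), ")")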

Confidence intervals for quantiles from iid observations

The approach illustrated above for finding a confidence interval for the shift d is very similar to a nonparametric approach to finding confidence intervals for a quantile of a probability distribution. Let {Xj: 1 ≤ j ≤ n} be iid with continuous cdf F. For 0<q<1, let xq = F−1(q) denote the qth quantile of the common distribution of the data. We shall find a confidence interval for xq that does not depend on any other assumptions about F.

If xq=x, the probability that each datum Xj will be at most x is q. The data are independent, so the number N(x) of data that are less than or equal to x has a binomial distribution with parameters n and q. Let

pn,q(k) = nCk × q^k × (1−q)^(n−k) = Pxq=x( N(x) = k ),   k = 0, 1, … , n.

Note that Pxq=x( N(x) = k ) does not depend on x, nor on any other feature of F. The probability distribution of N(x) is unimodal; we could build a level α test of the hypothesis that xq=x by assigning to the rejection region the values of k for which pn,q(k) is smallest, subject to the constraint that the sum of their probabilities is at most α.

To find a confidence interval for xq, we could start with the kth order statistic X(k) as a trial value for x, where k is as close as possible to q×n, then work "outward" as before, increasing and decreasing x until the test of the hypothesis xq=x rejects.

We can streamline the approach in the same way as we did to find a confidence interval for the treatment effect d. The values of x at which N(x) changes are the order statistics X(j), so it suffices to consider them when searching for the endpoints of the confidence interval. Now

N(x) = k   iff   X(k) ≤ x < X(k+1),   k = 0, 1, … , n,

where X(0) = −∞ and X(n+1) = ∞. Thus

Pxq( X(k) ≤ xq < X(k+1) ) = nCk × q^k × (1−q)^(n−k).

The order statistics carve the real line into intervals with known chances of containing the true value of xq. Those chances are determined by the binomial distribution with parameters n and q. We can find a confidence interval for xq by starting with the interval [X(k), X(k+1)) with the largest chance of containing xq, then appending intervals in decreasing order of their probability of containing xq until the sum of their probabilities is at least 1−α.

For the median, q=0.5, and the probability distribution of the number of data that are less than or equal to xq is symmetric about n/2. Consider a confidence interval for the median on the basis of n=20 iid observations. For the binomial distribution with parameters n=20 and p=0.5, the probability of between 8 and 12 successes is 73.7%. Eight "successes" corresponds to the interval [X(8), X(9)) containing the true median; twelve successes corresponds to the interval [X(12), X(13)) containing the median, so the interval [X(8), X(13)), from the eighth order statistic to the thirteenth order statistic, is a 73.7% confidence interval for the median of the parent distribution on the basis of 20 iid observations. Note that this confidence interval trims off the 7 smallest and 7 largest observations.
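
The sketch below implements this construction for a general quantile q and invented data; numpy and scipy are assumed. With n=20, q=0.5, and a requested level of 70%, it reproduces the interval [X(8), X(13)) with exact coverage 73.7% described above.

    # Sketch: nonparametric confidence interval for the q-th quantile from the
    # order statistics, via the binomial distribution.
    # Assumptions: data are iid with a continuous cdf; the sample is invented.
    import numpy as np
    from scipy.stats import binom

    def quantile_ci(data, q=0.5, level=0.95):
        """Return a (lower, upper) order-statistic interval for x_q and its exact coverage."""
        x = np.sort(np.asarray(data, dtype=float))
        n = len(x)
        pmf = binom.pmf(np.arange(n + 1), n, q)     # P(N(x_q) = k), k = 0..n
        order = np.argsort(pmf)[::-1]               # most probable k first
        chosen, total = [], 0.0
        for k in order:
            chosen.append(k)
            total += pmf[k]
            if total >= level:
                break
        lo, hi = min(chosen), max(chosen) + 1       # interval [X_(lo), X_(hi))
        padded = np.concatenate([[-np.inf], x, [np.inf]])   # X_(0) = -inf, X_(n+1) = +inf
        return (padded[lo], padded[hi]), total

    rng = np.random.default_rng(7)
    sample = rng.normal(size=20)
    (lo, hi), coverage = quantile_ci(sample, q=0.5, level=0.7)
    print("interval:", (round(lo, 3), round(hi, 3)), "  exact coverage ≈", round(coverage, 3))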


Exercise. Generalize this approach to finding confidence intervals for quantiles to find conservative confidence intervals for quantiles when the cdf F is not necessarily continuous.