Paired Comparisons and Multiple Treatments

Reference: Lehmann, E.L. (1998). Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.

The sign test

The rank sum test has the most power when the population is relatively homogeneous. If the population is extremely heterogeneous, the variability of responses across individuals can make it hard to detect a real treatment effect. When the population is heterogeneous, sometimes we can have more power to detect a real treatment effect by dividing the population into subgroups that are less heterogeneous than the population, making comparisons within subgroups, and then combining the results across subgroups. The farthest this idea can go is to divide the population into pairs of subjects as similar as possible to each other in the values of some measured variables.

Suppose we form pairs of subjects according to the values of some variables that might affect the outcome, then randomly assign one member of each pair to treatment and one to control, by a fair coin toss, for example. Let there be n pairs of subjects. Suppose the assignment to treatment or control is independent across pairs. We will look at two approaches to testing whether treatment affects the response. The first method, the sign test, works well even when different pairs are heterogeneous (but see Freedman, 2005). The second method, the Wilcoxon signed rank test, works better when the different pairs are homogeneous.

Let S denote the number of pairs for which the response of the treated subject is larger than the response of the subject assigned to control. Suppose for the moment that there are no ties within pairs. Under the null hypothesis that treatment makes no difference, S has a binomial probability distribution with parameters n and p=½. If treatment has a positive effect, we would expect S to be larger than n/2, so we would reject for large values of S. Critical values can be calculated from the binomial distribution. This gives the sign test based on paired samples.

Suppose now that there are ties within pairs. Consider the test statistic S' = S + ½S0, where S0 is the number of ties. Under the strong null hypothesis, pairs are not separated, so the number of ties S0 is always the same: the probability distribution of S' is that of a constant (½S0) plus a binomial random variable with parameters n − S0 and p=½. Critical values for the test thus can be computed using the binomial distribution (but for n − S0 trials). (Ties might instead be broken at random independently; it turns out that discarding ties leads to the more powerful test.)
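
As a concrete illustration, here is a minimal R sketch (the function signTest is our name, not from Lehmann) that computes the one-sided P-value of the sign test from paired responses, discarding tied pairs, i.e., conditioning on S0:

        signTest <- function(x, y) {
        # One-sided P-value for the sign test; x are control responses,
        # y the corresponding treatment responses. Tied pairs are
        # discarded, i.e., we condition on the number of ties S0.
            d <- y - x;
            s <- sum(d > 0);          # pairs the treated subject "wins"
            nEff <- sum(d != 0);      # n - S0 untied pairs
            pbinom(s - 1, nEff, 0.5, lower.tail=FALSE);  # P(S >= s) for Binomial(n - S0, 1/2)
        }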

A nice feature of the test is that within pairs, one only needs to be able to judge which subject's response is better—that is, one only needs to rank pairs of observations. That is equivalent to using just the sign of the difference between the two responses, which is why this is called the sign test. However, because the test does not use the magnitude of the difference, typically there are tests with better power when quantitative differences are available; see below. Thus the primary application of the sign test is to situations where only the sign of the difference is observed.

Like the Wilcoxon rank sum test, the sign test also makes sense in some population models, not just the randomization model. With suitable restrictions, the critical values are the same, and one can calculate the power against various alternatives.

Suppose we draw n pairs of subjects from a population of pairs, and assign one from each pair to treatment and one to control, at random with probability ½, independently across pairs. Let Xj denote the response of the subject assigned to control from the jth pair, and let Yj denote the response of the subject assigned to treatment from the jth pair. The pairs {(Xj, Yj): 1 ≤ j ≤ n} are identically distributed. Following Lehmann's notation, define

M(x,y) = P(Xj ≤ x and Yj ≤ y).

Let Zj = Yj − Xj. The random sampling assumption implies that {Zj} are identically distributed. Let FZ denote their common cdf. If there is no treatment effect, Xj and Yj are exchangeable, so the distribution of Zj is the same as the distribution of −Zj. That is, under the hypothesis that treatment has no effect, FZ(z) = 1 − FZ(−z): the distribution of Zj is symmetric about zero. If treatment tends to increase the response, the distribution of Zj will favor positive values.

Suppose the population of pairs is large enough compared with n that we can neglect the fact that the sample of n pairs is drawn without replacement, so that the pairs are effectively iid, not just identically distributed. Then {Zj} are also iid. This is the population model; the null hypothesis is that the distribution of the Zj is symmetric about zero; the alternative hypothesis is that it favors positive values. Suppose that FZ is continuous so that the probability that Zj=0 is 0. Then, under the null hypothesis,

P(Zj > 0) = P(Zj < 0) = ½.

The Zj's are independent, so the null probability distribution of the number of strictly positive Zj's is binomial with parameters n and ½. Under the alternative that the probability that Zj > 0 is larger than ½, the number of positive Zj's will tend to be larger; its probability distribution is binomial with parameters n and p = 1 − FZ(0) > ½. Thus the sign test derived above is applicable; the power of the test is clearly monotone in p, so the test is unbiased against alternatives p > ½. In the shift model FZ(z) = F0(z − d), where F0 is the cdf of a continuous distribution that is symmetric about zero, the level α sign test of the hypothesis that the distribution of the Z's is symmetric about zero against the alternative that positive values are favored is also a level α test of the null hypothesis d ≤ 0 against the alternative d > 0.
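
For example, under the illustrative assumption that the differences Zj are N(d, σ^2), so that p = P(Zj > 0) = pnorm(d/σ), the power of the one-sided sign test follows from the binomial distribution. Here is a minimal R sketch (signTestPower is our name, not a standard function):

        signTestPower <- function(n, d, sigma, alpha) {
        # Power of the one-sided level-alpha sign test when the
        # treatment-control differences are N(d, sigma^2), an
        # assumption made purely for illustration.
            crit <- qbinom(1 - alpha, n, 0.5) + 1;     # smallest c with P(S >= c) <= alpha under the null
            p <- pnorm(d/sigma);                       # P(Z_j > 0) = 1 - F_Z(0)
            pbinom(crit - 1, n, p, lower.tail=FALSE);  # P(S >= crit) under the alternative
        }

        signTestPower(25, 0.5, 1, 0.05);   # e.g., 25 pairs, shift of half an SD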

The number of positive Zj's also has a binomial distribution with parameters n and ½ in other situations, for example, when the Zj's are independent, not necessarily identically distributed, but each has probability ½ of being positive or negative (for example, each has a probability distribution that is symmetric about zero). Then the sign test is a level α test of the null hypothesis against the alternative that the distributions of the Zj's favor positive values. See Divenyi, Stark and Haupt, 2005. Decline of Speech Understanding and Auditory Thresholds in the Elderly, J. Acoustical Soc. Am., 118, 1089–1100, for examples.

Note that the kinds of violations of the null hypothesis that the sign test is sensitive to are those for which the median of the Zj's is positive, that is, those for which P(Zj > 0) > ½; other violations of symmetry result in power equal to the significance level.

As before, when ties are possible, the null distribution of the number of strictly positive Zj's given the number of zero Zj's is binomial with parameters n−(#ties) and ½, so we can use the same approach we used for the Wilcoxon rank sum test to get conditional tests in the presence of ties.

The sign test also can be used to test hypotheses about the median of a continuous distribution from iid observations {Z'j: 1 ≤ j ≤ n}, because then the variables Zj = Z'j − d are iid with P(Zj < 0) = P(Zj > 0) = ½ if d is their median. This is essentially a special case of the approach we used to find confidence intervals for quantiles from iid observations.
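
A minimal sketch of this use of the sign test, against the alternative that the median exceeds the hypothesized value d (medianSignTest is our name):

        medianSignTest <- function(z, d) {
        # One-sided P-value for the null hypothesis that the median of
        # the continuous distribution that generated the iid sample z
        # is d, against the alternative that the median exceeds d.
            s <- sum(z > d);
            pbinom(s - 1, length(z), 0.5, lower.tail=FALSE);
        }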

The Wilcoxon Signed-Rank Test

The sign test ignores all information but the signs of the differences between treatment and control. That allows the pairs to be completely different from each other. When there is some homogeneity across pairs, it can help to retain some quantitative information about the response: the difference between the control and treatment responses for each pair, or at least the rank of the difference. Suppose there are n pairs of subjects, as before, with one member of each pair assigned to treatment and one to control, at random, with probability ½, independently across pairs. We are interested in the alternative hypothesis that treatment increases the response. Let t be the number of pairs for which the treatment response "wins"—exceeds the control response—and let c = n − t denote the number of wins for control. For the moment, we assume that there are no ties among the differences, and that none of the differences is zero. We rank the response differences by absolute value, then give each rank the sign of the corresponding difference. Let R1 < R2 < … < Rc be the ranks with negative signs, and let S1 < S2 < … < St be the ranks with positive signs. Then

{R1, R2, …, Rc, S1, S2, …, St} = {1, 2, …, n}.

There are 2^n possible sign combinations for the ranks ±1, ±2, …, ±n. Specifying a combination of signs is equivalent to listing the positive-signed ranks {S1, S2, …, St}. If treatment has no effect, the rank of the absolute response difference of each pair is fixed, but the signs of the response differences are iid, positive with probability ½ and negative with probability ½. Under the null hypothesis, each of the 2^n assignments of signs is equally likely; each corresponds to a number N+ = x ≤ n of positive differences, and a subset {S1, …, Sx} of x of the n ranks. Thus, under the strong null hypothesis,

P(N+ = x; S1 = s1, S2 = s2, …, Sx = sx) = 1/2^n.

On the other hand, if treatment tends to increase the response, we would expect more and larger treatment differences to be positive. That is, we would expect

Vs = S1 + S2 + … + St

to be large. Note that under the strong null hypothesis, t is random. The test statistic Vs leads to the Wilcoxon signed rank test: reject the strong null hypothesis if Vs ≥ C, where C is chosen so that the test has the right significance level. The critical values for the Wilcoxon signed rank test are not the same as the critical values for the Wilcoxon rank sum test.

When there are three pairs of subjects, the possible ranks of the absolute values of the three treatment differences are {1, 2, 3}, and the possible signed ranks are {±1, ±2, ±3}. The following table shows all 2^3 = 8 possible signed-rank combinations, and the corresponding values of the sum Vs of the positive signed ranks.

        Signed ranks      Vs
        −1, −2, −3         0
        +1, −2, −3         1
        −1, +2, −3         2
        +1, +2, −3         3
        −1, −2, +3         3
        +1, −2, +3         4
        −1, +2, +3         5
        +1, +2, +3         6

Under the strong null hypothesis, all 8 assignments of signed ranks are equally likely, so the chance of each is 1/8. The null distribution of the sum of the positive signed ranks is therefore:

        v            0     1     2     3     4     5     6
        P(Vs = v)   1/8   1/8   1/8   2/8   1/8   1/8   1/8

Thus to test the null hypothesis at significance level 12.5% against the alternative that treatment increases the response, we would reject if Vs = 6.
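
The two tables above can be reproduced by brute-force enumeration of the 2^n sign assignments; here is a small R sketch (signedRankNullDist is our name):

        signedRankNullDist <- function(n) {
        # Exact null distribution of the sum of positive signed ranks
        # V_s for n pairs with no ties, by enumerating all 2^n
        # assignments of signs to the ranks 1, ..., n.
            signs <- expand.grid(rep(list(c(0,1)), n));  # each row: which ranks get a + sign
            v <- as.matrix(signs) %*% (1:n);             # sum of the positively signed ranks
            table(v)/2^n;
        }

        signedRankNullDist(3);   # reproduces the null distribution above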

The Wilcoxon signed rank test tends to be more powerful than the sign test. Critical values are in Lehmann (1998); they can also be estimated by simulation. There is a normal approximation as well: under the strong null hypothesis,

E(Vs) = n(n+1)/4; Var(Vs) = n(n+1)(2n+1)/24.

See Lehmann (1998). The Wilcoxon signed rank test can be extended to allow for ties by using midranks rather than ranks; the null distribution of the test statistic then depends on the configuration of ties, as it does for the Wilcoxon rank sum test and the sign test. Simulation is straightforward. Here is sample code in R:

        midRanks <- function(x) {
        # P.B. Stark, 10/5/05.  www.stat.berkeley.edu/~stark
        #
        # calculates vector of midranks
            sapply(x, function(z){(sum(x <= z) + sum(x < z) + 1)/2});
        }

        signedRankSum <- function(x,y) {
        # P.B. Stark, 10/5/05.  www.stat.berkeley.edu/~stark
        #
        # signed rank sum for control responses x and treatment
        # responses y.
        # Calls midRanks().
            r <- midRanks(abs(y-x));
            sum(r[(y-x)>0]);
        }

        wilcoxonSignedRankSim <- function(x, y, iter) {
        # P.B. Stark, 10/5/05.  www.stat.berkeley.edu/~stark
        #
        # simulated one-sided P-value for the Wilcoxon signed rank
        # test using midranks.
        #
        # x: vector of control responses
        # y: vector of treatment responses
        # iter: number of replications
        #
        # Calls midRanks()
            r <- midRanks(abs(y-x));
            v <- sum(r[(y-x) > 0]);
            sum(replicate(iter, sum(r[runif(length(x)) > 0.5])) >= v)/iter
        }
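
For instance, with made-up paired data, one can compare the simulated P-value with the normal approximation:

        # Hypothetical data: 8 paired responses, control then treatment.
        x <- c(10.1, 9.4, 12.0, 11.3, 9.9, 10.7, 11.8, 10.4);
        y <- c(11.0, 9.8, 12.5, 11.2, 10.9, 11.6, 12.3, 10.8);
        wilcoxonSignedRankSim(x, y, 10000);        # simulated one-sided P-value
        v <- signedRankSum(x, y);                  # observed statistic
        n <- length(x);
        pnorm(v, n*(n+1)/4, sqrt(n*(n+1)*(2*n+1)/24),
              lower.tail=FALSE);                   # normal approximation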

Combining dissimilar blocks of homogeneous subjects

We have been working with pairs, which are the finest-grained blocks possible. Pairing is helpful when the population is extremely heterogeneous; it lets us control for differences that might influence the effect of treatment, provided we can find pairs of individuals who are similar with respect to those differences. Earlier (in the discussion that led to the Wilcoxon rank sum test), we assumed that individuals were homogeneous. Between these two extremes, we might have dissimilar blocks of homogeneous individuals, which we consider here.

Suppose we have b blocks. The jth block contains Nj subjects. Of those, nj are assigned at random to treatment, and the remaining mj = Nj − nj are assigned to control. There are

N = N1 + N2 + … + Nb

subjects in all, of which

n = n1 + n2 + … + nb

are assigned at random (within blocks) to treatment and

m = m1 + m2 + … + mb

are assigned to control. In this scheme, there are

Nassign = N1Cn1 × N2Cn2 × … × NbCnb

equally likely possible assignments of subjects to treatment and control. To allow for heterogeneity across blocks while combining evidence from different blocks, we could calculate the Wilcoxon rank sum statistic separately for each block, then form a weighted sum of the statistics across blocks. Let Wj be the Wilcoxon rank sum calculated for block j by itself. One choice of test statistic for the collection of blocks is

W = W1/(N1+1) + W2/(N2+1) + … + Wb/(Nb+1).

This choice downweights blocks with larger numbers of subjects. The null distribution of this statistic can be calculated using the fact that under the null hypothesis, the Nassign assignments of ranks are equally likely; alternatively, the null distribution can be simulated using random sampling.
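
Here is a sketch in the same style as the code above (the function names are ours); it calls midRanks() and estimates the P-value by re-assigning treatment labels at random, independently within blocks:

        blockRankSum <- function(xList, yList) {
        # Weighted sum of within-block rank sums: block j's rank sum
        # (computed from midranks of the pooled block responses) is
        # divided by N_j + 1.
        # xList, yList: lists of control and treatment responses, by block.
            w <- 0;
            for (j in 1:length(xList)) {
                r <- midRanks(c(xList[[j]], yList[[j]]));
                Nj <- length(r);
                w <- w + sum(r[(length(xList[[j]])+1):Nj])/(Nj + 1);
            }
            w;
        }

        blockRankSumSim <- function(xList, yList, iter) {
        # Simulated one-sided P-value for the blocked rank sum test:
        # assignments are permuted independently within each block.
            w <- blockRankSum(xList, yList);
            sim <- replicate(iter, {
                ws <- 0;
                for (j in 1:length(xList)) {
                    z <- c(xList[[j]], yList[[j]]);
                    r <- midRanks(z);
                    treat <- sample(length(z), length(yList[[j]]));  # random assignment within block
                    ws <- ws + sum(r[treat])/(length(z) + 1);
                }
                ws;
            });
            sum(sim >= w)/iter;
        }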

More than two treatments: the Kruskal-Wallis Test

Suppose there are s mutually exclusive categories, and let

n = n1 + n2 + … + ns.

How many ways can the n objects be placed into the s categories, so that n1 are assigned to category 1, n2 are assigned to category 2, etc.? There are nCn1 ways of selecting n1 of the objects to assign to the first category; once that is done, there are n−n1Cn2 ways of selecting n2 of the remaining n − n1 objects to assign to the second category, and so on. Finally, only ns objects remain for the sth category. The total number of possible assignments is thus

nCn1,n2,…,ns = nCn1 × n−n1Cn2 × … × n−n1−n2−…−ns−1Cns = n!/(n1!n2! … ns!).

This is the multinomial coefficient.
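
For instance, the count for the car-wax example below, 6C2,2,2, can be computed in R either way:

        choose(6,2) * choose(4,2) * choose(2,2);                  # 15 * 6 * 1 = 90
        factorial(6)/(factorial(2)*factorial(2)*factorial(2));    # also 90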

Suppose we wish to compare s different treatments using n subjects. We select n1 at random and assign them to the first treatment; we select n2 at random and assign them to the second treatment, and so on. Suppose that there are no ties among the responses. Under the strong null hypothesis that all the treatments are the same, all nCn1,n2, … , ns assignments of the n ranks among the treatments are equally likely.

Consider the ranks of the pooled responses, and suppose for the moment that there are no ties (we will generalize to midranks presently). Let Rj denote the mean of the ranks of the responses to treatment j. Under the null hypothesis, the expected value of Rj is the average of all the ranks (if there are no tied observations, this is (n+1)/2; more generally, the expected mean rank for each treatment is the average of the midranks of the pooled responses). Let r denote this expected value. We can test the null hypothesis using a weighted sum of squares of the deviations of the treatment mean ranks from their expected value:

K = (12/(n(n+1))) × ( n1(R1 − r)^2 + n2(R2 − r)^2 + … + ns(Rs − r)^2 ).

We reject the null hypothesis if K ≥ C, where C is chosen to get the right significance level. If there are no ties, under the null hypothesis K converges in distribution to a chi-square random variable with s−1 degrees of freedom as n1, …, ns grow. If there are ties, a different definition is usually used; see Lehmann (1998).

Here is R code to simulate the null distribution of K computed using midranks.

        kruskalWallisSim <- function(x, iter) {
        # P.B. Stark, 10/5/05.  www.stat.berkeley.edu/~stark
        #
        # Simulated P-value for the Kruskal-Wallis test using midranks.
        # Does not normalize the Kruskal-Wallis statistic.
        #
        # Calls midRanks()
        #
        # x: list of arrays of observed responses. The number
        #    of arrays is the number of treatments.
        # iter: number of iterations in the simulation
        #
            names(x) <- 1:length(x);
            resp <- stack(x);
            resp[,1] <- midRanks(resp[,1]);
            resp[,2] <- factor(resp[,2]);
            exRank <- mean(resp[,1]);          # expected mean rank for each treatment
            sqComp <- function(z){length(z)*(mean(z)-exRank)^2}
            ts <- sum(tapply(resp[,1],resp[,2],sqComp))
            sum(replicate(iter,(sum(tapply(sample(resp[,1]),resp[,2],sqComp))>=ts)))/iter;
        }

We want to compare two car waxes and control (no wax). We have 6 new cars. We select two at random to get wax A, two at random to get wax B, and leave two unwaxed. After a year, a judge who does not know which car received which treatment ranks the finishes of the cars. The ranks are as shown in the following table:

        Treatment   Ranks of finishes   Mean rank
        wax A       1, 3                2
        wax B       2, 6                4
        no wax      4, 5                4½

There are 6C2,2,2 = 90 possible assignments of the 6 cars among the 3 treatments with 2 cars per treatment. The average rank—which is the expected mean rank in each treatment under the null hypothesis—is r=3½.

K = (12/(6×7)) × ( 2(2 − 3½)^2 + 2(4 − 3½)^2 + 2(4½ − 3½)^2 ) = 2.

We could find the P-value for these data by enumerating the value of K for all 90 assignments; because the number of subjects assigned to each treatment is the same, there are lots of symmetries that reduce the work. Because the three treatments all have the same number of subjects, each assignment corresponds to 3! arrangements of the subjects that have the same value of the test statistic and the same probability under the null hypothesis. Thus we need consider only 90/6 = 15 arrangements (the partitions of the ranks {1, …, 6} into three pairs), as shown in the following table:

        Partition of ranks        K
        {1,2}, {3,4}, {5,6}       32/7 ≈ 4.57
        {1,2}, {3,5}, {4,6}       26/7 ≈ 3.71
        {1,2}, {3,6}, {4,5}       24/7 ≈ 3.43
        {1,3}, {2,4}, {5,6}       26/7 ≈ 3.71
        {1,3}, {2,5}, {4,6}       18/7 ≈ 2.57
        {1,3}, {2,6}, {4,5}       2
        {1,4}, {2,3}, {5,6}       24/7 ≈ 3.43
        {1,4}, {2,5}, {3,6}       8/7 ≈ 1.14
        {1,4}, {2,6}, {3,5}       6/7 ≈ 0.86
        {1,5}, {2,3}, {4,6}       2
        {1,5}, {2,4}, {3,6}       6/7 ≈ 0.86
        {1,5}, {2,6}, {3,4}       2/7 ≈ 0.29
        {1,6}, {2,3}, {4,5}       8/7 ≈ 1.14
        {1,6}, {2,4}, {3,5}       2/7 ≈ 0.29
        {1,6}, {2,5}, {3,4}       0

In 8 of the 15 equally likely outcomes, K ≥ 2, so the exact P-value for these data is 8/15 ≈ 0.533. The approximate P-value found using the R code above, with 10,000 iterations, is 0.533. The chi-square approximation to the P-value, using df = 2, is 0.368.
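
To reproduce these numbers with kruskalWallisSim() (grouping the ranks as in the table above; the simulated value varies slightly from run to run):

        waxRanks <- list(c(1,3), c(2,6), c(4,5));   # ranks for wax A, wax B, no wax
        kruskalWallisSim(waxRanks, 10000);          # simulated P-value, about 8/15
        pchisq(2, df=2, lower.tail=FALSE);          # chi-square approximation, about 0.368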

Other topics

2 × t contingency tables: See Lehmann (1998).