References: Lehmann, E.L., 1998. Nonparametrics: Statistical Methods Based on Ranks. Upper Saddle River, N.J.: Prentice Hall; SticiGui Chapter 19.

Ranks

The rank of an observation is its position in the list of observations after sorting the observations from smallest to largest. The jth smallest observation is the observation with rank j. If the data are { t1, … tN }, then t(j) is the observation with rank j, for j=1, 2, … , N. Thus

t(1) ≤ t(2) ≤ … ≤ t(N).

Ties can be broken arbitrarily. The smallest datum is t(1) and the largest observation is t(N).

A large class of nonparametric methods is based on working with the ranks of the data instead of the original values of the data. That is, t(1) is replaced by 1, t(2) is replaced by 2, and so on.
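For instance, in R (a minimal illustration; the data vector is made up):

        t <- c(4.7, 1.1, 3.2)   # made-up data
        sort(t)                 # 1.1 3.2 4.7 : the order statistics t(1), t(2), t(3)
        rank(t)                 # 3 1 2       : each observation replaced by its rank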

Why Rank-based methods?

In some problems, only the ranks of the data are observed. This often happens in situations where it is possible to make comparative judgements about outcomes, but not absolute judgements. For example, a doctor might be able to rank a collection of patients according to the severity with which they suffer from some disease they share, but might not be able to rate the severity on an objective quantitative scale—and certainly not a scale for which 5 is "better than" 4 by the same amount that 2 is "better than" 1. Similarly, a consumer might be able to order a collection of items by preference: she prefers A to B, B to C and C to D, for example. Yet she might not be able to quantify the strength of her preferences. Having techniques for dealing with ranks is then useful.

Working with ranks also gives a useful substitute for the permutation test we derived in the previous chapter: The null distribution of Y for the permutation test depended on the observed responses, but the ranks of N distinct observations are always the integers from 1 to N. Thus if we replace each observation by its rank and then perform a permutation test, the null distribution of the new test statistic depends only on N and n. That lets us tabulate critical values for the test. (This is less important than it once was, because modern computers allow us to simulate the null distribution of the test statistic economically.) Moreover, the normal approximation to the null distribution of the sum of the ranks of the responses of the treated subjects tends to be good, which can be very useful.

In the following exposition, the notation and most of the derivations follow Lehmann (1998).

The Wilcoxon rank-sum test

In the randomization model we have been discussing, let S1, … Sn be the ranks of the n responses of the treatment group among the N responses of all subjects. That is, S1 is the rank of the response of the first treated subject, S2 is the rank of the response of the second treated subject, and so on. Each Sj takes a value between 1 and N, inclusive. Assume for the moment that no two responses are equal, so that the ranks are not ambiguous. Define

Ws ≡ S1 + S2 + S3 + … + Sn.

The statistic Ws is the Wilcoxon rank-sum. We can base a test of the strong null hypothesis on Ws. To test against the alternative that treatment tends to increase responses, we would reject for large values of Ws. To test against the alternative that treatment tends to decrease responses, we would reject for small values of Ws. The critical value of the test is set using the probability distribution of Ws on the assumption that the strong null hypothesis is true. For a level-α test against the alternative that treatment increases responses, we would find the smallest c such that, if the strong null is true, P(Ws ≥ c) ≤ α. We would then reject the strong null if the observed value of Ws is c or greater. Recall that

1 + 2 + … + k = k(k+1)/2.

If the treated subjects have the smallest possible ranks, 1 to n, then

Ws = 1 + 2 + … + n = n(n+1)/2.

If the treated subjects have the largest possible ranks, N−n+1 to N, then

Ws = (N−n+1) + (N−n+2) + … + N

= [(N−n) + 1] + [(N−n) + 2] + … + [(N−n) + n]

= n(N−n) + (1 + 2 + … + n)

= n(N−n) + n(n+1)/2.

All the integers between n(n+1)/2 and n(N−n) + n(n+1)/2 are possible values of Ws. The null distribution of Ws under the strong null hypothesis is symmetric about n(N+1)/2. Thus the expected value of Ws under the null hypothesis is n(N+1)/2. (This also follows from the fact that the expected value of the sample sum of a simple random sample of size n from a box of N numbers is equal to n times the mean of the N numbers; the mean of the integers 1 to N is (N+1)/2.) The variance of Ws under the strong null hypothesis is mn(N+1)/12, where m = N−n is the number of control subjects.
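For small groups these formulas can be verified by enumerating every possible treatment group. Here is a brief R sketch; the choice N = 6, n = 3 is arbitrary:

        N <- 6; n <- 3; m <- N - n
        sums <- colSums(combn(1:N, n))  # rank sums of all possible treatment groups
        mean(sums)                      # 10.5, which is n*(N+1)/2
        mean(sums^2) - mean(sums)^2     # 5.25, which is m*n*(N+1)/12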

Define Wr to be the sum of the control ranks. Because the treatment ranks and control ranks together comprise all the ranks,

Wr + Ws = 1 + 2 + … + N = N(N+1)/2.

Thus,

Wr = N(N+1)/2 − Ws.

It follows that the (null) expected value of Wr is

N(N+1)/2 − n(N+1)/2 = m(N+1)/2,

and that the (null) variance of Wr is also mn(N+1)/12. When n and m are both large, the normal approximations to the null distributions of Wr and of Ws tend to be accurate. We can check this by simulation: draw 10,000 samples to approximate the null probability distribution of Ws, then compare the area under the histogram of the simulated distribution over various intervals with the corresponding area under the normal curve that has the same mean and standard deviation. Varying the population of ranks and the sample size shows how the accuracy of the approximation depends on N and n.
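Here is a minimal R sketch of that check (the values N = 20, n = 8, and 10,000 replications are arbitrary):

        N <- 20; n <- 8; m <- N - n
        ws <- replicate(10000, sum(sample(1:N, n)))       # simulated null distribution of Ws
        mu <- n*(N + 1)/2                                 # null mean of Ws
        sigma <- sqrt(m*n*(N + 1)/12)                     # null SD of Ws
        mean(ws >= mu + sigma)                            # simulated upper-tail probability
        pnorm(mu + sigma, mu, sigma, lower.tail = FALSE)  # normal approximation to that probability

The two tail probabilities should agree roughly; because Ws is integer-valued, a continuity correction would sharpen the comparison.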


The Wilcoxon rank-sum test is exactly what you get if you replace each observation by its rank, then do a permutation test using the sum of the (ranks of the) responses in the treatment group. Thus, the code in the previous chapter can be used to simulate the null distribution, by replacing the raw data with their ranks. However, the normal approximation to the null distribution of Ws can be better than the normal approximation to the null distribution of the sample sum of the responses to treatment, because the ranks are evenly spread out, whereas the raw responses can be highly skewed or multimodal.

Mann-Whitney Statistics

Define WXY ≡ Ws − n(n+1)/2. This subtracts from Ws its minimum possible value. Let WYX ≡ Wr − m(m+1)/2, Wr minus its minimum possible value. Under the strong null hypothesis, the probability distribution of WXY is the same as the probability distribution of WYX, a consequence of the symmetry of the probability distribution of Ws. The statistics WXY and WYX are called the Mann-Whitney statistics.

Let { X1, … , Xm } denote the control responses and let { Y1, … , Yn } denote the treatment responses as before. Consider

#{ (i, j): 1 ≤ i ≤ m, 1 ≤ j ≤ n, and Xi < Yj }.

This is the number of (control, treatment) pairs such that the control response is less than the treatment response.

Let { Sj: j = 1, … , n } be the ranks of the treatment responses as before, and let { Rj: j = 1, … , m } be the ranks of the control responses. Let { S(j): j = 1, … , n } and { R(j): j = 1, … , m } be the corresponding ordered ranks. To count the (control, treatment) pairs for which the control response is less than the treatment response, partition them according to the treatment response in the pair: the total number of such pairs is the number of control responses less than the smallest treatment response, plus the number of control responses less than the second-smallest treatment response, and so on.

How many responses are smaller than the treatment response with rank S(1), the smallest treatment rank? By definition, there are S(1)−1 of them—all of which are control responses. The number of responses smaller than the treatment response with rank S(2) is S(2)−1, one of which is the treatment response with rank S(1), so the number of control responses smaller than it is S(2)−2. In general, the number of control responses smaller than the treatment response with rank S(j) is S(j)−j, so the total number of (control, treatment) pairs with the control response less than the treatment response is

(S(1)−1) + (S(2)−2) + … + (S(n)−n) = Ws − (1 + 2 + … + n) = Ws − n(n+1)/2 = WXY.

Thus

WXY = #{ (i, j): 1 ≤ i ≤ m, 1 ≤ j ≤ n, and Xi < Yj }.

The Mann-Whitney statistic WXY (and hence the Wilcoxon rank sum Ws, which differs from it only by an additive constant) counts the (control, treatment) pairs for which the treatment response exceeds the control response. The larger the positive effect of treatment, the larger the Mann-Whitney and Wilcoxon rank-sum statistics tend to be.
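The identity above can be checked numerically; here is a small R sketch with made-up control and treatment responses (no ties):

        x <- c(1.2, 3.4, 2.2, 5.0)           # made-up control responses
        y <- c(2.9, 6.1, 4.4)                # made-up treatment responses
        m <- length(x); n <- length(y)
        S <- rank(c(x, y))[(m + 1):(m + n)]  # ranks of the treatment responses among all N
        sum(S) - n*(n + 1)/2                 # WXY computed from the rank sum: 9
        sum(outer(x, y, "<"))                # number of pairs with Xi < Yj: also 9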

Tied observations

If the control and treatment responses are random samples from continuous distributions, the chance of ties among the data is zero. However, ties can and do occur in practice, if only because of limits on the precision with which data can be recorded. When there are tied observations, the ranks are not uniquely defined. We can patch the rank-sum approach by assigning each set of tied observations the mid-rank of the set. For example, if the sorted data are {1, 2, 3, 4, 4}, the mid-rank of the last two observations (tied at 4) would be 4.5. If the sorted data were {1, 2, 3, 3, 3, 4, 4, 5}, the corresponding mid-ranks would be 1, 2, 4, 4, 4, 6.5, 6.5, and 8. The Wilcoxon rank-sum statistic generalized to use mid-ranks is denoted Ws*. Once there are ties, the null distribution of the rank sum depends on the observations: which mid-ranks are represented. When there are ties, the normal approximation to the null probability distribution of the rank sum tends to be worse, because there are fewer possible values of the rank sum (the null probability distribution is "chunkier").

Consider the mid-ranks of the data set {1, 2, 3, 4, 4}, namely {1, 2, 3, 4.5, 4.5}, with N=5 and n=2. Simulating or enumerating the null distribution of the rank sum shows that the distribution is skewed in this case. Compare it with the null distribution of the rank sum for {1, 2, 3, 4, 5}, which would be the ranks if there were no ties in the data: that distribution is symmetric. The same comparison can be made for the mid-ranks 1, 2, 4, 4, 4, 6.5, 6.5, and 8 versus the ranks 1 through 8, with sample size n=4, to check the accuracy of the normal approximation in both cases.
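Because there are only 5C2 = 10 equally likely samples in the first case, the comparison can be made exactly with a few lines of R (a sketch):

        midranks <- c(1, 2, 3, 4.5, 4.5)   # mid-ranks of the data {1, 2, 3, 4, 4}
        plain <- 1:5                       # the ranks if there were no ties
        table(colSums(combn(midranks, 2)))/choose(5, 2)  # skewed, with gaps ("chunky")
        table(colSums(combn(plain, 2)))/choose(5, 2)     # symmetric about n(N+1)/2 = 6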


Calculating midranks seems to involve a number of steps: sort the observations, identify ties, and average the ranks of each group of ties. An alternative point of view makes the computation much simpler. If an observation is not tied, its rank is equal to the number of observations that are less than or equal to it. That is, suppose the data are z1, … , zN. If the multiplicity of the value zk is one, then

rank(zk) = #{ j: zj ≤ zk }.

On the other hand, if there are M(zk) observations in all whose values are equal to zk, then the midrank of zk is

midrank(zk) = #{ j: zj ≤ zk } − (M(zk)−1)/2.

We can calculate M(zk) by

M(zk) = #{ j: zj ≤ zk } − #{ j: zj < zk }.

Thus

midrank(zk) = ( #{ j: zj ≤ zk } + #{ j: zj < zk } + 1 )/2.

Here is a Matlab function to compute the midranks of a list:

        function r = midRanks(x)
        % function r = midRanks(x)
        % P.B. Stark, statistics.berkeley.edu/~stark  2/17/2003
        % vector of midranks of a vector x.
        % midrank is (#<=) - ((#=) - 1)/2 = ((#<=) + (#<) + 1)/2
            for j = 1:length(x),
                r(j) = sum(x <= x(j)) - (sum(x == x(j)) - 1)/2;
            end;
        return;
    

Here is an R function to do the same thing:

        midRanks <- function(x) {
        # P.B. Stark, statistics.berkeley.edu/~stark  2/5/2008
        # vector of midranks of a vector x.
            mr <- array(0,length(x));
            for (j in 1:length(x)) {
                mr[j] <- sum(x <= x[j]) - (sum(x == x[j]) - 1)/2;
            }
            mr
        }
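
As a quick check, base R's built-in rank function produces the same values, because its default ties.method is "average":

        z <- c(1, 2, 3, 3, 3, 4, 4, 5)
        midRanks(z)   # 1.0 2.0 4.0 4.0 4.0 6.5 6.5 8.0
        rank(z)       # identical; rank() averages the ranks of tied observations by default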
    

Here is a Matlab function to simulate the distribution of the sample sum of a simple random sample of size n from a population of size N.

        function dist = simSampleSum(z, n, iter)
        % function dist = simSampleSum(z, n, iter)
        % P.B. Stark, statistics.berkeley.edu/~stark  2/17/2003
        % Simulate the sampling distribution of the sum of n of the elements of z
            dist = zeros(iter,1);
            N = length(z);
            if (n > N)
                disp(['error in simSampleSum: sample size exceeds population size']);
                return;
            elseif (n == N)
                dist = dist + sum(z);    % constant random variable
            elseif (n == 0)
                return;                  % zero
            else
                for i=1:iter
                    zp = z(randperm(length(z)));  % random permutation of data
                    dist(i) = sum(zp(1:n));       % add the first n
                end;
            end;
        return;
    

Here is a Matlab function to simulate the null distribution of the Wilcoxon rank sum using midranks.

        function dist = simWilcox(x, y, iter)
        % function dist = simWilcox(x, y, iter)
        % P.B. Stark, statistics.berkeley.edu/~stark  2/17/2003
        %  simulates the null distribution of the Wilcoxon rank sum using midranks,
        %  for data x (control) and y (treatment) using iter pseudorandom samples
            z = midRanks([x, y]);
            n = length(y);
            dist = simSampleSum(z, n, iter);
        return;
    

Here's an example of using the Matlab functions just defined to approximate the 1-sided (upper tail) P-value for the Wilcoxon rank sum statistic with midranks:

        iter = 10000;
        zr = midRanks([x y]);  % the control responses are x; the treatment
                               % responses are y. zr now has their midranks.
        dist = simWilcox(x, y, iter);
        pVal = sum(dist >= sum(zr((length(x)+1):length(zr))))/iter;
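
Here is the corresponding computation in R, using the midRanks function defined above (a sketch; it assumes the control responses are in x and the treatment responses in y):

        iter <- 10000
        zr <- midRanks(c(x, y))                      # midranks of the pooled responses
        n <- length(y)
        obs <- sum(zr[(length(x) + 1):length(zr)])   # observed treatment rank sum
        dist <- replicate(iter, sum(sample(zr, n)))  # simulated null distribution of the rank sum
        pVal <- sum(dist >= obs)/iter                # one-sided (upper-tail) P-value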
    

The Siegel-Tukey Test

When the alternative is that the dispersion of the control responses differs from the dispersion of the treatment responses, the Wilcoxon rank-sum statistic has poor power. We could imagine a permutation test based on the inter-quartile range or the standard deviation of the responses in the treatment group; under the strong null hypothesis, we could find the distribution of that test statistic by calculation or simulation. The randomization, in which every subset of n of the N responses is equally likely to be the set of treatment responses, makes such calculations and tests straightforward, at least conceptually.

There is a clever way to relabel the data that allows us to use the calculations and tables for the Wilcoxon rank-sum statistic to test against the dispersion alternative—we just redefine what we mean by "rank." Suppose that the alternative is that the treatment decreases the dispersion of the responses around some common measure of location (such as the median). Then responses far from the middle should be less likely to occur in the treatment group than in the control group. Label the smallest observation "1", the largest "2", the second-smallest "3", the second-largest "4", and so on, ignoring the possibility of ties for the moment. Consider the sum T of the n labels of the treated subjects. The null distribution of T is the same as the null distribution of the Wilcoxon rank-sum statistic Ws, and under the alternative that treatment decreases dispersion, T tends to be large. Therefore, rejecting the strong null hypothesis when

T > c

where c is the appropriate quantile of the null distribution of the Wilcoxon rank-sum statistic, is a reasonable test against this alternative. This is the Siegel-Tukey test.

However, labeling the largest observation "1", the smallest "2", the second-largest "3", the second-smallest "4", and so on, leads to a test statistic with the same distribution under the strong null hypothesis, but one that typically gives a different P-value for a given set of data. A more symmetric choice is to assign the smallest and largest observations the label 1.5, the second-smallest and second-largest the label 3.5, the third-smallest and third-largest 5.5, and so on. Ties can be treated as they are in the generalization of the Wilcoxon rank-sum statistic Ws*, by using "mid-labels" (just like mid-ranks). To find significance levels or P-values when N is large, it is sometimes necessary to resort to simulation and settle for an approximation.
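Here is a small R sketch that computes the labels described above by alternating between the two ends of the sorted data; the function name siegelTukeyLabels is just illustrative, and ties are ignored.

        siegelTukeyLabels <- function(z) {
        # label 1 for the smallest observation, 2 for the largest, 3 for the
        # second-smallest, 4 for the second-largest, and so on (no ties assumed)
            N <- length(z)
            ord <- order(z)     # positions of the observations, from smallest to largest
            lab <- numeric(N)
            lo <- 1; hi <- N; nextLabel <- 1
            while (lo <= hi) {
                lab[ord[lo]] <- nextLabel   # label the smallest remaining observation
                nextLabel <- nextLabel + 1
                lo <- lo + 1
                if (lo <= hi) {
                    lab[ord[hi]] <- nextLabel   # label the largest remaining observation
                    nextLabel <- nextLabel + 1
                    hi <- hi - 1
                }
            }
            lab
        }

        siegelTukeyLabels(c(10, 3, 7, 1, 12))   # 4 3 5 1 2

The Siegel-Tukey statistic T is then the sum of the labels of the treated subjects; its null distribution can be simulated by permuting the labels, just as for the rank sum.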

The Siegel-Tukey test tacitly assumes that the treatment responses and the control responses are scattered about a common typical value—that is, that there is no shift difference between the two groups. If there is a known shift between control and treatment, it could be subtracted before ranking the data. If the shift is unknown, it could be estimated from the data, but then the nominal significance level of the test will not be its real level. For large samples, the difference might be unimportant.

The Smirnov Test

This material is from Lehmann (1998).

We have considered shift and dispersion alternatives; now we move to the omnibus alternative. The Smirnov test is based on the difference between the empirical cumulative distribution function (cdf) of the treatment responses and the empirical cdf of the control responses. It has some power against all kinds of violations of the strong null hypothesis. However, it has less power than Ws against the alternatives for which the Wilcoxon rank-sum test is designed. Let FY,n denote the empirical cdf of the treatment responses:

FY,n(y) ≡ #{ yj: yj ≤ y }/n,

and define FX,m analogously as the empirical cdf of the control responses. If there are no ties within the treatment group, FY,n jumps by 1/n at each treatment response value. If there are no ties within the control group, FX,m jumps by 1/m at each control response value. If k of the treatment responses are tied and equal to y0, then FY,n jumps by k/n at y0. The Smirnov test statistic is

Dm,n ≡ sup |FY,n(y) − FX,m(y) |,

where the supremum is over all real values of y. It is easy to see that the supremum is attained at one of the data values. We can also see that the supremum depends only on the ranks of the data, because the order of the jumps matters, but the precise values of y at which the jumps occur do not matter. Therefore, the test

Reject if Dm,n > c,

for an appropriately chosen value of c, is a nonparametric test of the strong null hypothesis.

Let's calculate the null distribution of Dm,n for n=3, m=2. There are 5C3 = 10 possible assignments of 3 of the 5 subjects to treatment, each of which has probability 1/10 under the strong null hypothesis. Assume the 5 data are distinct (no ties). Since Dm,n depends only on the ranks, we may take the data to be the ranks 1 through 5 and compute Dm,n for each of the 10 possible treatment groups.

Doing so shows that, under the strong null hypothesis, D2,3 takes the value 1/3 with probability 1/10, the value 1/2 with probability 3/10, the value 2/3 with probability 4/10, and the value 1 with probability 2/10.

Thus a Smirnov test (against the omnibus alternative) with significance level 0.2 would reject the strong null hypothesis when D2,3=1; smaller significance levels are not attainable when n and m are so small. Critical values can be calculated analytically fairly easily when n=m (when the treatment and control groups are the same size), for critical values of the form a/n with a a positive integer. Let k = ⌊n/a⌋ be the largest integer such that n − ka ≥ 0. Then, under the strong null hypothesis, provided there are no ties,

P(Dn,n ≥ a/n) = 2[ C(2n, n−a) − C(2n, n−2a) + C(2n, n−3a) − … + (−1)^(k−1) C(2n, n−ka) ]/C(2n, n),

where C(N, j) denotes the binomial coefficient "N choose j."

When there are ties, Dm,n tends to be smaller, so tail probabilities from this expression still give conservative tests. There are also limit theorems that give asymptotic approximations; see Lehmann (1998, pp. 37ff., 421). When n is not equal to m, the calculations are harder and the asymptotics change. Of course, we can always approximate the null distribution of Dm,n by simulation.

Here is a Matlab m-file to calculate the Smirnov statistic. Let x=(x1, … , xm) denote the vector of control responses and let y=(y1, … , yn) denote the vector of treatment responses.

        function s = smirnov(x, y)
        % function s = smirnov(x, y)
        % P.B. Stark, statistics.berkeley.edu/~stark  2/17/2003
        % calculates the Smirnov distance between two vectors of data
            m = length(x);
            n = length(y);
            z = [x y];
            s = 0;
            for j=1:m+n,
                s = max(s, abs(sum(x <= z(j))/m - sum(y <= z(j))/n));
            end;
        return;
    

The following Matlab function simulates the null distribution of the Smirnov statistic. Again, the data are x and y.

        function dist = simSmir(x, y, iter)
        % function dist = simSmir(x, y, iter)
        % P.B. Stark, statistics.berkeley.edu/~stark  2/17/2003
        %  simulates the null distribution of the Smirnov distance for data x and y
        %  using iter pseudorandom samples
            dist = zeros(iter,1);
            m = length(x);
            n = length(y);
            z = [x y];                                  % pool the N = m+n observations
            for i=1:iter
                zp = z(randperm(length(z)));            % random permutation of the data
                dist(i) = smirnov(zp(1:m),zp(m+1:m+n)); % first m control, last n treatment
            end;
        return;
    

Using these functions, you can find an approximate P-value for the Smirnov test using the following script, which assumes that the data are already in the vectors x and y and that iter has been set (to a number like 10,000).

        testVal = smirnov(x, y);
        simDist = simSmir(x, y, iter);
        pValue = sum(simDist >= testVal)/iter;
    

Here are versions of the functions in R:

    smirnov <- function(x, y) {
    # P.B. Stark, statistics.berkeley.edu/~stark  9/8/2006
		z <- c(x,y);
		s <- 0;
		for (j in 1:length(z)) {
			s <- max(s, abs(sum(x <= z[j])/length(x) - sum(y <= z[j])/length(y)));
		}
		s
    }
    

    simSmir <- function(x, y, iter) {
    # P.B. Stark, statistics.berkeley.edu/~stark  9/8/2006
		dist <- array(0, iter);
		z <- c(x,y);
		for (i in 1:iter) {
			zp = sample(z);
			dist[i] <- smirnov(zp[1:length(x)],zp[(length(x)+1):length(z)]);
		}
		dist
     }
     

        testVal <- smirnov(x, y);
        simDist <- simSmir(x, y, iter);
        pValue <- sum(simDist >= testVal)/iter;
    

Here is a more cryptic "seven at one blow" version:

    simSmirnovTest <- function(x, y, iter) {
    # P.B. Stark, statistics.berkeley.edu/~stark  9/8/2006
        ts <- smirnov(x,y)
        z <- c(x, y)                 # pooled responses
        sum(replicate(iter, {
                              zp <- sample(z);
                              (smirnov(zp[1:length(x)],zp[(length(x)+1):length(z)]) >= ts)
                            }
                     )
            )/iter
    }
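
As a numerical check on the small-sample calculation and the exact formula above, the following R sketch enumerates the null distribution of Dm,n for small groups, assuming no ties, using the smirnov function defined above; the function name smirnovNullDist is just illustrative.

        smirnovNullDist <- function(m, n) {
        # exact null distribution of the Smirnov statistic for group sizes m and n;
        # with no ties, the data can be taken to be the ranks 1, ..., m+n
            ranks <- 1:(m + n)
            treat <- combn(m + n, n)     # every possible treatment group
            D <- apply(treat, 2, function(s) smirnov(ranks[-s], ranks[s]))
            table(D)/ncol(treat)         # null probability of each attainable value
        }

        smirnovNullDist(2, 3)   # values 1/3, 1/2, 2/3, 1 with probabilities 0.1, 0.3, 0.4, 0.2

        # compare the exact formula for n = m with the enumeration, e.g. n = 4, a = 2
        n <- 4; a <- 2; j <- 1:(n %/% a)
        2*sum((-1)^(j - 1)*choose(2*n, n - j*a))/choose(2*n, n)   # P(D >= a/n) from the formula
        d44 <- smirnovNullDist(4, 4)
        sum(d44[as.numeric(names(d44)) >= a/n])                   # the same probability by enumeration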
    

We can also estimate the power of the Smirnov statistic against a shift alternative by simulation. Suppose that the effect of treatment is to increase each subject's response by d, no matter what the response would have been in control. (Remember, this is an absurdly restrictive assumption, but the results might be suggestive anyway.) Then the responses y=(y1, … yn) of the treated subjects would have been

y−d = (y1−d, …, yn−d)

had they been in the control group, and the responses x of the control group would have been x+d had they been in the treatment group. The following Matlab function simulates the distribution of the Smirnov statistic under the shift alternative. It assumes that x, y and d are already defined.

        function altD = simSmirShift(x, y, d, iter)
        % function altD = simSmirShift(x, y, d, iter)
        % P.B. Stark, statistics.berkeley.edu/~stark  2/17/2003
        % simulates the distribution of the Smirnov statistic under the
        % shift alternative.  x and y are the data, d is the shift.
        % Uses iter pseudorandom samples.
            altD = zeros(iter,1);
            m = length(x);
            n = length(y);
            z = [x y-d];                                   % pool observations. Under the shift
                                                           % alternative, treatment effect is d
            for i=1:iter
                zp = z(randperm(length(z)));               % random permutation of data
                altD(i) = smirnov(zp(1:m),zp(m+1:m+n)+d);  % first m control; last n treatment
                                                           % treatment is shifted by d
            end;
        return;
    

Note that this reduces to the null distribution when d=0, so we really don't need the function simSmir—we can just call simSmirShift with d=0. To find the power of the test, we need to remember that the shift can change the number of ties, depending on which subjects are assigned to treatment and which to control. Thus, for each assignment, we need (in principle) to calculate a new critical value for the test. The following Matlab function does that.

        function p = simSmirPower(x, y, d, alpha, iter)
        % function p = simSmirPower(x, y, d, alpha, iter)
        %
        % P.B. Stark, statistics.berkeley.edu/~stark 3/3/03
        % finds the approximate power of a level alpha Smirnov
        % test against the alternative hypothesis that treatment
        % shifts the response by exactly d for all subjects.
        %
        % x is the vector of control observations
        % y is the vector of treatment observations
        % the lengths of x and y determine the sizes of the control and treatment
        %  groups
        % alpha is the significance level
        % iter is the number of replications of each simulation--note: there are
        %  two nested simulations.
            p = 0;
            z = [x y-d];
            for i=1:iter
                dist = zeros(1,iter);
                zp = z(randperm(length(z)));
                xp = zp(1:length(x));
                yp = zp(length(x)+1:length(x)+length(y)) + d;
                dist = simSmirShift(xp, yp, 0, iter);
                if (sum(dist >= smirnov(xp, yp))/iter <= alpha)
                    p = p+1;
                end;
            end;
            p = p/iter;
        return;
    

The following Matlab script finds a critical value for a conservative level-α test, the P-value, and the power of the test against a shift alternative with shift d. It assumes that x, y, alpha, d, and iter are already defined. It uses the function upperCriticalVal(), whose definition follows.

        testVal = smirnov(x, y);                    % value of the test statistic
        simDist = simSmirShift(x, y, 0, iter);      % simulated null distribution
        critVal = upperCriticalVal(simDist, alpha); % find critical value for the test
        pValue = sum(simDist >= testVal)/iter;      % approximate p-value
        power = simSmirPower(x, y, d, alpha, iter); % estimated power
    

        function cv = upperCriticalVal(list, alpha)
        % function cv = upperCriticalVal(list, alpha)
        %
        % P.B. Stark statistics.berkeley.edu/~stark  2/22/08
        % Finds a value cv in the list such that at most a
        %   fraction alpha of elements of the list are greater
        %   than or equal to cv
            n = length(list);
            if (alpha <= sum(list == max(list))/n | alpha > 1)
                disp(['Error in upperCriticalVal: unattainable fraction ' num2str(alpha)]);
                cv = NaN;
            elseif (alpha == 1)
                cv = min(list);
            else
                list = sort(list);
                trialInx = ceil(n*(1-alpha));             % first guess at critical value,
                                                          % but there could be duplicates.
                cv = list(trialInx);                      % first guess at critical value
                sigAttain = sum(list >= cv)/n; % attained estimated significance
                while (sigAttain > alpha & trialInx < n)   % for conservative level-alpha test
                    trialInx = trialInx + 1;
                    cv = list(trialInx);
                    sigAttain = sum(list >= cv)/n;
                end;
            end;
        return;