Tests of Randomness and Independence

The null hypothesis; models

Suppose a treatment has N ordered levels, and one response is observed per treatment level, with N subjects assigned at random, one to each treatment level. Let Tj denote the rank (among all N responses) of the response of the subject assigned to the treatment level with rank j. Then we have a set of pairs of ranks of treatments and responses:

{(j, Tj): j= 1, 2, … N}

There are N! possible sets of pairs, depending on which treatment rank goes with which response rank. If treatment has no effect, all N! pairings are equally likely; each has probability 1/N!. This is the "Hypothesis of Randomness."

Now consider drawing a random sample of N subjects from a population, and assigning one to each of the N treatment levels. Let Zj denote the response of the subject assigned treatment j.

P(Zj ≤ z) = Fj(z), j = 1, 2, … N.

If treatment has no effect,

F1 = F2 = … = FN.

If the size of the population is much much larger than N, we can ignore the dependence among the subjects. That leads to the population model, in which

{Zj: j= 1, 2, …  N}

are independent. Under the null hypothesis, they are thus iid, and so the null distribution of the pairs {(j, Tj)} is the same as it is for the hypothesis of randomness.

Consider a third statistical model: we make two measurements (Xj, Yj) on each of a set of N subjects. Under the null hypothesis, the Xs are identically distributed, the Ys are identically distributed, and the measurements are independent. This also leads to the same joint distribution for the pairs {(j, Tj)}, where Tj is the rank of the Y measurement corresponding to the X measurement with rank j. The key here is that one set of measurements is exchangeable given the other.

Tests for Trend

We are going to examine tests for a trend: the alternative hypothesis is that increasing levels of treatment are associated with increasing (or decreasing) levels of response.

The Spearman rank correlation & related statistics

The first statistic we consider counts the pairs (Ti,Tj) with i<j for which Ti<Tj. Let

Uij ≡ {1, Ti<Tj; 0, Ti≥Tj}

and

B ≡ ∑i<j Uij.

Then B tends to be large when larger responses are associated with larger treatment indices j. A more common test statistic is D':

D' ≡ ∑i<j (j−i) Uij.

This puts more weight on larger differences (j − i). Now,

D' = (1/6) N(N² − 1) − ½ ∑i=1N (Ti − i)².

Thus a test based on D' would reject when

D ≡ ∑i=1N (Ti − i)²

is small. Note that D takes only even values:

D = ∑i=1N (Ti² − 2iTi + i²)

= 2 [ ∑i=1N i² − ∑i=1N iTi ]

= [N(N+1)(2N+1)/3] − 2∑i=1N iTi.

Thus to reject when D is small is to reject when ∑i=1N iTi is big. Equivalently, it is to reject when ∑i=1N RiSi is big, where Ri is the rank of the ith treatment and Si is the rank of the ith response (here Ri = i and Si = Ti). Let

r ≡ (1/N) ∑i=1N Ri = (N+1)/2 and s ≡ (1/N) ∑i=1N Si = (N+1)/2,

so

(1/N) ∑i(Ri − r)² = (1/N) ∑i(Si − s)² = (1/N) (N³−N)/12 = (N²−1)/12.

Spearman's rank correlation is

rS ≡ [(1/N) ∑i (Ri − r)(Si − s)]/[ √[(1/N)∑i(Ri−r)²] √[(1/N)∑i(Si−s)²] ]

= ∑i [RiSi − r Si − Ri s + rs]/[(N³−N)/12]

= [12/(N³−N)] [ ∑i RiSi − N(N+1)²/4 ].

A bit more algebra shows that

rS = 1 − 6D/(N³ − N).
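
Here is a quick numerical check of this identity (a sketch; the choice N = 15 is arbitrary, and the check uses base R's cor with method="spearman"):

        N <- 15
        Tj <- sample(N)                       # response ranks paired with treatment ranks 1, ..., N
        D <- sum((Tj - 1:N)^2)
        c(1 - 6*D/(N^3 - N),
          cor(1:N, Tj, method="spearman"))    # the two numbers should agree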

If there are ties, we can define analogous test statistics using midranks instead of ranks. Note that if the null hypothesis is true and there are no ties, ErS = 0, and

rS √[(N−2)/(1−rS²)] → tN−2,

where the limit is in distribution. This approximation is pretty good for N≥10 or so. Alternatively, one can simulate critical values for a test based on Spearman's rS by computing the usual correlation coefficient between {1, … N} and random permutations of {1, … N}.
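
Here is a sketch of that simulation (the choices N = 10 and 10,000 replications are arbitrary); the 95th percentile of the simulated values is an approximate critical value for a one-sided 5% test against positive association:

        N <- 10
        reps <- 10000
        rs <- replicate(reps, cor(1:N, sample(N)))   # Pearson r of ranks = r_S when there are no ties
        quantile(rs, 0.95)                           # approximate one-sided 5% critical value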

For a bivariate normal distribution of (X,Y), the Pitman efficiency of rS relative to the usual Pearson correlation coefficient is (3/π)² ≈ 0.912.

If there is serial correlation among the observations, the null distribution is quite different; see the discussion of time series below.

One alternative to Spearman's rank correlation coefficient is to use Pearson's correlation coefficient, but to compute critical values by simulation, using random permutations of the responses {Yj} while holding the treatments {Xj} fixed.
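
A minimal sketch of such a test (the function name permCorTest and the default of 10,000 permutations are illustrative choices, not part of the original notes):

        permCorTest <- function(x, y, iter=10000) {
            # permutation test based on Pearson's correlation: hold the
            #  treatments x fixed and permute the responses y.
            obs <- cor(x, y)
            perm <- replicate(iter, cor(x, sample(y)))
            mean(perm >= obs)    # one-sided P-value against positive association
        }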

The run test

For many alternatives, high values tend to cluster and low values tend to cluster, leading to "runs." We can use this to test the hypothesis of randomness by looking at the lengths and numbers of runs. For example, suppose we toss a (possibly biased) coin N times. We can think of this as N trials. We consider the outcome of the jth trial to be 1 if the coin lands heads on the jth toss and 0 otherwise. Under the null hypothesis, the N outcomes are iid Bernoulli(p) random variables, where p is the chance that the coin lands heads.

Let R denote the number of runs. For example, the sequence HHTHTTTHTH has 7 runs: HH, T, H, TTT, H, T and H. Condition on the number n of heads among the tosses. If the null hypothesis is true, each arrangement of the n heads among the N tosses has probability 1/(NCn). We will compute the (conditional) probability distribution of R given n, under the null hypothesis of independence.

Clearly, if n=N or if n = 0, R ≡ 1. If 0 < n < N, the smallest possible value of R is 2, and there are only two possibilities with exactly two runs: first all the heads, then all the tails, or first all the tails, then all the heads. I.e., the sequence is either

(HH … HTT … T) or (TT … THH … H),

where there are n Hs and m ≡ N−n Ts in all. The probability that R = 2 is thus 2/NCn, if the null hypothesis is true.

How can R be even, i.e., R = 2k? If the sequence starts with a head, we need to choose where to break the sequence of heads to insert tails, then where to break that sequence of tails to insert heads, etc. If the sequence starts with a tail, we need to choose where to break the sequence of tails to insert heads, then where to break that sequence of heads to insert tails, etc. We need to break the n heads into k groups, which means picking k − 1 breakpoints, but the first breakpoint needs to come after the first H, and the last breakpoint needs to come before the nth H, so there are only n − 1 places those k − 1 breakpoints can be. And we need to break the m tails into k groups, which means picking k − 1 breakpoints, but the first needs to be after the first T and the last needs to be before the mth T, so there are only m − 1 places those k − 1 breakpoints can be. The number of sequences with R = 2k that start with H is thus

n−1Ck−1 × m−1Ck−1.

The number of sequences with R = 2k that start with T is the same (just read right-to-left instead of left-to-right). Thus, if the null hypothesis is true,

P(R = 2k) = 2 × n−1Ck−1 × m−1Ck−1/NCn.

Now consider how we can have R = 2k+1. Either the sequence starts and ends with H or it starts and ends with T. Suppose it starts with H. Then we need to break the string of n heads in k places to form k + 1 groups using k groups of tails formed by breaking the m tails in k−1 places. If the sequence starts with T, we need to break the m tails in k places to form k + 1 groups using k groups of heads formed by breaking the n heads in k−1 places. Thus, under the null hypothesis,

P(R = 2k+1) = [ n−1Ck × m−1Ck−1 + n−1Ck−1 × m−1Ck ]/NCn.

Note that nothing in this derivation used the probability of heads. The conditional distribution under the null hypothesis depends only on the fact that the tosses are iid, not that the coin is fair.

Let Ij be the indicator of the event that the outcome of the j+1st toss differs from the outcome of the jth toss, j=1, …, N−1. Then

R = 1+ ∑j=1N−1 Ij.

Under the null, conditional on n,

P(Ij = 1) = P(Ij = 1 | jth toss lands H, n)P(jth toss lands H | n) + P(Ij = 1 | jth toss lands T, n)P(jth toss lands T | n)

= P(j+1st toss lands T | jth toss lands H, n)P(jth toss lands H | n) + P(j+1st toss lands H | jth toss lands T, n)P(jth toss lands T | n)

= [m/(N−1)]×[n/N] + [n/(N−1)]×[m/N] = 2nm/[N(N−1)].

The indicators {Ij} are identically distributed under the null hypothesis, so if the null holds,

ER = E[ 1 + ∑j=1N−1 Ij] = 1 + (N−1) × 2nm/[N(N−1)] = 1 + 2nm/N.

As an example, suppose air temperature is measured at noon in a climate-controlled room for 20 days in a row. We want to test the null hypothesis that temperatures on different days are independent and identically distributed.

Let Tj be the temperature on day j, j = 1, …, 20. If the measurements were iid, whether each day's temperature is above or below a given temperature t is like a toss of a possibly biased coin, with tosses on different days independent of each other. We could consider a temperature above t to be a head and a temperature below t to be a tail.

Let's take t to be the median of the 20 measurements. In this example, n=10, m=10, N=20. We will suppose that there are no ties among the measured temperatures. Under the null hypothesis, the expected number of runs is

ER = 1+2mn/N = 11.

The minimum possible number of runs is 2 and the maximum is 20. Since we expect temperature on successive days to have positive serial correlation (think about it!), we might expect to see fewer runs than we would if temperatures on different days were independent. So, let's do a one-sided test that rejects if there are too few runs. We will aim for a test at significance level 5%.

P(R = 2) = 2/20C10 = 1.082509e-05.

P(R = 3) = 2×9C1×9C0/20C10 = 9.74258e-05.

P(R = 4) = 2×9C1×9C1/20C10 = 8.768321e-04.

P(R = 5) = 2×9C2×9C1/20C10 = 0.003507329.

P(R = 6) = 2×9C2×9C2/20C10 = 0.01402931.

P(R = 7) = 2×9C3×9C2/20C10 = 0.03273507.

P(R ≤ 6) = 2×(1+9+81+324+1296)/20C10 = 3422/20C10 ≈ 0.0185.

So, we should reject the null hypothesis if R ≤ 6, which gives a significance level of 1.9%. Including 7 in the rejection region would make the significance level slightly too big: 5.1%.
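
The exact probabilities above can be computed directly from the formulas for P(R = 2k) and P(R = 2k+1); here is a sketch (the function name exactRunsProb is just illustrative):

        exactRunsProb <- function(r, n, m) {
            # null probability of exactly r runs, given n heads and m tails
            if (r %% 2 == 0) {
                k <- r/2
                2*choose(n-1, k-1)*choose(m-1, k-1)/choose(n+m, n)
            } else {
                k <- (r-1)/2
                (choose(n-1, k)*choose(m-1, k-1) +
                 choose(n-1, k-1)*choose(m-1, k))/choose(n+m, n)
            }
        }
        sum(sapply(2:6, exactRunsProb, n=10, m=10))   # P(R <= 6), about 0.0185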

When N, n and m are large, the combinatorics can be difficult to evaluate numerically. There are at least two options: asymptotic approximation and simulation. There is a normal approximation to the null distribution of R. As n and m→∞ and m/n→γ,

[R − 2m/(1+γ)]/√[4γm/(1+γ)³] → N(0, 1)

in distribution.
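
As a sketch of the mechanics, here is the approximation applied to the temperature example (n = m = 10, so γ = 1, and R ≤ 6; no continuity correction). With n and m this small the approximation is rough compared with the exact 1.9%:

        n <- 10; m <- 10; r <- 6
        gam <- m/n
        pnorm((r - 2*m/(1 + gam))/sqrt(4*gam*m/(1 + gam)^3))   # approximate P(R <= 6)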

Here is an R function to simulate the null distribution of the number R of runs, and evaluate the P-value of the observed value of R conditional on n, for a one-sided test against the alternative that the distribution produces fewer runs than independent trials would tend to produce. The input is a vector of length N; each element is equal to either "1" (for heads) or "-1" (for tails). The test statistic is calculated by finding 1 + ∑j=1N−1 Ij, as we did above in finding ER.

        simRunTest <- function(x, iter) {
            # simulated P-value of the one-sided run test, which rejects
            #  when there are too few runs.
            # x: vector of 1s (heads) and -1s (tails); iter: number of
            #  random permutations to draw.
            N <- length(x);
            ts <- 1 + sum(x != c(x[2:N],x[N]));   # test statistic: R = 1 plus
                                                  #  the number of transitions I_j.
            sum(replicate(iter, {
                                  xp <- sample(x);
                                  ((1 + sum(xp != c(xp[2:N],xp[N]))) <= ts)
                                }
                         )
               )/iter                             # fraction of permutations with
                                                  #  no more runs than observed
        }
    

As an example, suppose the observed sequence is x = (-1, -1, 1, 1, 1, -1, 1), for which N = 7, n = 4, m = 3 and R = 4. In one trial with iter = 10,000, the simulated P-value using simRunTest was 0.5449. Exact calculation gives:

P0(R ≤ 4) = (2 + 5 + 12)/35 = 19/35 ≈ 0.5429.

The standard error of the estimated probability is thus

SE = √[(19/35 × 16/35)/10000] ≈ 0.005.

The simulation was off by about (0.5449 − 0.5429)/0.005 ≈ 0.4 SE.

Tests for independence between time series

The crucial element of the null hypothesis that leads to the null distribution of the Spearman rank correlation is that one of the sets of measurements is conditionally exchangeable given the other. Then, if the null hypothesis is true, all pairings (Xi, Yj) are equally likely. However, time series data often have serial correlation, such as trends. Suppose we have two independent time series, each of which has a trend. For example, we might observe the pairs (Xj,Yj)j=1N, where

Xj = j + εj, j=1, … N,

Yj = j + νj, j=1, … N,

and the 2N "noise" terms {εj} and {νj} are iid with zero mean and finite variance. Even though the time series {Xj} and {Yj} are independent, neither {Xj} nor {Yj} is exchangeable. As a result, the Spearman rank correlation test will tend to reject the hypothesis of independence, not because the two sets of measurements are dependent, but because a different feature of the null hypothesis is false. The trend makes pairings (Xi, Yj) with both X and Y larger than average or both smaller than average more likely than pairings where one is relatively large and the other is relatively small.
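
A sketch of a simulation of this situation (N = 50, noise SD 1, nominal level 5%, and 1,000 replications are all arbitrary illustrative choices; cor.test with method="spearman" gives the P-values) shows the rejection rate far exceeding the nominal level even though the series are independent:

        N <- 50
        reps <- 1000
        pvals <- replicate(reps, {
                             x <- 1:N + rnorm(N)
                             y <- 1:N + rnorm(N)
                             cor.test(x, y, method="spearman",
                                      alternative="greater")$p.value
                           })
        mean(pvals < 0.05)   # rejection rate: far above 5%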

Walther's Examples

The following examples are from Walther (1997, 1999). Suppose that {Xj: j = 1, …, 100} and {Yj: j = 1, …, 100} are iid N(0,1). Define

Sk ≡ ∑j=1k Xj and Tk ≡ ∑j=1k Yj, k = 1, …, 100.

Then

P(rS(S, T) > c0.01) ≈ 0.67,

where c0.01 is the critical value for a one-sided level 0.01 test against the alternative of positive association. That is, even though the two series are independent, the probability that the Spearman rank correlation coefficient exceeds the 0.01 critical value for the test is over 2/3. That is because the two series S and T each have serial correlation, so not all pairings (Si, Tj) are equally likely—even though the two series are independent.
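
A sketch of that calculation (10,000 replications is an arbitrary choice; the critical value is itself estimated by simulating the null distribution of rS for N = 100, as above):

        N <- 100
        reps <- 10000
        rsNull <- replicate(reps, cor(1:N, sample(N)))   # null distribution of r_S
        c01 <- quantile(rsNull, 0.99)                    # one-sided 0.01 critical value
        hits <- replicate(reps, {
                            S <- cumsum(rnorm(N))        # partial sums of the X's
                            Tk <- cumsum(rnorm(N))       # partial sums of the Y's
                            cor(S, Tk, method="spearman") > c01
                          })
        mean(hits)   # should be in the vicinity of 0.67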

Serial correlation is not the only way that exchangeability can break down. For example, if the mean or the noise level varies with time, that violates the null hypothesis. Here is an example of the latter. Take X = (1, 2, 3, 4) to be fixed. Let (Y1, Y2, Y3, Y4) be independent, jointly Gaussian with zero mean, SD(Yj) = 1, j = 1, 2, 3, and SD(Y4) = 4. Then

P0(rS(X, Y) = 1) = 1/4! = 1/24 ≈ 4.17%.

Note that in this example, rS = 1 whenever Y1<Y2<Y3<Y4. Simulation shows that in fact P(rS(X, Y)=1) is about 7%:

        iter <- 10000;
        s <- c(1,1,1,4);
        sum(replicate(iter, {
                             x <- rnorm(length(s), sd=s);
                             !is.unsorted(x)  # r_S=1 if x is ordered
                            }
                     )
           )/iter
    

In a similar vein, take X = (1, 2, 3, 4, 5) to be fixed, let (Y1, Y2, Y3, Y4, Y5) be independent, jointly Gaussian with zero mean and SDs 1, 1, 1, 3, and 5, respectively. Then, under the (false) null hypothesis that every (X, Y) pairing is equally likely,

P0{rS(X, Y) = 1} = 1/5! ≈ 0.83%,

but simulation shows that the actual probability is about 2.1%. In these examples, the null hypothesis is false, but not because X and Y are dependent. It is false because not all pairings (Xi,Yj) are equally likely. It is the "identically distributed" part of the null hypothesis that fails.
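
To check the 2.1% figure, the earlier simulation adapts directly; only the vector of SDs changes:

        iter <- 10000
        s <- c(1,1,1,3,5)
        sum(replicate(iter, !is.unsorted(rnorm(length(s), sd=s))))/iter   # about 0.021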

The solar neutrino problem

The Sun emits neutrinos as a byproduct of nuclear fusion in the solar core. Nuclear theory predicted a given neutrino flux; observations showed a rather lower flux than predicted. Why? Scientists speculated about mechanisms for many years. There is an apparent negative association between sunspot number (a measure of the activity of magnetic fields in the Sun) and neutrino flux. As of about 2002, the Homestake experiment had collected about 30 years of data on the solar neutrino flux by measuring daily 37Ar production (atoms per day). (They published upper and lower 68% confidence limits.)

In the early 2000s, physicists found that neutrinos have mass. Neutrino mass could explain the apparent deficit of solar neutrinos through neutrino flavor oscillations.

References

Walther, G., 1997. Absence of correlation between the solar neutrino flux and the sunspot number, Phys. Rev. Lett., 79, 4522–4524.

Walther, G., 1999. On the solar-cycle modulation of the Homestake solar neutrino capture rate and the shuffle test, Astrophys. J., 513, 990–996.