Hypothesis Testing: Does Chance Explain the Results?

An important branch of Statistics, Statistical Decision Theory, addresses the problem of making decisions—such as choosing between two competing hypotheses about the world—on the basis of uncertain data. In an earlier chapter, we treated the "Let's Make a Deal" problem as a decision between two hypotheses: the hypothesis that switching one's guess of which door hides the prize improves the chance of winning, and the hypothesis that switching one's guess makes no difference to the chance of winning. We saw that there were two kinds of possible errors: deciding that switching was better when in fact it was not, and vice versa.

This chapter discusses rules for deciding between competing hypotheses on the basis of data that have a random component (such as draws from a box of tickets). The competing hypotheses are called the null hypothesis and the alternative hypothesis. The rules are called hypothesis tests or hypothesis testing procedures. Typically, the null hypothesis is that something is not present, that a treatment has no effect, or that there is no difference between two parameters. Typically, the alternative hypothesis is that some effect is present, that a treatment has an effect, or that two parameters differ. The main requirement of the null hypothesis is that it must be possible to compute the probability that the test rejects the null hypothesis when the null hypothesis is true. That probability is called the significance level of the test. (When in doubt, choose the simpler of the hypotheses to be the null hypothesis—usually that will lead to easier computations.)

The two types of error are as follows: rejecting the null hypothesis when the null hypothesis is in fact true is a Type I error; failing to reject the null hypothesis when the null hypothesis is in fact false is a Type II error.

Controlling the chances of these two kinds of error is crucial.

Examples of Hypothesis Testing Problems

Many questions we encounter daily can be cast as hypothesis testing problems. Here are some examples: Is there a weapon in an airline passenger's luggage? Is a defendant guilty of a crime? Does an applicant for public assistance really need it? Does a psychic really have extra-sensory perception? Did an employer discriminate in firing employees?

There is a tradeoff between Type I and Type II errors: within a given "technology" for testing hypotheses, decreasing the rate of Type I errors increases the rate of Type II errors, and vice versa. Consider the airport metal detector, for example. To increase the chance of detecting a weapon, one needs to increase the sensitivity of the detector. This will lead to more false alarms. (The detector will be triggered more frequently by belt buckles, watches, pens, etc.) To decrease the rate of false alarms, one must decrease the sensitivity, which makes it easier for someone to get a weapon through the system without setting off the alarm. There might be some other technology for detecting weapons that could be more sensitive to weapons and simultaneously have a lower false alarm rate (for example, searching every bag by hand, and frisking the passengers); the tradeoff is within a given technology.

How to Tell the Liars from the Statisticians (R. Hooke, 1983. Marcel Dekker, Inc., NY, 173pp) characterizes the difference between liberal and conservative politics in terms of Type I and Type II errors. In offering public assistance, such as welfare, a Type I error is to give public assistance to someone who is not really deserving, and a Type II error is to fail to give support to someone who really needs it. In this context, conservatives tend to find Type I errors intolerable, and liberals tend to find Type II errors intolerable. In punishing crime, the opposite is true: our legal system holds that someone is innocent until proven guilty, so a Type I error occurs if an innocent person is punished, and a Type II error occurs if a guilty person is not punished. Here, liberals tend to find Type I errors intolerable, and conservatives tend to find Type II errors intolerable. Because it is not possible to eliminate one type of error without increasing the frequency of the other type of error, the two political philosophies are at odds because they advocate opposite extremes of the error tradeoff.

The following exercises check your ability to identify null and alternative hypotheses, and Type I and Type II errors.

Significance Level and Power

The significance level of an hypothesis test is the chance that it makes a Type I error—the chance that it rejects a true null hypothesis. The significance level of a test often is denoted by the lowercase Greek letter alpha, (α).

The power of an hypothesis test against a specific alternative hypothesis is the chance that the test correctly rejects the null hypothesis when that alternative hypothesis is true; that is, the power is 100% minus the chance of a Type II error when that alternative hypothesis is true. The chance of a Type II error is often denoted by the lowercase Greek letter beta (β), so the power is (100% − β).

The significance level and the power of a test are the probability of the same event, the event that the null hypothesis is rejected. The difference between significance level and power is the assumption about the world we use to compute the probability: to compute the significance level, we assume that the null hypothesis is true; to compute the power, we assume that the alternative hypothesis is true.

Significance Level and Power

The significance level of an hypothesis test is the chance that the test rejects the null hypothesis, on the assumption that the null hypothesis is true.

The power of an hypothesis test against a particular alternative hypothesis is the chance that the test rejects the null hypothesis, on the assumption that that alternative hypothesis is true.

What makes a test good? A small significance level and high power do. We want the significance level to be small, so that a Type I error is unlikely. And we want the power to be as large as possible, so that a Type II error is unlikely. As noted previously, in most problems there is a tradeoff between these goals: Decreasing the chance of a Type I error tends to increase the chance of a Type II error, and vice versa, but there are exceptions. One way statisticians develop hypothesis tests is to consider all tests with a given significance level, and choose the one with the largest power against all plausible alternatives (this is not always possible).

A statistician of dubious judgement wants to determine whether a revolver that can hold six cartridges is loaded. He proposes to test the null hypothesis that the gun is not loaded by "spinning" the cylinder to align one of the six chambers with the barrel at random, then pulling the trigger twice in succession. If the gun goes off either time the trigger is pulled, he will conclude that the gun was loaded (reject the null hypothesis). If the gun does not go off, he will conclude that the gun was not loaded. We shall find the significance level of this test, and the power of the test against the alternative hypothesis that the gun has 1 chamber loaded and against the alternative hypothesis that the gun has 5 chambers loaded.

If the gun is not loaded, it will not go off either time the trigger is pulled, so the test cannot reject the null hypothesis erroneously: the significance level of the test is zero.

Suppose one chamber is loaded. The chance that the gun goes off is 100% minus the chance that it does not go off. Pulling the trigger twice tries two of the six chambers, so the gun does not go off only if the loaded chamber is one of the four chambers that are not tried. The chance of that is 4/6, so the chance that the gun does go off if one chamber is loaded is 2/6 = 1/3. Thus the power of the test against the alternative that one chamber is loaded is 1/3.

If five chambers are loaded, there is only one empty chamber, so if the trigger is pulled twice, the gun will go off at least once. Thus the power of the test against the alternative that five chambers are loaded is 100%.
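The arithmetic above can be checked by enumerating the six equally likely chambers at which the spin can stop. The sketch below is not part of the original example; it assumes that pulling the trigger twice tries the chamber selected by the spin and the next chamber in the cylinder, and it labels the loaded chambers 0, 1, … for convenience.

```python
# A minimal sketch of the revolver test's significance level and power,
# assuming the spin selects each of the six chambers with chance 1/6 and that
# pulling the trigger twice tries the selected chamber and the next one.
from fractions import Fraction

def chance_gun_goes_off(loaded):
    """Chance the gun goes off at least once if `loaded` chambers
    (labeled 0, ..., loaded-1) hold cartridges."""
    hits = 0
    for start in range(6):                   # chamber aligned by the spin
        tried = {start, (start + 1) % 6}     # the two chambers tried
        if any(chamber < loaded for chamber in tried):
            hits += 1
    return Fraction(hits, 6)

print(chance_gun_goes_off(0))   # significance level: 0 (an unloaded gun never fires)
print(chance_gun_goes_off(1))   # power against "one chamber loaded": 1/3
print(chance_gun_goes_off(5))   # power against "five chambers loaded": 1
```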

Test Statistics and P-values

The most common way to test an hypothesis is to choose a test statistic to compute from the data, then to reject the null hypothesis if the value of the test statistic is outside some range, or exceeds some threshold. The set of values of the test statistic for which we would reject the null hypothesis is called the rejection region. The rejection region is chosen (before collecting the data) so that if the null hypothesis is true, the chance that the test statistic is in the rejection region is at most the desired significance level. Typical values for the significance level are 10%, 5%, and 1%, but the choice is arbitrary. In some circumstances, no fixed rejection region will give exactly the desired significance level; in that case, one should choose the rejection region so that the chance of rejecting the null hypothesis if it be true is as large as possible without exceeding the significance level.

Suppose we have a family of tests that let us test the null hypothesis at any significance level p between 0 and 100%. The P-value of the null hypothesis given the data is the smallest significance level among the tests that would reject the null hypothesis. For example, let X be a test statistic, and for p between 0 and 100%, let xp be the smallest number x such that, if the null hypothesis be true,

P( X ≥ x ) ≤ p.

Then for any p between 0 and 100%, the rule

reject the null hypothesis if X ≥ xp

tests the null hypothesis at significance level p. If we observed X = x, the P-value of the null hypothesis given the data would be the smallest p such that x ≥ xp.

The smaller the P-value, the stronger the evidence against the null hypothesis.
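To make the definition concrete, here is a small sketch (not from the text) that computes a P-value of the form P(X ≥ observed value) for a test statistic with a known discrete null distribution. The Binomial(10, 50%) null distribution and the observed value 8 are hypothetical, chosen only for illustration.

```python
from math import comb

def upper_tail_p_value(null_pmf, x_observed):
    """Smallest significance level at which a test of the form
    "reject the null hypothesis if X >= x" rejects, given the observed value:
    the chance, under the null hypothesis, that X >= x_observed."""
    return sum(prob for x, prob in null_pmf.items() if x >= x_observed)

# Hypothetical example: under the null hypothesis, X ~ Binomial(10, 50%).
null_pmf = {k: comb(10, k) * 0.5 ** 10 for k in range(11)}
print(upper_tail_p_value(null_pmf, 8))   # 0.0546875, about 5.5%
```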

In the revolver example above, which tests whether a gun is loaded, the test statistic is the number of times the gun goes off when the cylinder is spun and the trigger is pulled twice. The statistician rejects the null hypothesis if the test statistic is one or greater: The rejection region is the set {1, 2}. The P-value is 100% if the gun does not go off (if the test statistic equals zero); it is zero if the gun goes off (if the test statistic equals one or two).

The following example illustrates designing a test using a test statistic, choosing a rejection region, and calculating a P-value. The example is dynamic: The data tend to change when you reload the page.

The basic steps in statistical hypothesis testing are:

  1. Formulate the null and alternative hypotheses.
  2. Specify the maximum permissible chance of a Type I error (the significance level of the test).
  3. Choose the procedure that will be used to test the hypothesis. Typically, the procedure is to compute a test statistic from the data, and to reject the null hypothesis if the value of the test statistic is in some rejection region. The rejection region is determined by insisting that the significance level be the value chosen in the previous step: the chance that the statistic is in the rejection region if the null hypothesis be true must be no larger than the significance level.
  4. Collect the data.
  5. Compute the test statistic. Reject the null hypothesis if the test statistic falls in the rejection region; otherwise, do not reject the null hypothesis. (Alternatively, report the P-value of the null hypothesis.)

Zener cards often are used to test claims of extra-sensory perception (ESP), such as telepathy (mind reading) and clairvoyance (knowing something without perceiving it with the usual five senses). Each Zener card has one of five geometric figures on it: a star, a square, a circle, wavy lines, or a plus sign. Zener cards were developed by Dr. Karl Zener, of Duke University, and were first used to study ESP by Dr. J.B. Rhine (1895–1980), who allegedly coined the term extra-sensory perception.

Consider using Zener cards to test a psychic's claimed ability to sense what card someone is looking at. Imagine shuffling the five cards well, then looking at each one in turn (without showing the card to the psychic). Each time we look at a card, the psychic writes down which of the five cards she thinks we are looking at.

Rules

  1. The psychic does not get to learn the actual symbol on the card, nor whether she was right in any particular case, until we have gone through the whole deck.
  2. We do not look at what the psychic wrote until we have gone through the whole deck.
  3. The psychic must assign every symbol to exactly one card. For example, the psychic cannot say that the first card and the third card both are labeled with circles.

We do not reveal the correct answer to the psychic, nor tell her whether her determination was right or wrong, until we have passed through the entire deck. Otherwise, she could use that information to improve subsequent guesses. For example, if the psychic gets to see each card after making her determination, she is guaranteed to be able to get the last card right by a process of elimination. We should not learn the psychic's determinations until the test is over, so that we do not inadvertently give information away through facial expressions, etc. We insist that the psychic not repeat a symbol; otherwise, the psychic could be certain to identify at least one card correctly, merely by repeating a single determination five times (for example, saying every card is marked with the circle).

Our null hypothesis will treat the psychic's determinations as a fixed permutation of the five cards. The order of the shuffled deck is equally likely to be any of the 5! permutations of the cards. The test statistic will be the number X of cards that are in the same place in the psychic's permutation as they are in the shuffled deck. We will use the observed value of X to decide between the hypothesis that the psychic has ESP, and the hypothesis that the psychic does not have ESP.

The following exercises test your understanding of the Zener card example.


Calculating the probability distribution of X is complicated; the calculation appears in a footnote. The following table shows the probability distribution of X if the null hypothesis is true.

Null probability distribution of the number of "hits" in the Zener card test
x P(X = x)
0 44/5! = 11/30
1 45/5! = 3/8
2 20/5! = 1/6
3 10/5! = 1/12
4 0/5! = 0
5 1/5! = 1/120
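The distribution in the table can be reproduced by brute force, because there are only 5! = 120 equally likely orders for the shuffled deck. The sketch below is not part of the original text; it fixes the psychic's determinations as one particular permutation (which one does not matter) and counts matches over all possible deck orders. It also shows why a test at significance level 5% can reject only when X = 5.

```python
from fractions import Fraction
from itertools import permutations

guess = (0, 1, 2, 3, 4)                    # the psychic's fixed permutation
counts = {}
for deck in permutations(range(5)):        # all 5! = 120 equally likely orders
    hits = sum(g == d for g, d in zip(guess, deck))
    counts[hits] = counts.get(hits, 0) + 1

pmf = {x: Fraction(c, 120) for x, c in sorted(counts.items())}
for x, p in pmf.items():
    print(x, p)      # 0 11/30, 1 3/8, 2 1/6, 3 1/12, 5 1/120, as in the table

# The rejection region {X >= 4} has significance level 1/120 (about 0.83%),
# because X = 4 is impossible; {X >= 3} would have level 11/120 (about 9.2%),
# which exceeds 5%.
print(pmf.get(4, 0) + pmf[5], pmf[3] + pmf[5])
```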

The following exercises check your ability to use this probability distribution to construct an hypothesis test for ESP.

Hypotheses about Parameters: One-sided and Two-sided Alternative Hypotheses

Quite commonly, the null hypothesis is that a parameter μ equals some particular value a (the null value), and the alternative hypothesis is that μ is greater than a, that μ is less than a, or simply that μ is not equal to a. (The first two are one-sided alternative hypotheses; the last is a two-sided alternative hypothesis.) Many of the examples from the beginning of this chapter can be written this way.

In these examples, all the alternative hypotheses are one-sided: they assert that the value of the parameter μ is on one side of the null value a. That is, each null hypothesis asserts that μ = a, and each alternative hypothesis either asserts that μ < a, or it asserts that μ > a. In contrast, if we wanted to test whether a coin was fair, the null hypothesis would be (chance of tails = 50%), and the alternative hypothesis could be (chance of tails >50% or <50%). That is a two-sided alternative hypothesis: it asserts that μ is not equal to a.

A good test has as much power as it can against every plausible alternative—while maintaining its significance level. An hypothesis test about the value of a parameter that is designed to have as much power as possible against alternative values of the parameter on both sides of the null value is called a two-sided test. A test that is designed to have as much power as possible against alternative values of the parameter on only one side of the null value is called a one-sided test.

To pick a rejection region given a test statistic X, a null hypothesis, and an alternative hypothesis, we think about how the distribution of the test statistic under the null hypothesis differs from its distribution under the alternative hypothesis. If the test statistic is likely to be larger if the alternative hypothesis be true than if the null hypothesis be true, it makes sense to use a rejection region of the form {X > x0}; we would choose x0 so that if the null hypothesis is true, the chance that X > x0 is at most the significance level. If the test statistic is likely to be smaller if the alternative hypothesis is true than if the null hypothesis is true, it makes sense to use a rejection region of the form {X < x0}; we would choose x0 so that if the null hypothesis is true, the chance that X < x0 is at most the significance level. If the test statistic is likely to be further from some reference point x0 if the alternative hypothesis be true than if the null hypothesis be true, it makes sense to use a rejection region of the form

{X < x1} ∪ {X > x2};

we would choose x1 and x2 so that the chance that X is in the rejection region if the null hypothesis is true is at most the significance level; we would also tend to choose them so that the probability that X < x1 is equal to the probability that X > x2 if the null hypothesis is true. The following exercises check whether you understand when to use a one-sided test and when to use a two-sided test.
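As a concrete illustration of choosing thresholds, the sketch below (not from the text) picks a one-sided and an equal-tailed two-sided rejection region at the 5% level for a hypothetical test statistic whose null distribution is Binomial(20, 50%), as might arise in testing whether a coin is fair on the basis of 20 tosses.

```python
from math import comb

# Hypothetical null distribution: X ~ Binomial(20, 50%), the number of tails
# in 20 tosses of a fair coin.
null_pmf = {k: comb(20, k) * 0.5 ** 20 for k in range(21)}

def upper_threshold(null_pmf, level):
    """Smallest x0 with P(X > x0) <= level under the null hypothesis."""
    for x0 in sorted(null_pmf):
        if sum(p for x, p in null_pmf.items() if x > x0) <= level:
            return x0

def two_sided_thresholds(null_pmf, level):
    """Equal-tailed x1 and x2: P(X < x1) <= level/2 and P(X > x2) <= level/2."""
    values = sorted(null_pmf)
    x1 = max(v for v in values
             if sum(p for x, p in null_pmf.items() if x < v) <= level / 2)
    x2 = min(v for v in values
             if sum(p for x, p in null_pmf.items() if x > v) <= level / 2)
    return x1, x2

print(upper_threshold(null_pmf, 0.05))       # reject if X > 14 (one-sided test)
print(two_sided_thresholds(null_pmf, 0.05))  # reject if X < 6 or X > 14 (two-sided)
```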


Case Study: Employment Discrimination Arbitration

This example is based on a true story. The names have been changed, but other than that, the facts are stated as I understand them.

Service, Inc., provides janitorial services under contract to large organizations. Because of the nature of their business, the turnover of their service employees tends to be somewhat high. A number of people who had been fired from service positions at a particular branch of Service, Inc., between 22 June 1996 and 8 September 1997, filed suit against Service, Inc., claiming that they were discriminated against on the basis of gender, age, and/or ethnicity. In particular, the suit alleged that women over the age of 40 were fired more often than other groups. I was retained in late 1997 to examine summary employment data for evidence of discrimination on the basis of gender, age, and ethnicity.

I was given summary employment listings for 143 service employees who had worked for Service, Inc., at that location, at any time between 22 June 1996 and 8 September 1997. The summary listings included age, gender, and ethnicity, for all but two of the employees. Those employees had Hispanic surnames; I imputed their ethnicity to be Hispanic. The summary listings also indicated whether the employee was still working for Service, and if not, whether their termination was voluntary (resignation, leave-of-absence, etc.) or involuntary (the person was fired). Among the 143 entries, 24 recorded involuntary terminations; the remaining 119 were for individuals who were still employed, on leave, or had left voluntarily.

I divided the employees into two groups by age: those whose employment by Service, Inc., ended before their 40th birthday, or who were still employed but were not yet 40 years old as of 8 September 1997; and those whose employment ended after their 40th birthday, or who were still employed but were at least 40 years old as of 8 September 1997.

The following tables show the genders, ethnicity groups, and age groups of the 143 employees, broken down by whether or not they were terminated involuntarily (fired).

Termination by gender
Termination Female Male Total
Involuntary 14 10 24
Other 67 52 119
Total 81 62 143

 

Termination by ethnicity: white versus other
Termination White (1) Other (2, 3, 4, 5) Total
Involuntary 8 16 24
Other 44 75 119
Total 52 91 143

 

Termination by ethnicity: all ethnicity categories
Termination 1 2 3 4 5 Total
Involuntary 8 8 3 0 5 24
Other 44 23 29 3 20 119
Total 52 31 32 3 25 143

 

Termination by age group
Termination Under 40 40 and Over Total
Involuntary 14 10 24
Other 61 58 119
Total 75 68 143

 

Termination by gender and age group
Termination Female, under 40 Female, 40 and over Male, under 40 Male, 40 and over Total
Involuntary 7 7 7 3 24
Other 25 42 36 16 119
Total 32 49 43 19 143

 

Termination by gender and ethnicity
Termination Female: 1 2 3 4 5 Male: 1 2 3 4 5 Total
Involuntary 3 6 1 0 4 5 2 2 0 1 24
Other 28 16 10 1 12 16 7 19 2 8 119
Total 31 22 11 1 16 21 9 21 2 9 143

*Ethnicity: 1, White; 2, Black; 3, Hispanic; 4, Native American/Alaskan; 5, Asian and Pacific Islander.

 

Analysis

How might we assess whether Service, Inc., discriminated in firing on the basis of age, gender, and/or ethnicity? One way is to ask whether the age, gender, and ethnicity breakdown of the 24 involuntarily terminated employees is surprisingly different from the breakdown that would be expected had 24 of the 143 employees been selected at random. This is not to suggest that people really are fired at random, nor that competence, reliability, and adequate job performance are necessarily equal (even on the average) for different demographic groups. Rather, the question is whether the assumption that involuntary terminations were blind to age, gender, and ethnicity is compatible with the data. We shall take the total number of people terminated involuntarily as a given, 24 (we shall condition on the number of people terminated involuntarily).

For example, consider the table of termination by gender. Of the 143 employees, 81 were female (81/143 = 56.64%); of the 24 employees who were fired, 14 were female (14/24 = 58.33%). Suppose that 24 employees were selected at random without replacement from the 143 employees in the period in question. Would it be surprising if 14 or more of those 24 employees were women?

There is a Federal case, Equal Employment Opportunity Commission v. Federal Reserve Bank of Richmond, 673 F.2d 798 (1983), that says one should look at whether the firing rates are surprisingly large or surprisingly small in making this assessment (in statistical parlance, one should use two-sided rather than one-sided hypothesis tests). That is, we should look at the mean number of women in all possible simple random samples of 24 employees from the 143, and look at the difference between that average and the number of women actually fired. If that difference is surprisingly large (if the number fired is much larger or much smaller than the average), there is prima facie evidence of discrimination—possibly reverse discrimination. If differences that large or larger are relatively likely to occur in a simple random sample of 24 employees from the 143, there is no prima facie evidence of discrimination: the "luck of the draw" is sufficient to explain the observed difference.

The number of women in a simple random sample from the employees is like the number of tickets labeled "1" in n draws without replacement from a box of N tickets of which G are labeled "1" and the rest are labeled "0." That number has an hypergeometric distribution with parameters N, G and n. We saw previously that the expected value of the number of tickets labeled "1" is n×G/N. The expected value of the number of women in a simple random sample of 24 employees is thus

24 × (81/143) = 13.594 women.

One would not expect to see a fractional number of women in the sample; nonetheless, this is how the expected value is defined (it is the long-run average number of women in repeated simple random samples of size 24, or the probability-weighted average of the possible number of women in the sample). Because of the luck of the draw, the number of women in the sample will vary from draw to draw, but it is likely to be in a range around 14. The chance of each possible number of women in the sample (from 0 to 24) is given by the hypergeometric distribution.
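A short sketch, not part of the original analysis, of this hypergeometric computation: the expected number of women in a simple random sample of 24 of the 143 employees, and the chance that the sample contains 14 or more women.

```python
from math import comb

N, G, n = 143, 81, 24        # employees, women among them, sample size

def chance_of_k_women(k):
    """Hypergeometric chance the simple random sample contains exactly k women."""
    return comb(G, k) * comb(N - G, n - k) / comb(N, n)

print(n * G / N)                                            # expected value: 13.59...
print(sum(chance_of_k_women(k) for k in range(14, n + 1)))  # chance of 14 or more women
```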

Similarly, the number of people under the age of 40 in the sample, and the number of people of each ethnicity in the sample, all have hypergeometric distributions with N = 143 and n = 24, but with different values of G.

What happens when we want to look at more than two groups at a time, for example, the breakdown by age and gender? For example, what is the chance that a random sample of 24 of the employees has 7 women under 40, 7 women 40 or older, 7 men under 40, and 3 men 40 or over?

If we were to make a box model for the draws, the tickets in the box would have more than two labels (male under 40, male 40+, female under 40, female 40+), so this is not like the number of tickets labeled "1" in draws from a box that has tickets labeled "0" and "1." However, we can use the same kind of reasoning we have used in the last few chapters to figure out the chance. It is just like the chance of a particular card hand when dealing a 24-card hand from a deck of 143 cards that has only one suit and four kinds of cards, with different numbers of each kind of card. One kind of card corresponds to males under 40, one kind to males 40 and over, etc.

The total number of ways to draw 24 employees without replacement from the 143 is 143C24. These are equally likely in simple random sampling—indeed, that is the definition of a simple random sample. How many of those ways result in 7 women under age 40, 7 women age 40 or older, 7 men under age 40, and 3 men age 40 or over? We can use the fundamental rule of counting. There are 32 female employees under age 40, so there are 32C7 ways to select 7 of them. There are 49 female employees age 40 and over, so there are 49C7 ways to select 7 of them. There are 43 male employees under age 40, so there are 43C7 ways to select 7 of them. There are 19 male employees age 40 and over, so there are 19C3 ways to select 3 of them. By the fundamental rule of counting, there are

32C7×49C7×43C7 ×19C3

ways to make all these choices. The chance that a simple random sample of 24 employees has 7 women under age 40, 7 women age 40 or older, 7 men under age 40, and 3 men age 40 or over is thus

32C7×49C7×43C7×19C3
--------------------------- = 0.81%.
143C24
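The following sketch, not part of the original text, carries out this counting argument numerically; the group sizes come from the table of termination by gender and age group.

```python
from math import comb

group_sizes = (32, 49, 43, 19)   # women under 40, women 40+, men under 40, men 40+
observed    = ( 7,  7,  7,  3)   # the breakdown among the 24 fired employees

ways = 1
for size, k in zip(group_sizes, observed):
    ways *= comb(size, k)        # fundamental rule of counting

chance = ways / comb(143, 24)
print(f"{chance:.2%}")           # about 0.81%, as computed above
```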

To assess whether the data evidence discrimination, I calculated the chance that the ethnicity, gender, and age proportions of a group of 24 employees chosen at random from the population of 143 would differ from the corresponding proportions among the 143 by as much or more than observed. These are the P-values for the null hypotheses of no discrimination on the basis of gender, age, and ethnicity, singly and in combination.

A large probability indicates that the departure from proportional representation can be accounted for reasonably by chance variation—the luck of the draw. A small probability is prima facie evidence of discrimination (assuming that the terminations were not for cause). It is my understanding that in discrimination cases, the threshold probability for inferring that discrimination has taken place is at most 5%. The results are shown in the following table. None of the divisions shows a surprising departure from its expected value in a simple random sample of 24 employees.

P-values for the null hypotheses of "no discrimination"
Subgroup(s) Probability
Gender 82.4%
Age 82.6%
Gender and age 8.9%
White v. other 81.9%
All ethnicities 99.4%
All ethnicities and gender 99.9%

Conclusions

If 24 of the 143 employees were terminated completely at random, there would be more than an 80% chance that the proportions of protected minorities terminated involuntarily would differ from their corresponding proportions in the employee pool by at least as much as these data show. The data are quite consistent with the hypothesis that there was no discrimination by age, ethnicity, or gender. Empirically, employees age 40 and over were less likely to be fired than employees under age 40, with women age 40 and over less likely to be fired than men age 40 and over.

The following exercise asks you to test the null hypothesis that two probabilities are equal. As is the case in the previous example of no discrimination, the null hypothesis—that a decision is made randomly—is contrived. Nonetheless, the large P-value is suggestive.

Caveats

Hypothesis tests need to be interpreted with care. Rejecting the null hypothesis does not mean that the null hypothesis is false, nor does failing to reject the null mean that the null hypothesis is true. Practical importance and statistical significance have little to do with each other. P-values often are misinterpreted. The number of tests performed matters. The fraction of rejected null hypotheses that are rejected in error depends on more than just the significance level.

The Meaning of Rejection

In testing hypotheses, we speak of rejecting the null hypothesis or not rejecting the null hypothesis. We do not speak of accepting the null hypothesis or the alternative hypothesis. Statisticians use data to show that some possibilities are implausible, but there are always many possible explanations for the data we observe. Moreover, if the data are poor or few in number, typically they cannot provide strong evidence against the null hypothesis, even if the null hypothesis be false. We should not interpret poor or inconclusive data as supporting the null hypothesis. Sometimes we can rule out an hypothesis as being inconsistent with the data (if the data are extremely unlikely on the assumption that the hypothesis is true), but the set of hypotheses that are consistent with the data usually contains more than just the null and alternative hypotheses. The precise statistical statement when we reject a null hypothesis is that either the null hypothesis is false, or an event has occurred that has probability no larger than the significance level.

Rejecting the null hypothesis

Not rejecting the null hypothesis does not mean that the null hypothesis is true, nor that the data support the null hypothesis.

In particular, if the data are few or poor, it is hard for a test to have much power—it is hard for a test to reject a false null hypothesis.

Rejecting the null hypothesis does not mean that the alternative hypothesis is true.

It means that either the null hypothesis is false, or an event has occurred that has probability no larger than the significance level.

It is not hard to construct a "straw man" null hypothesis that will succumb to the slightest contact with data.

Statistical Significance and Practical Importance

If the null hypothesis is rejected, one says that the effect or test is "statistically significant at level___," where the significance level or the P-value goes in the blank. "At level___" often is omitted, which makes it impossible to know what the chance of a false alarm might be. All too often, the word "statistically" is dropped too, leading one to think that the effect is important, not merely detectable. The difference between importance and detectability is considerable. A small, unimportant effect can be detected if there are sufficiently many data of sufficiently high quality. Conversely, an effect can be both large and important, but not statistically significant if the data are few or of low quality. That can lead to peculiar locutions, such as "no other leading brand has been shown to surpass ZZZ." Aside from the ambiguity in the word "leading," one might not reject the null hypothesis that no brand is better than ZZZ because ZZZ really is at least as good as all other brands, or because the data are too few or of too low quality to allow one to detect that another brand actually is better than ZZZ.

Statistical Significance and Practical Importance

Practical significance (importance) and statistical significance (detectability) have little to do with each other.

An effect can be important, but undetectable (statistically insignificant) because the data are few, irrelevant, or of poor quality.

An effect can be statistically significant (detectable) even if it is small and unimportant, if the data are many and of high quality.

Interpreting P-values

A common mistake in hypothesis testing is to misinterpret the P-value or significance level; in particular, to consider the P-value or significance level to be the probability that the null hypothesis is true. The data have a random component, but the truth of the null hypothesis is not random—the null hypothesis is either true or false, regardless of what data we observe. The P-value is a probability computed on the assumption that the null hypothesis is true. As is the case for confidence intervals, chance is meaningful only before the data are collected. The null hypothesis is either true, or not. Once the data have been collected, there is no chance left: the hypothesis testing procedure either rejects the null hypothesis, or not. Depending on whether the null hypothesis is true, an error occurs, or not.

Multiplicity and Data Mining

The significance levels we have been computing are for testing a single hypothesis with a single test. Suppose we were interested in whether any of the "brain cocktails" sometimes served at parties is effective in increasing mental acuity.

We test the effectiveness of 10 different types of cocktails using the methodology described in this chapter, using a 5% significance level for each test. We use different individuals to test each kind of cocktail; we assume that the outcomes of the tests are independent. This is an example of multiplicity: testing more than one hypothesis simultaneously.

Suppose we go through the protocol, and cocktail X shows up as having a significant effect; that is, we reject the hypothesis that cocktail X has no effect, at significance level 5%. On the face of it, it appears that cocktail X improves mental acuity (the cocktail increases acuity, however we measure it, or an event has occurred that has chance no larger than 5%). It would seem, therefore, that we could reject the null hypothesis that none of the brain cocktails is effective, at significance level 5%—but that is not the correct significance level. The question we need to ask is, "if none of the cocktails really had an effect, what would be the chance of getting at least one positive result?"

The grand null hypothesis is that none of the cocktails makes a difference. The alternative hypothesis is that at least one of them improves mental acuity. If the grand null hypothesis is true, what is the chance that at least one of the tests gives a false positive result?

If the grand null hypothesis is true, the number of false positives has the same probability distribution as the sum of 10 draws made at random with replacement from a 0-1 box containing 5 tickets labeled "1" and 95 tickets labeled "0."

That distribution is binomial, with parameters n = 10 and p = 5%. The chance that we get at least one ticket labeled "1" is 100% minus the chance that we get no ticket labeled "1":

chance of at least one false positive = 100% − (chance of no false positive)

= 100% − (95%)^10

= 100% − 60%

= 40%.

Even though we test the individual hypotheses that each cocktail has no effect at significance level 5%, the resulting significance level for testing the "grand" null hypothesis is 40%.
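Here is the same calculation as a short sketch (not from the text), under the stated assumptions that the 10 tests are independent and that each has significance level exactly 5%.

```python
n_tests = 10
alpha = 0.05   # significance level of each individual test

# Chance of at least one false positive if all 10 null hypotheses are true.
chance_at_least_one = 1 - (1 - alpha) ** n_tests
print(f"{chance_at_least_one:.1%}")   # about 40%
```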

This is typical of the effect of multiplicity on the significance level: The more tests performed, the greater the chance of a false positive. If you test many hypotheses at, say, 5% significance level, using independent data, you should expect that in the long run you would erroneously reject about 5% of the null hypotheses that are in fact true.

Beware studies that apply many different tests or test many different hypotheses from the same data, and claim a significant result. Often, such studies neglect the effect of multiplicity, and the chance of a false positive is much higher than the authors recognize. Applying many hypothesis tests to the same data in search of a significant result is known in Statistics as "data mining" or "data snooping."

Garbage in, garbage out

Often in science it is the hypothesis rejections that are interesting. Typically, rejecting a null hypothesis means deciding that some effect is present or important; this is called a "discovery." At one extreme, every discovery is true—every rejected null hypothesis is in fact false. At the other extreme, every discovery is false—every rejected null hypothesis is in fact true. Typically, only the "discoveries" are brought to our attention. Few scientists seek to publish negative results. Consequently, we see primarily the rejections in tests of a population of hypotheses of which an unknown fraction really are false. Testing hypotheses at significance level 5% does not mean that 5% of the rejections are erroneous. The fraction of erroneous rejections depends on the fraction of true null hypotheses, and can be anywhere between 0% and 100%, regardless of the significance level of the tests.

Suppose one tests a large collection of hypotheses. Among those that are not rejected, what fraction are true? Among those that are rejected, what fraction are false? This question cannot be answered unless one knows what proportion of null hypotheses tested are false, as simple reasoning shows:

Suppose a fraction t of the null hypotheses tested are in fact true, so the fraction of null hypotheses that are false is (1−t). If t = 100%, every hypothesis that is rejected is rejected erroneously, and every hypothesis that is not rejected is really true—every error we make is a Type I error. At the other extreme, if t = 0, every hypothesis that is rejected is rejected correctly, and every hypothesis that is not rejected is in fact false—every error we make is a Type II error. No matter how good the test is, it cannot make a true hypothesis false, nor a false hypothesis true. Unless the test always rejects, if fed a steady diet of false hypotheses, it will fail to reject some of them. Unless the test never rejects, if fed a steady diet of true hypotheses, it will erroneously reject some of them. These remarks can be summarized as "garbage in, garbage out."

Suppose that every test performed has the same power. The chance that a false null hypothesis is rejected is the power, and the chance that a true null hypothesis is rejected is the significance level, so the long-run fraction of rejected hypotheses that are in fact false is

(1−t)×power
----------------------------------------
t×(significance level) + (1−t)×power

(the numerator is the expected fraction of false hypotheses that are rejected; the denominator is the sum of the expected fraction of false hypotheses that are rejected and the expected fraction of true hypotheses that are rejected). The long-run fraction of hypotheses that are not rejected and that are in fact true is

t × (100% − significance level)
------------------------------------------------------------
t × (100% − significance level) + (1−t) × (100% − power)

(the numerator is the expected fraction of true hypotheses that are not rejected; the denominator is the sum of the expected fraction of true hypotheses that are not rejected and the expected fraction of false hypotheses that are not rejected).
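A minimal sketch, not from the text, of these two long-run fractions. The values of t, the significance level, and the power used in the example call are hypothetical, chosen only to illustrate the formulas.

```python
def fraction_of_rejections_that_are_false_nulls(t, level, power):
    """Long-run fraction of rejected null hypotheses that are in fact false."""
    return (1 - t) * power / (t * level + (1 - t) * power)

def fraction_of_non_rejections_that_are_true_nulls(t, level, power):
    """Long-run fraction of non-rejected null hypotheses that are in fact true."""
    return t * (1 - level) / (t * (1 - level) + (1 - t) * (1 - power))

# Hypothetical example: 80% of the nulls tested are true; each test has
# significance level 5% and power 50%.
print(fraction_of_rejections_that_are_false_nulls(0.8, 0.05, 0.5))     # about 0.71
print(fraction_of_non_rejections_that_are_true_nulls(0.8, 0.05, 0.5))  # about 0.88
```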

Relatively recently, statisticians have developed hypothesis testing methods that keep the rate of false discoveries under control (see the work of Benjamini, Hochberg, and others). That is, the methods guarantee that the fraction of "discoveries" that are erroneous rejections of true null hypotheses does not exceed a specified limit (such as 5%). These methods control what is called "the false discovery rate." Perhaps those methods will be used commonly in the future. Until then, it is prudent to keep in mind that the proportion of real discoveries among the claims of "discoveries" is unknown. Scientists are much more likely to report positive results than negative ones, and scientific journals are much more likely to publish positive reports than negative ones, so the majority of hypothesis tests that are reported are the "discoveries." The failures to reject null hypotheses are rarely reported. Thus it is plausible that many published "discoveries" are in fact erroneous rejections of null hypotheses.

Summary

Many scientific and practical questions can be posed as decisions between competing hypotheses or theories about the world: a null hypothesis and an alternative hypothesis. Two kinds of error are possible: rejecting a true null hypothesis (a Type I error), and failing to reject a false null hypothesis (a Type II error). If the data on which the decision is based have a random component, the rule is called a statistical hypothesis test or a test of significance. The chance an hypothesis test commits a Type I error is the significance level of the test—the chance of a false alarm. The chance that the test correctly rejects the null hypothesis when a particular alternative hypothesis is true is the power of the test against that alternative. The chance of a Type II error when a particular alternative is true is 100% minus the power against that alternative. When an hypothesis test rejects the null hypothesis, either the null hypothesis is false, or an event occurred that has probability no larger than the significance level of the test. It does not mean that the alternative hypothesis is true. When the null hypothesis is not rejected, it does not mean that the null hypothesis is true. In particular, if the data are few, irrelevant, or poor, any test will have trouble rejecting a false null hypothesis. Given a family of hypothesis tests that allow a null hypothesis to be tested at any significance level between 0 and 100%, the P-value of the null hypothesis for the observed data is the smallest significance level for which any of the tests would reject the null hypothesis.

Statistical significance should not be confused with practical importance. Statistical significance has to do with detectability, which depends on the number and quality of data, among other things. The significance level or P-value is not the probability that the null hypothesis is true. In fact, the P-value and significance level are both computed using the assumption that the null hypothesis is true. The power is computed using the assumption that the alternative hypothesis is true.

Hypothesis tests usually are based on a test statistic: a random variable computed from the data. If the test statistic is in the rejection region, the null hypothesis is rejected. The rejection region is chosen subject to the constraint that if the null hypothesis is true, the chance that the test statistic will be in the rejection region is at most the significance level, so that the test has the desired significance level. The rejection region should also be chosen so that the test will have good power against the alternatives that are contemplated: The chance that the test statistic is in the rejection region should be higher when the alternative hypothesis is true than when the null hypothesis is true. For significance levels to be meaningful, the null hypothesis, test statistic, significance level, and rejection region all must be chosen before the data are collected.

Many hypotheses can be written in terms of the value of a parameter μ, the null hypothesis being that μ = μ0, the null value; and the alternative hypothesis being that μ≠μ0 (a two-sided alternative hypothesis), that μ>μ0 (a one-sided alternative hypothesis), or that μ<μ0 (a one-sided alternative hypothesis). Hypothesis tests designed to have good power against two-sided alternatives are called two-sided tests; tests designed to have good power only against one-sided alternatives are called one-sided tests.

The significance level of a test controls the long-run fraction of true null hypotheses that are rejected erroneously, but not the fraction of rejected null hypotheses that are rejected erroneously: That fraction depends on the significance level, the power, and the fraction of null hypotheses tested that are true. If every null hypothesis tested is false, every rejection is a correct rejection; if no null hypothesis tested is false, every rejection is an erroneous rejection, no matter how good the test. When many hypotheses are tested, the chance that at least one Type I error occurs is much larger than the chance of a Type I error in each individual test; this is the issue of multiplicity. Because only rejections of null hypotheses tend to be reported in scientific literature (as "discoveries"), it is likely that a noticeable fraction of scientific results are Type I errors—false discoveries.

Key Terms