Hypothesis Testing
1 Hypothesis Testing
Much of classical statistics is concerned with the idea of hypothesis testing. This
is a formal framework that we can use to pose questions about a variety of topics in
a consistent form that lets us apply statistical techniques to make statements about
how results that we've gathered relate to questions that we're interested in. If we
carefully follow the rules of hypothesis testing, then we can confidently talk about
the probability of events that we've observed under a variety of hypotheses, and put
our faith in those hypotheses that seem the most reasonable. Central to the idea of
hypothesis testing is the notion of null and alternative hypotheses. We want to present
our question in the form of a statement that assumes that nothing has happened (the null
hypothesis) versus a statement that describes a specific relation or condition
that might better explain our data.
For example, suppose we've got a coin, and we want to find out if it's true, that is, whether,
when we flip the coin, we're just as likely to see heads as tails. A null
hypothesis for this situation could be
H0: We're just as likely to get heads as tails when we flip the coin.
A suitable alternative hypothesis might be:
Ha: We're more likely to see one of heads or tails than the other when we flip the coin.
An alternative hypothesis such as this is called two-sided, since we'll reject the null
hypothesis if heads are more likely or if tails are more likely. We could use a one-sided
alternative hypothesis like "We're more likely to see tails than heads," but unless you've
got a good reason to believe that it's absolutely impossible to see more heads than tails,
we usually stick to two-sided alternative hypotheses.
Now we need to perform an experiment in order to help test our hypothesis. Obviously,
tossing the coin once or twice and counting up the results won't help very much. Suppose
I say that in order to test the null hypothesis that heads are just as likely as tails,
I'm going to toss the coin 100 times and record the results. Carrying out this experiment,
I find that I saw 55 heads and 45 tails. Can I safely say that the coin is true?
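As an aside, if you don't have a coin handy, one way to sketch an experiment like this in R
is to simulate it; rbinom draws the number of heads in 100 tosses of a fair coin directly,
and the count you get will of course vary from run to run.
> # simulate one experiment: the number of heads in 100 tosses of a fair coin
> rbinom(1,100,.5)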
Without some further guidelines, it would be very hard to say if this deviation from 50/50
really provides much evidence one way or the other. To proceed any further, we have to
have some notion of what we'd expect to see in the long run if the null hypothesis was true.
Then, we can come up with some rule to use in deciding if the coin is true or not, depending
on how willing we are to make a mistake. Sadly, we can't make statements with complete
certainty, because there's always a chance that, even if the coin was true, we'd happen to
see more heads than tails, or, conversely, if the coin was weighted, we might just happen
to see nearly equal numbers of heads and tails when we tossed the coin many times. The
way we come up with our rule is by stating some assumptions about what we'd expect to see
if the null hypothesis was true, and then making a decision rule based on those assumptions.
For tossing a fair coin (which is what the null hypothesis states), most statisticians
agree that the number of heads (or tails) that we would expect follows what is called a
binomial distribution. This distribution takes two parameters: the theoretical probability
of the event in question (let's say the event of getting a head when we toss the coin), and
the number of times we toss the coin. Under the null hypothesis, the probability is .5.
In R, we can see the probability of getting any particular number of heads if we
tossed a coin 100 times by plotting the
density function for the binomial distribution with parameters 100 and .5:
> barplot(dbinom(0:100,100,.5),names.arg=0:100)
As you'd probably expect, the most common value we'd see would be 50, but there are
lots of reasonable values that aren't exactly 50 that would also occur pretty often.
So to make a decision rule for when we will reject the null hypothesis, we would
first decide how often we're willing to reject the null hypothesis when it was actually
true. For example, looking at the graph, there's a tiny probability that we'd see
as few as 30 heads or as many as 70 heads when we tossed a true coin 100 times.
But if we tossed a coin 100 times and only saw 30 heads, we'd be wise to say the
coin wasn't true, because 30 heads out of 100 is such a rare event. On the other hand,
we can see on the graph that the probability of getting, say 45 heads is still pretty
high, and we might very well want to accept the null hypothesis in this case.
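To put rough numbers on what the graph shows, we can evaluate dbinom at the counts
mentioned above; the values quoted in the comments are approximate.
> dbinom(50,100,.5)   # the single most likely count, roughly 0.08
> dbinom(45,100,.5)   # still fairly likely, roughly 0.05
> dbinom(30,100,.5)   # very unlikely for a true coin, around 2e-05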
Without any compelling reason to choose otherwise, people will usually accept an
error rate of 5% when performing a hypothesis test, although you might want to alter that
value in situations where making such an error is more or less of a problem than usual.
With a two-sided hypothesis like this one, to come up with a decision rule that would
be wrong only 5% of the time, we'd need to find the number of heads for which we'd
see fewer heads 2.5% of the time, and the number of heads for which we'd see more
heads 2.5% of the time. To find quantities like this for the binomial distribution,
we can use the qbinom function. The lower limit we want is the result of
calling qbinom with an argument of 0.025, and the upper limit is the result
of calling qbinom with an argument of 0.975:
> qbinom(.025,100,.5)
[1] 40
> qbinom(.975,100,.5)
[1] 60
So in this experiment, if the number of heads we saw was between 40 and 60 (out of 100
tosses), we would tentatively accept the null hypothesis; formally we would say that there's
not enough evidence to reject the null hypothesis. If we saw fewer than 40 or more than
60 heads, we would say we had enough evidence to reject the null hypothesis, knowing that
we would be wrong only 5% of the time. Rejecting the null hypothesis when it is
actually true is known as a Type I error. So in this example, we're setting the Type
I error rate to 5%.
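Because the binomial distribution is discrete, the actual chance of landing in this rejection
region when the coin really is true comes out a little under the nominal 5%; we can check that
with pbinom, and note that the 55 heads we observed earlier fall inside the acceptance region,
so for that experiment we would not reject the null hypothesis.
> # P(fewer than 40 heads) + P(more than 60 heads) for a true coin
> pbinom(39,100,.5) + (1 - pbinom(60,100,.5))   # roughly 0.035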
To summarize, the steps for performing a hypothesis test are:
- Describe a null hypothesis and an alternative hypothesis.
- Specify a significance or alpha level for the hypothesis test. This is the percent
of the time that you're willing to be wrong when you reject the null hypothesis.
- Formulate some assumptions about the distribution of the statistic that's involved
in the hypothesis test. In this example we made the assumption that a fair coin follows
a binomial distribution with p=0.50.
- Using the assumptions you made and the alpha level you decided on, construct the
rejection region for the test, that is, the values of the statistic for which you'll be
willing to reject the null hypothesis. In this example, the rejection region is broken
up into two sections: fewer than 40 heads and more than 60 heads.
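R also has a built-in function, binom.test, that carries out a test of a binomial proportion
in a single call. Applied to the 55 heads from the earlier experiment it might look like the
following; the p-value it reports is the smallest alpha level at which we would reject the
null hypothesis, and here it comes out well above .05, so once again we would fail to reject.
> binom.test(55,100,p=.5)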
Although it's not always presented in this way, one useful way to think about hypothesis
testing is that if you follow the above steps and your test statistic is in the rejection
region, then either the null hypothesis is not reasonable or your assumptions are wrong.
By following this framework, we can test a wide variety of hypotheses, and it's always
clear exactly what we're assuming about the data.
Most of hypothesis testing is focused on the null hypothesis, and on possibly rejecting it.
There are two main reasons for this. First, there is only one null hypothesis, so we don't
have to consider a variety of possibilities. Second, as we've seen, it's very easy to
control the error rate when rejecting the null hypothesis. The other side of the coin
has to do with the case where the null hypothesis is not true. This case is much more
complicated because there are many different ways to reject the null hypothesis, and it
would be naive to believe that they'd all behave the same way. When the null hypothesis
is not true, we might mistakenly accept the null hypothesis anyway. Just as we might see
an unusually high number of heads for a true coin, we might coincidentally see nearly equal
numbers of heads and tails with a coin that was actually weighted. The case where the
null hypothesis is not true but we fail to reject it is known as a Type II error.
Type II error is usually expressed as power, which is 1 minus the probability of making a
Type II error. Power is the proportion of the time that you correctly reject the
null hypothesis when it isn't true. Notice that the power of a hypothesis test changes
depending on what the alternative is. For example, it makes sense that if we had a
coin that was so heavily weighted that it usually had a probability of 0.8 of getting heads,
we'd have more power in detecting the discrepancy from the null distribution than if our
coin was only weighted to give a 0.55 probability of heads. Also, unlike Type I error, we
can't set a Type II error rate before we start the experiment; the only way to change
the Type II error rate is to get more samples or redesign the experiment.
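To make this concrete, we can compute the power of the earlier decision rule (reject if we
see fewer than 40 or more than 60 heads out of 100) against the two weighted coins mentioned
above; the reject_prob function here is just a sketch, and the values in the comments are
approximate.
> # probability of rejecting, as a function of the coin's real chance of heads
> reject_prob <- function(p) pbinom(39,100,p) + (1 - pbinom(60,100,p))
> reject_prob(.8)    # heavily weighted coin: power is essentially 1
> reject_prob(.55)   # slightly weighted coin: power is only about 0.13
> reject_prob(.5)    # a true coin: this is just the Type I error rate again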