Hypothesis Testing

1  Hypothesis Testing

Much of classical statistics is concerned with the idea of hypothesis testing. This is a formal framework that we can use to pose questions about a variety of topics in a consistent form that lets us apply statistical techniques to make statements about how results that we've gathered relate to questions that we're interested in. If we carefully follow the rules of hypothesis testing, then we can confidently talk about the probability of events that we've observed under a variety of hypotheses, and put our faith in those hypotheses that seem the most reasonable. Central to the idea of hypothesis testing is the notion of null and alternative hypotheses. We want to present our question in the form of a statement that assumes that nothing has happened (the null hypothesis) versus a statement that describes a specific relation or condition that might better explain our data.
For example, suppose we've got a coin, and we want to find out if it's true, that is, whether we're just as likely to see heads as tails when we flip it. A null hypothesis for this situation could be
H0: We're just as likely to get heads as tails when we flip the coin.
A suitable alternative hypothesis might be:
Ha: We're more likely to see either heads or tails when we flip the coin.
An alternative hypothesis such as this is called two-sided, since we'll reject the null hypothesis if heads are more likely or if tails are more likely. We could use a one-sided alternative hypothesis like "We're more likely to see tails than heads," but unless you've got a good reason to believe that it's absolutely impossible to see more heads than tails, we usually stick to two-sided alternative hypotheses.
Now we need to perform an experiment in order to help test our hypothesis. Obviously, tossing the coin once or twice and counting up the results won't help very much. Suppose I say that in order to test the null hypothesis that heads are just as likely as tails, I'm going to toss the coin 100 times and record the results. Carrying out this experiment, I find that I saw 55 heads and 45 tails. Can I safely say that the coin is true? Without some further guidelines, it would be very hard to say if this deviation from 50/50 really provides much evidence one way or the other. To proceed any further, we have to have some notion of what we'd expect to see in the long run if the null hypothesis were true. Then, we can come up with some rule to use in deciding if the coin is true or not, depending on how willing we are to make a mistake. Sadly, we can't make statements with complete certainty, because there's always a chance that, even if the coin were true, we'd happen to see more heads than tails, or, conversely, if the coin were weighted, we might just happen to see nearly equal numbers of heads and tails when we tossed it many times. The way we come up with our rule is by stating some assumptions about what we'd expect to see if the null hypothesis were true, and then making a decision rule based on those assumptions. For tossing a fair coin (which is what the null hypothesis states), most statisticians agree that the number of heads (or tails) that we would expect follows what is called a binomial distribution. This distribution takes two parameters: the theoretical probability of the event in question (let's say the event of getting a head when we toss the coin), and the number of times we toss the coin. Under the null hypothesis, the probability is .5. In R, we can see the expected probability of getting any particular number of heads if we tossed a coin 100 times by plotting the density function for the binomial distribution with parameters 100 and .5:
> barplot(dbinom(1:100,100,.5),names.arg=1:100)

As you'd probably expect, the most common value we'd see would be 50, but there are lots of reasonable values that aren't exactly 50 that would also occur pretty often. So to make a decision rule for when we will reject the null hypothesis, we first decide how often we're willing to reject the null hypothesis when it is actually true. For example, looking at the graph, there's a tiny probability that we'd see as few as 30 heads or as many as 70 heads when we tossed a true coin 100 times. So if we tossed a coin 100 times and only saw 30 heads, we'd be wise to say the coin wasn't true, because 30 heads out of 100 would be such a rare event for a fair coin. On the other hand, we can see on the graph that the probability of getting, say, 45 heads is still pretty high, and we might very well want to accept the null hypothesis in this case. Without any compelling reason to choose otherwise, people usually accept an error rate of 5% when performing a hypothesis test, although you might want to alter that value in situations where making such an error is more or less of a problem than usual. With a two-sided hypothesis like this one, to come up with a decision rule that would be wrong only 5% of the time, we need to find the number of heads below which we'd fall only 2.5% of the time, and the number of heads above which we'd fall only 2.5% of the time. To find quantities like this for the binomial distribution, we can use the qbinom function. The lower limit we want is the result of calling qbinom with an argument of 0.025, and the upper limit is the result of calling qbinom with an argument of 0.975:
> qbinom(.025,100,.5)
[1] 40
> qbinom(.975,100,.5)
[1] 60

So in this experiment, if the number of heads we saw was between 40 and 60 (out of 100 tosses), we would tentatively accept the null hypothesis; formally, we would say that there's not enough evidence to reject the null hypothesis. If we saw fewer than 40 or more than 60 heads, we would say we had enough evidence to reject the null hypothesis, knowing that we would be wrong only 5% of the time. Rejecting the null hypothesis when it is actually true is called a Type I error, so in this example we're setting the Type I error rate to 5%.
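It's worth noting that base R packages this whole procedure in the binom.test function, which reports a p-value for an observed count (a one-sided test can be requested through its alternative argument). As a quick sketch, for 55 heads out of 100 the reported two-sided p-value should be about 0.37, well above 0.05, so we'd fail to reject the null hypothesis:

> binom.test(55,100,p=.5,alternative="two.sided")
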
To summarize, the steps for performing a hypothesis test are:
  1. Describe a null hypothesis and an alternative hypothesis.
  2. Specify a significance or alpha level for the hypothesis test. This is the fraction of the time that you're willing to reject the null hypothesis when it is actually true.
  3. Formulate some assumptions about the distribution of the statistic that's involved in the hypothesis test. In this example we made the assumption that a fair coin follows a binomial distribution with p=0.50.
  4. Using the assumptions you made and the alpha level you decided on, construct the rejection region for the test, that is, the values of the statistic for which you'll be willing to reject the null hypothesis. In this example, the rejection region is broken up into two sections: less than 40 heads and more than 60 heads.
Although it's not always presented in this way, one useful way to think about hypothesis testing is that if you follow the above steps, and your test statistic is in the rejection region then either the null hypothesis is not reasonable or your assumptions are wrong. By following this framework, we can test a wide variety of hypotheses, and it's always clear exactly what we're assuming about the data.
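As a minimal sketch of these steps in code (coin.test here is a hypothetical helper, not a standard R function), we could wrap the whole procedure up as:

> coin.test = function(heads, ntoss=100, alpha=.05){
+     # step 4: rejection region from the assumed binomial null distribution
+     lower = qbinom(alpha/2, ntoss, .5)
+     upper = qbinom(1 - alpha/2, ntoss, .5)
+     # TRUE means reject the null hypothesis at the given alpha level
+     heads < lower | heads > upper
+ }
> coin.test(55)
[1] FALSE
> coin.test(70)
[1] TRUE
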
Most of hypothesis testing is focused on the null hypothesis, and on possibly rejecting it. There are two main reasons for this. First, there is only one null hypothesis, so we don't have to consider a variety of possibilities. Second, as we've seen, it's very easy to control the error rate when rejecting the null hypothesis. The other side of the coin has to do with the case where the null hypothesis is not true. This case is much more complicated, because there are many different ways for the null hypothesis to be false, and it would be naive to believe that they'd all behave the same way. When the null hypothesis is not true, we might mistakenly accept the null hypothesis anyway. Just as we might see an unusually high number of heads for a true coin, we might coincidentally see nearly equal numbers of heads and tails with a coin that was actually weighted. The case where the null hypothesis is not true but we fail to reject it is called a Type II error. Type II error is usually expressed as power, which is 1 minus the probability of making a Type II error; power is the proportion of the time that you correctly reject the null hypothesis when it isn't true. Notice that the power of a hypothesis test changes depending on what the alternative is: it makes sense that if we had a coin weighted so heavily that it had a probability of 0.8 of coming up heads, we'd have more power to detect the discrepancy from the null distribution than if the coin were only weighted to give a 0.55 probability of heads. Also, unlike the Type I error rate, we can't set the Type II error rate before we start the analysis; the only way to change it is to get more samples or redesign the experiment.

2  Determining Power

For many years, answering questions about power required complex mathematical computations, and power was often ignored because it was too difficult, or even impossible, to calculate. Even when it was easy to calculate, the fact that power varies depending on the alternative hypothesis makes considering power a more difficult task than thinking about Type I error. But more recently, statisticians have come to realize that it's often a good idea to estimate power for alternative hypotheses of interest.
Using computers, we can calculate the power of any alternative hypothesis we please by doing the following:
  1. Follow the general hypothesis testing procedure outlined above.
  2. Repeatedly generate simulated data corresponding to an alternative hypothesis that you're interested in. For each set of data that you simulate (sometimes referred to as a trial), calculate the statistic, and decide whether to reject the null hypothesis or not.
  3. When you're done with all the trials, count how often you (correctly) rejected the null hypothesis, and divide by the total number of trials to obtain the power of the experiment for that specific alternative distribution.
To generate random values from the binomial distribution, simulating coin tosses, we can use the rbinom function. You pass this function the number of random values you'd like, the number of tosses in each experiment, and the probability of heads. Before diving into the power computation, let's verify that, when the null hypothesis is true, using the rejection region we developed previously will actually result in a Type I error rate of 5%. First, let's simulate a single set of 100 tosses a few times to see if the results make sense:
> rbinom(1,100,.5)
[1] 45
> rbinom(1,100,.5)
[1] 52
> rbinom(1,100,.5)
[1] 52
> rbinom(1,100,.5)
[1] 46

That seems reasonable. Now let's generate 1000 such samples, and count how many of them had less than 40 heads or more than 60 heads:
> rr = rbinom(1000,100,.5);sum(rr<40|rr>60)/1000
[1] 0.042
> rr = rbinom(1000,100,.5);sum(rr<40|rr>60)/1000
[1] 0.03
> rr = rbinom(1000,100,.5);sum(rr<40|rr>60)/1000
[1] 0.043
> rr = rbinom(1000,100,.5);sum(rr<40|rr>60)/1000
[1] 0.029

The values are a little low, but that's because the binomial distribution is discrete, and we had to choose the nearest integer values for the rejection region. To see the exact probabilities that the values 40 and 60 represent, we can use the pbinom function:
> pbinom(40,100,.5)
[1] 0.02844397
> pbinom(60,100,.5)
[1] 0.9823999
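
Note that the rejection region excludes 40 and 60 themselves, so the exact Type I error rate of our rule is the probability of 39 or fewer heads plus the probability of 61 or more, which agrees nicely with the simulated values above:

> pbinom(39,100,.5) + (1 - pbinom(60,100,.5))
[1] 0.0352002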

If we make the rejection region any larger, the Type I error rate climbs above 5%:
> rr = rbinom(1000,100,.5);sum(rr<41|rr>59)/1000
[1] 0.054
> rr = rbinom(1000,100,.5);sum(rr<41|rr>59)/1000
[1] 0.064
> rr = rbinom(1000,100,.5);sum(rr<41|rr>59)/1000
[1] 0.056
> rr = rbinom(1000,100,.5);sum(rr<41|rr>59)/1000
[1] 0.068
> rr = rbinom(1000,100,.5);sum(rr<41|rr>59)/1000
[1] 0.064

So it seems that using fewer than 40 or more than 60 heads as the rejection region gives us a Type I error rate as close to 5% as possible without exceeding it.
Now let's ask a much more interesting question. If I had a coin that really had a 55% chance of heads instead of a 50% chance, and I tossed the coin 100 times, how often would I correctly reject the null hypothesis? In other words, what is the power of the test against the alternative that the actual chance of getting heads is 55%? I can do exactly the same thing:
> rr = rbinom(1000,100,.55);sum(rr<40|rr>60)/1000
[1] 0.139
> rr = rbinom(1000,100,.55);sum(rr<40|rr>60)/1000
[1] 0.112
> rr = rbinom(1000,100,.55);sum(rr<40|rr>60)/1000
[1] 0.134
> rr = rbinom(1000,100,.55);sum(rr<40|rr>60)/1000
[1] 0.133
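
These simulated values can be checked against an exact calculation: the power is just the probability, under the 55% alternative, of landing in the rejection region (which, again, excludes 40 and 60 themselves):

> pbinom(39,100,.55) + (1 - pbinom(60,100,.55))   # about 0.13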

We'd only correctly identify the coin as being untrue in about 13% of the cases. It's better than the 5% we'd expect when the coin was true, but not very impressive. Maybe tossing the coin more times would help. Let's see what happens to the power when we do 500 tosses instead of just 100. First we need to find the correct rejection region:
> qbinom(.975,500,.5)
[1] 272
> qbinom(.025,500,.5)
[1] 228

Now we can try the simulation:
> rr = rbinom(1000,500,.55);sum(rr<228|rr>272)/1000
[1] 0.595
> rr = rbinom(1000,500,.55);sum(rr<228|rr>272)/1000
[1] 0.587
> rr = rbinom(1000,500,.55);sum(rr<228|rr>272)/1000
[1] 0.577
> rr = rbinom(1000,500,.55);sum(rr<228|rr>272)/1000
[1] 0.61
> rr = rbinom(1000,500,.55);sum(rr<228|rr>272)/1000
[1] 0.611

That's quite a bit better. But suppose we wanted to find out how big a sample we'd need to be able to detect the 55% coin 95% of the time. Probably the easiest way is to try a bunch of different numbers in the simulation. It's getting complicated enough that we should write a function:
> coin.power = function(ntoss=100,nsim=1000,prob=.5){
+     # rejection region for a fair coin at the 5% level
+     lower = qbinom(.025,ntoss,.5)
+     upper = qbinom(.975,ntoss,.5)
+     # simulate nsim experiments, each of ntoss tosses with the given prob of heads
+     rr = rbinom(nsim,ntoss,prob)
+     # proportion of experiments in which the null hypothesis was rejected
+     sum(rr < lower | rr > upper) / nsim
+ }

We can use sapply to call the coin.power function for a variety of numbers of tosses:
> ntosses = c(10,100,200,500,600,800,1000,1500,2000,2500)
> res = sapply(ntosses,coin.power,prob=.55)
> names(res) = ntosses
> res
   10   100   200   500   600   800  1000  1500  2000  2500
0.040 0.135 0.289 0.583 0.697 0.773 0.884 0.967 0.989 0.999
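
The table suggests that the power crosses 95% somewhere between 1000 and 1500 tosses. To pin the crossing point down, we could simply rerun the simulation on a finer grid of toss counts, keeping in mind that the results will wobble a little from run to run because they come from simulations:

> sapply(seq(1100,1400,by=100),coin.power,prob=.55)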

Intuitively, if the coin was more dishonest, we'd expect our test to have better power. Let's try the same thing with a coin that's weighted to give 70% heads:
> res = sapply(ntosses,coin.power,prob=.70)
> names(res) = ntosses
> res
   10   100   200   500   600   800  1000  1500  2000  2500
0.153 0.986 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

With just 100 tosses, we're pretty much guaranteed to be able to recognize the coin as weighted. What if the coin was just barely weighted, say 51% instead of 50%:
> res = sapply(ntosses,coin.power,prob=.51)
> names(res) = ntosses
> res
   10   100   200   500   600   800  1000  1500  2000  2500
0.019 0.035 0.051 0.067 0.071 0.087 0.084 0.129 0.125 0.158

Even after 2500 tosses, we'd only have about a 16% chance of detecting the weighted coin.
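
Since coin.power accepts the number of tosses as an argument, we can keep pushing the sample size to get a feel for what detecting such a small bias would take; a standard normal-approximation calculation suggests that achieving even 80% power against a 51% coin requires on the order of 20,000 tosses:

> sapply(c(5000,10000,20000),coin.power,prob=.51)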

3  Probability Distributions

The naming conventions used for the binomial probability functions in the previous section apply consistently to all of the distributions supported by R. To summarize:
  1. To get an overall idea of how values are distributed, we can examine (usually graphically), the probability density function for a distribution. For example, the familiar bell curve is the density function for a normal distribution. In R, density functions begin with the letter "d".
  2. To find out what value corresponds to a particular probability (like 0.025 and 0.975 in the previous example), we can call the quantile function for the distribution. These functions accept an argument between 0 and 1 (i.e. a probability) and return the value for that distribution where the probability that an observation would be smaller than that value is equal to the probability passed to the function. When you perform a statistical test, choose a significance level, and look up the statistic's value in a table, you're actually using the quantile function. Quantile functions in R begin with the letter "q".
  3. If you've got a value for an observation that comes from a particular distribution, and you want to know the probability that observations from the distribution would be less than that value, you can pass the value to a probability function (like pbinom) and it will return the probability. In R, these functions begin with the letter "p".
  4. When you want to generate random numbers from a distribution, you can call the appropriate random number generator. Random number generators in R begin with the letter "r".
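For example, the four functions for the normal distribution are dnorm, qnorm, pnorm, and rnorm:

> dnorm(0)           # height of the standard normal density at 0
[1] 0.3989423
> qnorm(.975)        # the familiar 97.5% quantile
[1] 1.959964
> pnorm(1.959964)    # and back again
[1] 0.975
> rnorm(2)           # two random standard normal values
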
The following table shows the distributions for which these four types of functions are available in R. In each case, precede the name in the table with a d, q, p, or r to get the function you want.
  beta      Beta                   binom     Binomial
  cauchy    Cauchy                 chisq     Chi-squared
  exp       Exponential            f         F
  gamma     Gamma                  geom      Geometric
  hyper     Hypergeometric         lnorm     Lognormal
  logis     Logistic               nbinom    Negative Binomial
  norm      Normal                 pois      Poisson
  signrank  Signed Rank            t         Student's t
  unif      Uniform                weibull   Weibull
  wilcox    Wilcoxon Rank Sum

4  A Note about Random Numbers in R

Some people like to make a distinction between truly random numbers (like those formed from physically tossing dice or counting the number of clicks of a Geiger counter in a given time interval) and the kinds of random numbers that are generated on a computer. (If you think the second example is impractical, take a look at http://www.fourmilab.ch/hotbits/. Another source of truly random numbers can be found at http://www.lavarnd.org.)
Some chip manufacturers are developing chips that allow noise in the circuitry (which is truly random) to be measured and used as a source of random numbers. But the programs in R that generate random numbers produce them by applying a reproducible set of rules (a computer program), so they are not really random, and some people refer to them as pseudo-random numbers for that reason. In general, we won't make any such distinction between truly random numbers and the ones we get from the computer.
In R and most other computer programs, all of the random number generators are based on the uniform distribution. By developing programs that are guaranteed to give random numbers with the right distribution when supplied with values from a uniform random number generator, only one generator (the uniform) needs to be studied to ensure that it behaves like a truly random source. The uniform random number generator in R has been extensively tested, so you can have confidence that all the random number generators in R are reliable.
While it's obviously a useful feature that R's default behavior is to give you a different set of random numbers each time you call any of the random number generators, sometimes this creates problems or misunderstandings. Because of this, R provides a simple way to set the random number generators to a reproducible state, so that the exact same sequence of random numbers will be generated every time a program is run. By passing an integer to the set.seed function, you can ensure that the sequence of random numbers that is generated will be the same every time you run an identical program.
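For example (the seed value 19 here is arbitrary):

> set.seed(19)
> x1 = rbinom(1000,100,.5)
> set.seed(19)
> x2 = rbinom(1000,100,.5)
> identical(x1,x2)    # resetting the seed reproduces the same values
[1] TRUE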

