One of the most common tests in statistics is the t-test, used to determine whether the means of two groups are equal to each other. The assumption for the test is that both groups are sampled from normal distributions with equal variances. The null hypothesis is that the two means are equal, and the alternative is that they are not. It is known that under the null hypothesis, we can calculate a t-statistic that will follow a t-distribution with n1 + n2 - 2 degrees of freedom. There is also a widely used modification of the t-test, known as Welch's t-test that adjusts the number of degrees of freedom when the variances are thought not to be equal to each other. Before we can explore the test much further, we need to find an easy way to calculate the t-statistic.

The function `t.test` is available in R for performing t-tests.
Let's test it out on a simple example, using data simulated from a normal
distribution.

> x = rnorm(10) > y = rnorm(10) > t.test(x,y) Welch Two Sample t-test data: x and y t = 1.4896, df = 15.481, p-value = 0.1564 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.3221869 1.8310421 sample estimates: mean of x mean of y 0.1944866 -0.5599410

Before we can use this function in a simulation, we need to find out how to
extract the t-statistic (or some other quantity of interest) from the output
of the `t.test` function. For this function, the R help page has a
detailed list of what the object returned by the function contains. A general
method for a situation like this is to use the `class` and `names`
functions to find where the quantity of interest is. In addition, for some
hypothesis tests, you may need to pass the object from the hypothesis test to
the `summary` function and examine its contents. For `t.test` it's
easy to figure out what we want:

> ttest = t.test(x,y) > names(ttest) [1] "statistic" "parameter" "p.value" "conf.int" "estimate" [6] "null.value" "alternative" "method" "data.name"

The value we want is named "`statistic`". To extract it, we can use
the dollar sign notation, or double square brackets:

> ttest$statistic t 1.489560 > ttest[['statistic']] t 1.489560

Of course, just one value doesn't let us do very much - we need to generate
many such statistics before we can look at their properties. In R, the
`replicate` function makes this very simple. The first argument to
`replicate` is the number of samples you want, and the second argument
is an expression (not a function name or definition!) that will generate one of
the samples you want. To generate 1000 t-statistics from testing two groups of
10 standard random normal numbers, we can use:

> ts = replicate(1000,t.test(rnorm(10),rnorm(10))$statistic)

Under the assumptions of normality and equal variance, we're assuming that the statistic will have a t-distribution with 10 + 10 - 2 = 18 degrees of freedom. (Each observation contributes a degree of freedom, but we lose two because we have to estimate the mean of each group.) How can we test if that is true?

One way is to plot the theoretical density of the t-statistic we should be seeing, and superimposing the density of our sample on top of it. To get an idea of what range of x values we should use for the theoretical density, we can view the range of our simulated data:

> range(ts) > range(ts) [1] -4.564359 4.111245

Since the distribution is supposed to be symmetric, we'll use a range from
-4.5 to 4.5. We can generate equally spaced x-values in this range with
`seq`:

> pts = seq(-4.5,4.5,length=100) > plot(pts,dt(pts,df=18),col='red',type='l')

Now we can add a line to the plot showing the density for our simulated sample:

> lines(density(ts))

The plot appears below.

Another way to compare two densities is with a quantile-quantile plot. In
this type of plot, the quantiles of two samples are calculated at a variety of
points in the range of 0 to 1, and then are plotted against each other. If the
two samples came from the same distribution with the same parameters, we'd see
a straight line through the origin with a slope of 1; in other words, we're
testing to see if various quantiles of the data are identical in the two
samples. If the two samples came from similar distributions, but their
parameters were different, we'd still see a straight line, but not through the
origin. For this reason, it's very common to draw a straight line through the
origin with a slope of 1 on plots like this. We can produce a
quantile-quantile plot (or QQ plot as they are commonly known), using the
`qqplot` function. To use qqplot, pass it two vectors that contain the
samples that you want to compare. When comparing to a theoretical
distribution, you can pass a random sample from that distribution. Here's a QQ
plot for the simulated t-test data:

> qqplot(ts,rt(1000,df=18)) > abline(0,1)

We can see that the central points of the graph seems to agree fairly well, but there
are some discrepancies near the tails (the extreme values on either end of the
distribution). The tails of a distribution are the most
difficult part to accurately measure, which is unfortunate, since those are often the
values that interest us most, that is, the ones which will provide us with enough
evidence to reject a null hypothesis. Because the tails of a distribution are so
important, another way to test to see if a distribution of a sample follows some
hypothesized distribution is to calculate the quantiles of some tail probabilities
(using the `quantile` function) and compare them to the theoretical probabilities
from the distribution (obtained from the function for that distribution whose first
letter is "q"). Here's such a comparison for our simulated data:

> probs = c(.9,.95,.99) > quantile(ts,probs) 90% 95% 99% 1.427233 1.704769 2.513755 > qt(probs,df=18) [1] 1.330391 1.734064 2.552380

The quantiles agree fairly well, especially at the .95 and .99 quantiles. Performing more simulations, or using a large sample size for the two groups would probably result in values even closer to what we have theoretically predicted.

One final method for comparing distributions is worth mentioning. We noted previously
that one of the assumptions for the t-test is that the variances of the two samples
are equal. However, a modification of the t-test known as Welch's test is said to
correct for this problem by estimating the variances, and adjusting the degrees of
freedom to use in the test. This correction is performed by default, but can be
shut off by using the `var.equal=TRUE` argument. Let's see how it works:

> t.test(x,y) Welch Two Sample t-test data: x and y t = -0.8103, df = 17.277, p-value = 0.4288 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.0012220 0.4450895 sample estimates: mean of x mean of y 0.2216045 0.4996707 > t.test(x,y,var.equal=TRUE) Two Sample t-test data: x and y t = -0.8103, df = 18, p-value = 0.4284 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.9990520 0.4429196 sample estimates: mean of x mean of y 0.2216045 0.4996707

Since the statistic is the same in both cases, it doesn't matter whether we use the correction or not; either way we'll see identical results when we compare the two methods using the techniques we've already described. Since the degree of freedom correction changes depending on the data, we can't simply perform the simulation and compare it to a different number of degrees of freedom. The other thing that changes when we apply the correction is the p-value that we would use to decide if there's enough evidence to reject the null hypothesis. What is the behaviour of the p-values? While not necessarily immediately obvious, under the null hypothesis, the p-values for any statistical test should form a uniform distribution between 0 and 1; that is, any value in the interval 0 to 1 is just as likely to occur as any other value. For a uniform distribution, the quantile function is just the identity function. A value of .5 is greater than 50% of the data; a value of .95 is greater than 95% of the data. As a quick check of this notion, let's look at the density of probability values when the null hypothesis is true:

> tps = replicate(1000,t.test(rnorm(10),rnorm(10))$p.value) > plot(density(tps))

The graph appears below.

Another way to check to see if the probabilities follow a uniform distribution is with a QQ plot:

> qqplot(tps,runif(1000)) > abline(0,1)

The graph appears below.

The idea that the probabilities follow a uniform distribution seems reasonable.

Now, let's look at some of the quantiles of the p-values when we force
the `t.test` function to use `var.equal=TRUE`:

> tps = replicate(1000,t.test(rnorm(10),rnorm(10),var.equal=TRUE)$p.value) > probs = c(.5,.7,.9,.95,.99) > quantile(tps,probs) 50% 70% 90% 95% 99% 0.4873799 0.7094591 0.9043601 0.9501658 0.9927435

The agreement actually looks very good. What about when we let `t.test`
decide whether to make the correction or not?

> tps = replicate(1000,t.test(rnorm(10),rnorm(10))$p.value) > quantile(tps,probs) 50% 70% 90% 95% 99% 0.4932319 0.7084562 0.9036533 0.9518775 0.9889234

There's not that much of a difference, but, of course, the variances in this example were equal. How does the correction work when the variances are not equal?

> tps = replicate(1000,t.test(rnorm(10),rnorm(10,sd=5),var.equal=TRUE)$p.value) > quantile(tps,probs) 50% 70% 90% 95% 99% 0.5221698 0.6926466 0.8859266 0.9490947 0.9935562 > tps = replicate(1000,t.test(rnorm(10),rnorm(10,sd=5))$p.value) > quantile(tps,probs) 50% 70% 90% 95% 99% 0.4880855 0.7049834 0.8973062 0.9494358 0.9907219

There is an improvement, but not so dramatic.

Of course, all of this is concerned with the null hypothesis. Now let's
start to investigate the power of the t-test. With a sample size of 10, we
obviously aren't going to expect truly great performance, so let's consider a
case that's not too subtle. When we don't specify a standard deviation for
`rnorm` it uses a standard deviation of 1. That means about 68% of the
data will fall in the range of -1 to 1. Suppose we have a difference in means
equal to just one standard deviation, and we want to calculate the power for
detecting that difference. We can follow the same procedure as the coin
tossing experiment: specify an alpha level, calculate the rejection region,
simulate data under the alternative hypothesis, and see how many times we'd
reject the null hypothesis. As in the coin toss example, a function will make
things much easier:

t.power = function(nsamp=c(10,10),nsim=1000,means=c(0,0),sds=c(1,1)){ lower = qt(.025,df=sum(nsamp) - 2) upper = qt(.975,df=sum(nsamp) - 2) ts = replicate(nsim, t.test(rnorm(nsamp[1],mean=means[1],sd=sds[1]), rnorm(nsamp[2],mean=means[2],sd=sds[2]))$statistic) sum(ts < lower | ts > upper) / nsim }

Let's try it with our simple example:

> t.power(means=c(0,1)) [1] 0.555

Not bad for a sample size of 10!

Of course, if the differences in means are smaller, it's going to be harder to reject the null hypothesis:

> t.power(means=c(0,.3)) [1] 0.104

How large a sample size would we need to detect that difference of .3 with 95% power?

> samps = c(100,200,300,400,500) > res = sapply(samps,function(n)t.power(means=c(0,.3),nsamp=c(n,n))) > names(res) = samps > res 100 200 300 400 500 0.567 0.841 0.947 0.992 0.999

It would take over 300 samples in each group to be able to detect such a difference.

Now we can return to the issue of unequal variances. We saw that Welch's adjustment to the degrees of freedom helped a little bit under the null hypothesis. Now let's see if the power of the test is improved using Welch's test when the variances are unequal. To do this, we'll need to modify our `t.power` function a little:

t.power1 = function(nsamp=c(10,10),nsim=1000,means=c(0,0),sds=c(1,1),var.equal=TRUE){ tps = replicate(nsim, t.test(rnorm(nsamp[1],mean=means[1],sd=sds[1]), rnorm(nsamp[2],mean=means[2],sd=sds[2]))$p.value) sum(tps < .025 | tps > .975) / nsim }

Since I set `var.equal=TRUE` by default, Welch's adjustment will not
be used unless we specify `var.equal=FALSE`. Let's see what the power
is for a sample of size 10, assuming the mean of one of the groups is 1, and
its standard deviation is 2, while the other group is left at the default of
mean=0 and sd=1:

> t.power1(nsim=10000,sds=c(1,2),mean=c(1,2)) [1] 0.1767 > t.power1(nsim=10000,sds=c(1,2),mean=c(1,2),var.equal=FALSE) [1] 0.1833

There does seem to be an improvement, but not so dramatic.

We can look at the same thing for a variety of sample sizes:

> res1 = sapply(sizes,function(n)t.power1(nsim=10000,sds=c(1,2), + mean=c(1,2),nsamp=c(n,n))) > names(res1) = sizes > res1 10 20 50 100 0.1792 0.3723 0.8044 0.9830 > res2 = sapply(sizes,function(n)t.power1(nsim=10000,sds=c(1,2), + mean=c(1,2),nsamp=c(n,n),var.equal=FALSE)) > names(res2) = sizes > res2 10 20 50 100 0.1853 0.3741 0.8188 0.9868