Approximate Hypothesis Tests: the z Test and the t Test

This chapter presents two common tests of the hypothesis that a population mean equals a particular value and of the hypothesis that two population means are equal: the z test and the t test. These tests are approximate: They are based on approximations to the probability distribution of the test statistic when the null hypothesis is true, so their significance levels are not exactly what they claim to be. If the sample size is reasonably large and the population from which the sample is drawn has a nearly normal distribution—a notion defined in this chapter—the nominal significance levels of the tests are close to their actual significance levels. If these conditions are not met, the significance levels of the approximate tests can differ substantially from their nominal values. The z test is based on the normal approximation; the t test is based on Student's t curve, which approximates some probability histograms better than the normal curve does. The chapter also presents the deep connection between hypothesis tests and confidence intervals, and shows how to compute approximate confidence intervals for the population mean of nearly normal populations using Student's t curve.

z tests

In an earlier chapter, we constructed the z test for equality of two percentages using independent random samples from the two populations. The original test statistic was the difference \(\phi^{t-c}\) between the two independent sample percentages. If the null hypothesis that the two population percentages are equal is true, the expected value of the test statistic, \(E(\phi^{t-c})\), is zero. If, in addition, the sample sizes are large, we can estimate \(SE(\phi^{t-c})\) accurately using the pooled bootstrap estimate of the SD of the "null box":

\[ SD(box) \approx s^*=(\phi\times(1- \phi))^{1/2}, \]

where \(\phi\) is the pooled sample percentage of the two samples. The estimate of \(SE(\phi^{t-c})\) under the null hypothesis is

\[ se = s^*\times(1/n_t + 1/n_c)^{1/2}, \]

where \(n_t\) and \(n_c\) are the sizes of the two samples. If the null hypothesis is true, the Z statistic,

\[ Z=\phi^{t-c}/se, \]

is the original test statistic \(\phi^{t-c}\) in approximately standard units, and Z has a probability histogram that is approximated well by the normal curve, which allowed us to select the rejection region for the approximate test.
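To make the computation concrete, here is a minimal Python sketch of this two-sample z test for percentages, assuming SciPy is available for areas under the normal curve. The counts and sample sizes are hypothetical, and percentages are written as fractions between 0 and 1.

```python
from scipy.stats import norm

# Hypothetical data: number of tickets labeled "1" and sample size in each group.
n_t, x_t = 200, 86      # treatment sample
n_c, x_c = 250, 90      # control sample

phi_t = x_t / n_t                        # treatment sample percentage
phi_c = x_c / n_c                        # control sample percentage
phi_pooled = (x_t + x_c) / (n_t + n_c)   # pooled sample percentage

# Pooled bootstrap estimate of SD(box), and of SE(phi_t - phi_c) under the null.
s_star = (phi_pooled * (1 - phi_pooled)) ** 0.5
se = s_star * (1 / n_t + 1 / n_c) ** 0.5

Z = (phi_t - phi_c) / se                 # approximately standard units under the null
p_two_tail = 2 * norm.cdf(-abs(Z))       # two-tail P value from the normal curve

print(round(Z, 3), round(p_two_tail, 3))
```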

This strategy—transforming a test statistic approximately to standard units under the assumption that the null hypothesis is true, and then using the normal approximation to determine the rejection region for the test—works to construct approximate hypothesis tests in many other situations, too. The resulting hypothesis test is called a z test. Suppose that we are testing a null hypothesis using a test statistic \(X\), and the following conditions hold: (1) the expected value of \(X\) under the null hypothesis, \(E(X)\), is known; (2) we can compute from the data a number \(se\) that is an accurate estimate of \(SE(X)\) under the null hypothesis; and (3) under the null hypothesis, the normal approximation to the probability histogram of \(X\) is accurate.

Then, under the null hypothesis, the probability histogram of the Z statistic

\[ Z = (X-E(X))/se \]

is approximated well by the normal curve, and we can use the normal approximation to select the rejection region for the test using \(Z\) as the test statistic. If the null hypothesis is true,

\[ P(Z < z_a) \approx a \]

\[ P(Z > z_{1-a} ) \approx a, \]

and

\[ P(|Z| > z_{1-a/2} ) \approx a. \]

These three approximations yield three different z tests of the null hypothesis at approximate significance level \(a\): the left-tail z test, which rejects the null hypothesis if \(Z < z_a\); the right-tail z test, which rejects the null hypothesis if \(Z > z_{1-a}\); and the two-tail z test, which rejects the null hypothesis if \(|Z| > z_{1-a/2}\).

The word "tail" refers to the tails of the normal curve: In a left-tail test, the probability of a Type I error is approximately the area of the left tail of the normal curve, from minus infinity to \(z_a\). In a right-tail test, the probability of a Type I error is approximately the area of the right tail of the normal curve, from \(z_{1-a}\) to infinity. In a two-tail test, the probability of a Type I error is approximately the sum of the areas of both tails of the normal curve, the left tail from minus infinity to \(z_{a/2}\) and the right tail from \(z_{1-a/2}\) to infinity. All three of these tests are called z tests. The observed value of Z is called the z score.

Which of these three tests, if any, should one use? The answer depends on the probability distribution of Z when the alternative hypothesis is true. As a rule of thumb, if, under the alternative hypothesis, \(E(Z) < 0\), use the left-tail test. If, under the alternative hypothesis, \(E(Z) > 0\), use the right-tail test. If, under the alternative hypothesis, it is possible that \(E(Z) < 0\) and it is possible that \(E(Z) > 0\), use the two-tail test. If, under the alternative hypothesis, \(E(Z) = 0\), consult a statistician. Generally (but not always), this rule of thumb selects the test with the most power for a given significance level.

P values for z tests

Each of the three z tests gives us a family of procedures for testing the null hypothesis at any (approximate) significance level \(a\) between 0 and 100%—we just use the appropriate quantile of the normal curve. This makes it particularly easy to find the P value for a z test. Recall that the P value is the smallest significance level for which we would reject the null hypothesis, among a family of tests of the null hypothesis at different significance levels.

Suppose the z score (the observed value of \(Z\)) is \(x\). In a left-tail test, the P value is the area under the normal curve to the left of \(x\): Had we chosen the significance level \(a\) so that \(z_a=x\), we would have rejected the null hypothesis, but we would not have rejected it for any smaller value of \(a\), because for all smaller values of \(a\), \(z_a < x\). Similarly, for a right-tail z test, the P value is the area under the normal curve to the right of \(x\): If \(x=z_{1-a}\) we would reject the null hypothesis at approximate significance level \(a\), but not at smaller significance levels. For a two-tail z test, the P value is the sum of the area under the normal curve to the left of \(-|x|\) and the area under the normal curve to the right of \(|x|\).
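For instance, the following short sketch (Python, assuming SciPy) computes all three P values from a z score; the z score shown is hypothetical.

```python
from scipy.stats import norm

x = -1.7                          # hypothetical observed z score

p_left  = norm.cdf(x)             # left-tail P value: area to the left of x
p_right = 1 - norm.cdf(x)         # right-tail P value: area to the right of x
p_two   = 2 * norm.cdf(-abs(x))   # two-tail P value: area beyond |x| in both tails

print(round(p_left, 3), round(p_right, 3), round(p_two, 3))
```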

Finding P values and specifying the rejection region for the z test involves the probability distribution of \(Z\) under the assumption that the null hypothesis is true. Rarely is the alternative hypothesis sufficiently detailed to specify the probability distribution of \(Z\) completely, but often the alternative does help us choose intelligently among left-tail, right-tail, and two-tail z tests. This is perhaps the most important issue in deciding which hypothesis to take as the null hypothesis and which as the alternative: We calculate the significance level under the null hypothesis, and that calculation must be tractable.

How close the normal approximations to the significance and power are to the true significance level and power depends on how well the normal curve approximates the probability histogram of the test statistic in standard units. If the original test statistic is a sample sum or a sample mean of draws with replacement (or a sum or difference of independent sample sums or sample means), its probability histogram can be approximated accurately by a normal curve if the sample size is large; this is a consequence of the central limit theorem.

However, to construct a z test, we need to know the expected value and SE of the test statistic under the null hypothesis. Usually it is easy to determine the expected value, but often the SE must be estimated from the data. Later in this chapter we shall see what to do if the SE cannot be estimated accurately, but the shape of the distribution of the numbers in the population is known. The next section develops z tests for the population percentage and mean, and for the difference between two population means.

Examples of z tests

The central limit theorem assures us that the probability histogram of the sample mean of random draws with replacement from a box of tickets—transformed to standard units—can be approximated increasingly well by a normal curve as the number of draws increases. In the previous section, we learned that the probability histogram of a sum or difference of independent sample means of draws with replacement also can be approximated increasingly well by a normal curve as the two sample sizes increase. We shall use these facts to derive z tests for population means and percentages and differences of population means and percentages.

z Test for a Population Percentage

Suppose we have a population of \(N\) units of which \(G\) are labeled "1" and the rest are labeled "0." Let \(p = G/N\) be the population percentage. Consider testing the null hypothesis that \(p = p_0\) against the alternative hypothesis that \(p \ne p_0\), using a random sample of \(n\) units drawn with replacement. (We could assume instead that \(N >> n\) and allow the draws to be without replacement.)

Under the null hypothesis, the sample percentage

\[ \phi = \frac{\mbox{# tickets labeled "1" in the sample}}{n} \]

has expected value \(E(\phi) = p_0\) and standard error

\[ SE(\phi) = \sqrt{\frac{p_0 \times (1 - p_0)}{n}}. \]

Let \(Z\) be \(\phi\) transformed to standard units:

\[ Z = (\phi - p_0)/SE(\phi). \]

Provided \(n\) is large and \(p_0\) is not too close to zero or 100% (say \(n \times p_0 > 30\) and \(n \times (1-p_0) > 30\)), the probability histogram of \(Z\) will be approximated reasonably well by the normal curve, and we can use it as the Z statistic in a z test. For example, if we reject the null hypothesis when \(|Z| > 1.96\), the significance level of the test will be about 5%.
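Here is a minimal sketch of this z test for a population percentage; the counts and the null value \(p_0\) are hypothetical, and the sketch assumes SciPy is available. It simply evaluates the formulas above.

```python
from scipy.stats import norm

# Hypothetical sample: n draws with replacement, x of them labeled "1".
n, x = 400, 176
p0 = 0.40                                  # null hypothesis: population percentage is 40%

phi = x / n                                # sample percentage
se = (p0 * (1 - p0) / n) ** 0.5            # SE(phi) under the null hypothesis
Z = (phi - p0) / se                        # sample percentage in standard units

# Two-tail z test at approximate significance level 5%.
alpha = 0.05
reject = abs(Z) > norm.ppf(1 - alpha / 2)  # compare |Z| with z_{1-a/2}, about 1.96
p_value = 2 * norm.cdf(-abs(Z))

print(round(Z, 3), reject, round(p_value, 3))
```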

z Test for a Population Mean

The approach in the previous subsection applies, mutatis mutandis, to testing the hypothesis that the population mean equals a given value, even when the population contains numbers other than just 0 and 1. However, in contrast to the hypothesis that the population percentage equals a given value, the null hypothesis that a more general population mean equals a given value does not specify the SD of the population, which poses difficulties that are surmountable (by approximation and estimation) if the sample size is large enough. (There are also nonparametric methods that can be used.)

Consider testing the null hypothesis that the population mean \(\mu\) is equal to a specific null value \(\mu_0\), against the alternative hypothesis that \(\mu < \mu_0\), on the basis of a random sample with replacement of size \(n\). Recall that the sample mean \(M\) of \(n\) random draws with or without replacement from a box of numbered tickets is an unbiased estimator of the population mean \(\mu\): If

\[ M = \frac{\mbox{sum of sample values}}{n}, \]

then

\[ E(M) = \mu = \frac{\mbox{sum of population values}}{N}, \]

where \(N\) is the size of the population. The population mean determines the expected value of the sample mean. The SE of the sample mean of a random sample with replacement is

\[ \frac{SD(\mbox{box})}{\sqrt{n}}, \]

where SD(box) is the SD of the list of all the numbers in the box, and \(n\) is the sample size. As a special case, the sample percentage \(\phi\) of \(n\) independent random draws from a 0-1 box is an unbiased estimator of the population percentage \(p\), with SE equal to

\[ \sqrt{\frac{p\times(1-p)}{n}}. \]

In testing the null hypothesis that a population percentage \(p\) equals \(p_0\), the null hypothesis specifies not only the expected value of the sample percentage \(\phi\), it automatically specifies the SE of the sample percentage as well, because the SD of the values in a 0-1 box is determined by the population percentage \(p\):

\[ SD(box) = \sqrt{p\times(1-p)}. \]

The null hypothesis thus gives us all the information we need to standardize the sample percentage under the null hypothesis. In contrast, the SD of the values in a box of tickets labeled with arbitrary numbers bears no particular relation to the mean of the values, so the null hypothesis that the population mean \(\mu\) of a box of tickets labeled with arbitrary numbers equals a specific value \(\mu_0\) determines the expected value of the sample mean, but not the standard error of the sample mean. To standardize the sample mean to construct a z test for the value of a population mean, we need to estimate the SE of the sample mean under the null hypothesis. When the sample size is large, the sample standard deviation \(s\) is likely to be close to the SD of the population, and

\[ se=\frac{s}{\sqrt{n}} \]

is likely to be an accurate estimate of \(SE(M)\). The central limit theorem tells us that when the sample size \(n\) is large, the probability histogram of the sample mean, converted to standard units, is approximated well by the normal curve. Under the null hypothesis,

\[ E(M) = \mu_0, \]

and thus when \(n\) is large

\[ Z = \frac{M-\mu_0}{s/\sqrt{n}} \]

has expected value zero, and its probability histogram is approximated well by the normal curve, so we can use \(Z\) as the Z statistic in a z test. If the alternative hypothesis is \(\mu \ne \mu_0\), then under the alternative hypothesis the expected value of \(Z\) could be either greater than zero or less than zero, so it is appropriate to use a two-tail z test. If the alternative hypothesis is \(\mu > \mu_0\), then under the alternative hypothesis, the expected value of \(Z\) is greater than zero, and it is appropriate to use a right-tail z test. If the alternative hypothesis is \(\mu < \mu_0\), then under the alternative hypothesis, the expected value of \(Z\) is less than zero, and it is appropriate to use a left-tail z test.
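The following sketch carries out this z test for a population mean on simulated data; the data and the null value \(\mu_0\) are made up for illustration, and the three P values correspond to the three possible alternatives.

```python
import random
from statistics import mean, stdev
from scipy.stats import norm

# Hypothetical data: simulate a large random sample drawn with replacement.
random.seed(0)
sample = [random.gauss(10.8, 2.0) for _ in range(100)]
mu_0 = 10.5                       # null hypothesis: population mean is 10.5

n = len(sample)
M = mean(sample)                  # sample mean
s = stdev(sample)                 # sample standard deviation (divisor n - 1)
se = s / n ** 0.5                 # estimated SE of the sample mean

Z = (M - mu_0) / se
p_left  = norm.cdf(Z)             # left-tail P value  (alternative: mu < mu_0)
p_right = 1 - norm.cdf(Z)         # right-tail P value (alternative: mu > mu_0)
p_two   = 2 * norm.cdf(-abs(Z))   # two-tail P value   (alternative: mu != mu_0)

print(round(Z, 3), round(p_left, 3), round(p_right, 3), round(p_two, 3))
```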

z Test for a Difference of Population Means

Consider the problem of testing the hypothesis that two population means are equal, using random samples from the two populations. Different sampling designs lead to different hypothesis testing procedures. In this section, we consider two kinds of random samples from the two populations: paired samples and independent samples, and construct z tests appropriate for each.

Paired Samples

Consider a population of \(N\) individuals, each of whom is labeled with two numbers. For example, the \(N\) individuals might be a group of doctors, and the two numbers that label each doctor might be the annual payments to the doctor by an HMO under the terms of the current contract and under the terms of a proposed revision of the contract. Let the two numbers associated with individual \(i\) be \(c_i\) and \(t_i\). (Think of \(c\) as control and \(t\) as treatment. In this example, control is the current contract, and treatment is the proposed contract.) Let \(\mu_c\) be the population mean of the \(N\) values

\[ \{c_1, c_2, \ldots, c_N \}, \]

and let \(\mu_t\) be the population mean of the \(N\) values

\[ \{t_1, t_2, \ldots, t_N\}. \]

Suppose we want to test the null hypothesis that

\[ \mu = \mu_t - \mu_c = \mu_0 \]

against the alternative hypothesis that \(\mu < \mu_0\). With \(\mu_0=\$0\), this null hypothesis is that the average annual payment to doctors under the proposed revision would be the same as the average payment under the current contract, and the alternative is that on average doctors would be paid less under the new contract than under the current contract. With \(\mu_0=-\$5,000\), this null hypothesis is that the proposed contract would save the HMO an average of $5,000 per doctor, compared with the current contract; the alternative is that under the proposed contract, the HMO would save even more than that. With \(\mu_0=\$1,000\), this null hypothesis is that doctors would be paid an average of $1,000 more per year under the new contract than under the old one; the alternative hypothesis is that on average doctors would be paid less than an additional $1,000 per year under the new contract—perhaps even less than they are paid under the current contract. For the remainder of this example, we shall take \(\mu_0=\$1,000\).

The data on which we shall base the test are observations of both \(c_i\) and \(t_i\) for a sample of \(n\) individuals chosen at random with replacement from the population of \(N\) individuals (or a simple random sample of size \(n << N\)): We select \(n\) doctors at random from the \(N\) doctors under contract to the HMO, record the current annual payments to them, and calculate what the payments to them would be under the terms of the new contract. This is called a paired sample, because the samples from the population of control values and from the population of treatment values come in pairs: one value for control and one for treatment for each individual in the sample. Testing the hypothesis that the difference between two population means is equal to \(\mu_0\) using a paired sample is just the problem of testing the hypothesis that the population mean \(\mu\) of the set of differences

\[ d_i = t_i - c_i, \;\; i= 1, 2, \ldots, N, \]

is equal to \(\mu_0\). Denote the \(n\) (random) observed values of \(c_i\) and \(t_i\) by \(\{C_1, C_2, \ldots, C_n\}\) and \(\{T_1, T_2, \ldots, T_n \}\), respectively. The sample mean \(M\) of the differences between the observed values of \(t_i\) and \(c_i\) is the difference of the two sample means:

\[ M = \frac{(T_1-C_1)+(T_2-C_2) + \cdots + (T_n-C_n)}{n} = \frac{T_1+T_2+ \cdots + T_n}{n} - \frac{C_1+C_2+ \cdots + C_n}{n} \]

\[ = (\mbox{sample mean of observed values of } t_i) - (\mbox{sample mean of observed values of } c_i). \]

\(M\) is an unbiased estimator of \(\mu\), and if \(n\) is large, the normal approximation to its probability histogram will be accurate. The SE of \(M\) is the population standard deviation of the \(N\) values \(\{d_1, d_2, \ldots, d_N\}\), which we shall denote \(SD_d\), divided by the square root of the sample size, \(n^{1/2}\). Let \(sd\) denote the sample standard deviation of the \(n\) observed differences \((T_i - C_i), \;\; i=1, 2, \ldots, n\):

\[ sd = \sqrt{\frac{(T_1-C_1-M)^2 + (T_2-C_2-M)^2 + \cdots + (T_n-C_n-M)^2}{n-1}} \]

(recall that \(M\) is the sample mean of the observed differences). If the sample size \(n\) is large, \(sd\) is very likely to be close to \(SD_d\), and so, under the null hypothesis,

\[ Z = \frac{M-\mu_0}{sd/n^{1/2}} \]

has expected value zero, and when \(n\) is large the probability histogram of \(Z\) can be approximated well by the normal curve. Thus we can use \(Z\) as the Z statistic in a z test of the null hypothesis that \(\mu=\mu_0\). Under the alternative hypothesis that \(\mu<\mu_0\) (doctors on the average are paid less than an additional $1,000 per year under the new contract), the expected value of \(Z\) is less than zero, so we should use a left-tail z test. Under the alternative hypothesis \(\mu\ne\mu_0\) (on average, the difference in average annual payments to doctors is not an increase of $1,000, but some other number instead), the expected value of \(Z\) could be positive or negative, so we would use a two-tail z test. Under the alternative hypothesis that \(\mu>\mu_0\) (on average, under the new contract, doctors are paid more than an additional $1,000 per year), the expected value of \(Z\) would be greater than zero, so we should use a right-tail z test.
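A minimal sketch of the paired-sample computation follows. The payment figures are hypothetical and the sample is far smaller than a real z test would require; it is only meant to show how \(M\), \(sd\), and \(Z\) fit together.

```python
from statistics import mean, stdev
from scipy.stats import norm

# Hypothetical paired observations for a sample of doctors: annual payment under
# the current contract (c) and under the proposed contract (t), in dollars.
c = [148_000, 152_500, 139_000, 160_250, 155_000, 143_750, 158_000, 150_500]
t = [149_500, 151_000, 140_250, 158_500, 156_250, 143_000, 157_500, 151_750]
mu_0 = 1_000                           # null hypothesis: mean difference t - c is $1,000

d = [ti - ci for ti, ci in zip(t, c)]  # observed paired differences
n = len(d)
M = mean(d)                            # sample mean of the differences
sd = stdev(d)                          # sample SD of the differences (divisor n - 1)

Z = (M - mu_0) / (sd / n ** 0.5)
p_left = norm.cdf(Z)                   # left-tail P value (alternative: mu < mu_0)

print(round(Z, 3), round(p_left, 3))
```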

Independent Samples

Consider two separate populations of numbers, with population means \(\mu_t\) and \(\mu_c\), respectively. Let \(\mu=\mu_t-\mu_c\) be the difference between the two population means. We would like to test the null hypothesis that \(\mu=\mu_0\) against the alternative hypothesis that \(\mu>\mu_0\). For example, let \(\mu_t\) be the average annual payment by an HMO to doctors in the Los Angeles area, and let \(\mu_c\) be the average annual payment by the same HMO to doctors in the San Francisco area. Then the null hypothesis with \(\mu_0=0\) is that the HMO pays doctors in the two regions the same amount annually, on average; the alternative hypothesis is that, on average, the HMO pays doctors in the Los Angeles area more than it pays doctors in the San Francisco area. Suppose we draw a random sample of size \(n_t\) with replacement from the first population, and independently draw a random sample of size \(n_c\) with replacement from the second population. Let \(M_t\) and \(M_c\) be the sample means of the two samples, respectively, and let

\[ M = M_t - M_c \]

be the difference between the two sample means. Because the expected value of \(M_t\) is \(\mu_t\) and the expected value of \(M_c\) is \(\mu_c\), the expected value of \(M\) is

\[ E(M) = E(M_t - M_c) = E(M_t) - E(M_c) = \mu_t - \mu_c = \mu. \]

Because the two random samples are independent, \(M_t\) and \(-M_c\) are independent random variables, and the SE of their sum is

\[ SE(M) = (SE^2(M_t) + SE^2(M_c))^{1/2}. \]

Let \(s_t\) and \(s_c\) be the sample standard deviations of the two samples, respectively. If \(n_t\) and \(n_c\) are both very large, the two sample standard deviations are likely to be close to the standard deviations of the corresponding populations, and so \(s_t/n_t^{1/2}\) is likely to be close to \(SE(M_t)\), and \(s_c/n_c^{1/2}\) is likely to be close to \(SE(M_c)\). Therefore, the estimated standard error of the difference

\[ se_\mbox{diff} = ( (s_t/n_t^{1/2})^2 + (s_c/n_c^{1/2})^2)^{1/2} = \sqrt{ s_t^2/n_t + s_c^2/n_c} \]

is likely to be close to \(SE(M)\). Under the null hypothesis, the statistic

\[ Z = \frac{M - \mu_0}{se_\mbox{diff}} = \frac{M_t - M_c - \mu_0}{\sqrt{ s_t^2/n_t + s_c^2/n_c}} \]

has expected value zero and its probability histogram is approximated well by the normal curve, so we can use it as the Z statistic in a z test.

Under the alternative hypothesis

\[ \mu = \mu_t - \mu_c > \mu_0, \]

the expected value of \(Z\) is greater than zero, so it is appropriate to use a right-tail z test.

If the alternative hypothesis were \(\mu \ne \mu_0\), under the alternative the expected value of \(Z\) could be greater than zero or less than zero, so it would be appropriate to use a two-tail z test. If the alternative hypothesis were \(\mu < \mu_0\), under the alternative the expected value of \(Z\) would be less than zero, so it would be appropriate to use a left-tail z test.
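The following sketch illustrates the independent-samples computation on hypothetical payment data; again, the samples shown are far smaller than a real z test would require, and are only meant to display the arithmetic.

```python
from statistics import mean, stdev
from scipy.stats import norm

# Hypothetical independent samples of annual payments (in dollars) to doctors in
# two regions.
la = [151_000, 148_500, 155_250, 149_000, 153_750, 150_500, 152_000, 147_250]
sf = [149_500, 147_000, 150_250, 146_500, 148_750, 151_000, 145_500, 148_000]
mu_0 = 0                               # null hypothesis: the two population means are equal

n_t, n_c = len(la), len(sf)
M = mean(la) - mean(sf)                # difference of the sample means
s_t, s_c = stdev(la), stdev(sf)
se_diff = (s_t**2 / n_t + s_c**2 / n_c) ** 0.5   # estimated SE of the difference

Z = (M - mu_0) / se_diff
p_right = 1 - norm.cdf(Z)              # right-tail P value (alternative: mu_t - mu_c > mu_0)

print(round(Z, 3), round(p_right, 3))
```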

t Tests

For the nominal significance level of the z test for a population mean to be approximately correct, the sample size typically must be large. When the sample size is small, two factors limit the accuracy of the z test: the normal approximation to the probability distribution of the sample mean can be poor, and the sample standard deviation can be an inaccurate estimate of the population standard deviation, so \(se\) is not an accurate estimate of the SE of the test statistic \(Z\). For nearly normal populations, defined in the next subsection, the probability distribution of the sample mean is nearly normal even when the sample size is small, and the uncertainty of the sample standard deviation as an estimate of the population standard deviation can be accounted for by using a curve that is broader than the normal curve to approximate the probability distribution of the (approximately) standardized test statistic. The broader curve is Student's t curve. Student's t curve depends on the sample size: The smaller the sample size, the more spread out the curve.

Nearly Normally Distributed Populations

A list of numbers is nearly normally distributed if the fraction of values in any range is close to the area under the normal curve for the corresponding range of standard units—that is, if the list has mean \(\mu\) and standard deviation SD, and for every pair of values \(a < b\),

\[ \mbox{ the fraction of numbers in the list between } a \mbox{ and } b \approx \mbox{the area under the normal curve between } (a - \mu)/SD \mbox{ and } (b - \mu)/SD. \]

A list is nearly normally distributed if the normal curve is a good approximation to the histogram of the list transformed to standard units. The histogram of a list that is approximately normally distributed is (nearly) symmetric about some point, and is (nearly) bell-shaped.

No finite population can be exactly normally distributed, because the area under the normal curve between every two distinct values is strictly positive—no matter how large or small the values nor how close together they are. No population that contains only a finite number of distinct values can be exactly normally distributed, for the same reason. In particular, populations that contain only zeros and ones are not approximately normally distributed, so results for the sample mean of samples drawn from nearly normally distributed populations need not apply to the sample percentage of samples drawn from 0-1 boxes. Such results will be more accurate for the sample percentage when the population percentage is close to 50% than when the population percentage is close to 0% or 100%, because then the histogram of population values is more nearly symmetric.

Suppose a population is nearly normally distributed. Then a histogram of the population is approximately symmetric about the mean of the population. The fraction of numbers in the population within ±1 SD of the mean of the population is about 68%, the fraction of numbers within ±2 SD of the mean of the population is about 95%, and the fraction of numbers in the population within ±3 SD of the mean of the population is about 99.7%.
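One rough numerical check of near normality is to compare the fraction of the list within 1, 2, and 3 SDs of its mean against 68%, 95%, and 99.7%. The sketch below does this for a hypothetical list; it is only a partial check, since the definition involves every interval, not just these three.

```python
from statistics import mean, pstdev

def within_k_sd(values, k):
    """Fraction of the list within k SDs of the list's mean (population SD, divisor n)."""
    mu = mean(values)
    sd = pstdev(values)
    return sum(abs(v - mu) <= k * sd for v in values) / len(values)

# Hypothetical list; for a nearly normally distributed list, the three fractions
# should be close to 68%, 95%, and 99.7%, respectively.
data = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4, 5.0, 4.95, 5.05, 5.15]
print([round(within_k_sd(data, k), 3) for k in (1, 2, 3)])
```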

Student's t curve

Student's t curve is similar to the normal curve, but broader. It is positive, has a single maximum, and is symmetric about zero. The total area under Student's t curve is 100%. Student's t curve approximates some probability histograms more accurately than the normal curve does. There are actually infinitely many Student t curves, one for each positive integer value of the degrees of freedom. As the degrees of freedom increases, the difference between Student's t curve and the normal curve decreases.

Consider a population of \(N\) units labeled with numbers. Let \(\mu\) denote the population mean of the \(N\) numbers, and let SD denote the population standard deviation of the \(N\) numbers. Let \(M\) denote the sample mean of a random sample of size \(n\) drawn with replacement from the population, and let \(s\) denote the sample standard deviation of the sample. The expected value of \(M\) is \(\mu\), and the SE of \(M\) is \(SD/n^{1/2}\). Let

\[ Z = (M - \mu)/(SD/n^{1/2}). \]

Then the expected value of \(Z\) is zero, the SE of \(Z\) is 1, and if \(n\) is large enough, the normal curve is a good approximation to the probability histogram of \(Z\). The closer to normal the distribution of values in the population is, the smaller \(n\) needs to be for the normal curve to be a good approximation to the distribution of \(Z\). Consider the statistic

\[ T = \frac{M - \mu}{s/n^{1/2}}, \]

which replaces SD by its estimated value (the sample standard deviation \(s\)). If \(n\) is large enough, \(s\) is very likely to be close to SD, so \(T\) will be close to \(Z\); the normal curve will be a good approximation to the probability histogram of \(T\); and we can use \(T\) as the Z statistic in a z test of hypotheses about \(\mu\).

For many populations, when the sample size is small—say less than 25, but the accuracy depends on the population—the normal curve is not a good approximation to the probability histogram of \(T\). For nearly normally distributed populations, when the sample size is intermediate—say 25–100, but again this depends on the population—the normal curve is a good approximation to the probability histogram of \(Z\), but not to the probability histogram of \(T\), because of the variability of the sample standard deviation \(s\) from sample to sample, which tends to broaden the probability distribution of \(T\) (i.e., to make \(SE(T)>1\)).

For nearly normally distributed populations, Student's t curve is a better approximation to the probability histogram of \(T\) than the normal curve is. Student's t curve is broader and flatter than the normal curve, which accounts for the extra variability in the distribution of \(T\). Actually, Student's t curve is not one curve: It is a family of curves, one for each value of the degrees of freedom d.f., 1, 2, …. In approximating the probability histogram of \(T\), the appropriate value of d.f. to use is \(n-1\), one less than the sample size. When d.f. is small, Student's t curve is much broader and flatter than the normal curve. As d.f. grows, Student's t curve gets closer and closer to the normal curve; for d.f. over 200, the two curves are essentially indistinguishable. For every value of the degrees of freedom, the total area under Student's t curve is 100%, the curve has a single peak at zero, and the curve is symmetric about zero.

The area under the normal curve between ±1.96 is 95%, but the area under Student's t curve with 25 degrees of freedom between ±1.96 is only about 93.9%: Student's t curve with d.f.=25 is broader than the normal curve. With 200 degrees of freedom, Student's t curve is narrower, and the area between ±1.96 is about 94.9%, much closer to the area under the normal curve.

We define quantiles of Student's t curves in the same way we defined quantiles of the normal curve: For any number \(a\) between 0 and 100%, the \(a\) quantile of Student's t curve with \(d.f.=d\), \(t_{d,a}\), is the unique value such that the area under Student's t curve with \(d\) degrees of freedom from minus infinity to \(t_{d,a}\) is equal to \(a\). For example, \(t_{d,0.5} = 0\) for all values of \(d\). Generally, the value of \(t_{d,a}\) depends on the degrees of freedom \(d\).
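Quantiles of and areas under Student's t curve are tabulated in most statistics texts and are available in standard software. For example, the following sketch (Python, assuming SciPy) reproduces the areas quoted above and compares a t quantile with the corresponding quantile of the normal curve.

```python
from scipy.stats import t, norm

# Area between -1.96 and 1.96 under Student's t curve, for several values of d.f.,
# compared with the area under the normal curve (95%).
for df in (5, 25, 200):
    area = t.cdf(1.96, df) - t.cdf(-1.96, df)
    print(df, round(area, 3))            # approaches 0.95 as d.f. increases

# Quantiles: t_{d, a} is the a quantile of Student's t curve with d degrees of freedom.
print(round(t.ppf(0.975, 25), 3))        # roughly 2.06
print(round(norm.ppf(0.975), 3))         # roughly 1.96 for the normal curve
```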

t Test for the Mean of a Nearly Normally Distributed Population

We can use Student's t curve to construct approximate tests of hypotheses about the population mean \(\mu\) when the population standard deviation is unknown, for intermediate values of the sample size \(n\). The approach is directly analogous to the z test, but instead of using a quantile of the normal curve, we use the corresponding quantile of Student's t curve (with the appropriate number of degrees of freedom). However, when \(n\) is small or intermediate, the distribution of values in the population must be nearly normal for the test to have approximately its nominal significance level. This is a somewhat bizarre restriction: It may require a very large sample to detect that the population is not nearly normal—but if the sample is very large, we can use the z test instead of the t test, so we don't need to rely as much on the assumption. It is my opinion that the t test is over-taught and overused—because its assumptions are not verifiable in the situations where it is potentially useful.

Consider testing the null hypothesis that \(\mu=\mu_0\) using the sample mean \(M\) and sample standard deviation \(s\) of a random sample of size \(n\) drawn with replacement from a population that is known to have a nearly normal distribution. Define

\[ T = \frac{M - \mu_0}{s/n^{1/2}}. \]

Under the null hypothesis, if \(n\) is not too small, Student's t curve with \(n-1\) degrees of freedom will be an accurate approximation to the probability histogram of \(T\), so

\[ P(T < t_{n-1,a}), \]

\[ P(T > t_{n-1,1-a}), \]

and

\[ P(|T| > t_{n-1,1-a/2}) \]

all are approximately equal to \(a\). As we saw earlier in this chapter for the Z statistic, these three approximations give three tests of the null hypothesis \(\mu=\mu_0\) at approximate significance level \(a\): the left-tail t test, which rejects the null hypothesis if \(T < t_{n-1,a}\); the right-tail t test, which rejects the null hypothesis if \(T > t_{n-1,1-a}\); and the two-tail t test, which rejects the null hypothesis if \(|T| > t_{n-1,1-a/2}\).

To decide which t test to use, we can apply the same rule of thumb we used for the z test: If, under the alternative hypothesis, \(E(T) < 0\), use the left-tail t test; if, under the alternative hypothesis, \(E(T) > 0\), use the right-tail t test; if, under the alternative hypothesis, \(E(T)\) could be less than zero or greater than zero, use the two-tail t test; and if, under the alternative hypothesis, \(E(T) = 0\), consult a statistician.

P-values for t tests are computed in much the same way as P-values for z tests. Let t be the observed value of \(T\) (the t score). In a left-tail t test, the P-value is the area under Student's t curve with \(n-1\) degrees of freedom, from minus infinity to \(t\). In a right-tail t test, the P-value is the area under Student's t curve with \(n-1\) degrees of freedom, from \(t\) to infinity. In a two-tail t test, the P-value is the total area under Student's t curve with \(n-1\) degrees of freedom between minus infinity and \(-|t|\) and between \(|t|\) and infinity.
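As an illustration, the sketch below computes the t score and all three P values for a hypothetical small sample; the two-tail P value should agree with what scipy.stats.ttest_1samp reports for the same data and null value.

```python
from statistics import mean, stdev
from scipy.stats import t

# Hypothetical small sample from a population assumed to be nearly normally distributed.
sample = [9.8, 10.6, 10.1, 9.5, 10.9, 10.3, 9.9, 10.4, 10.7, 9.6]
mu_0 = 10.0

n = len(sample)
T = (mean(sample) - mu_0) / (stdev(sample) / n ** 0.5)   # the t score
df = n - 1                                               # degrees of freedom

p_left  = t.cdf(T, df)                 # left-tail P value
p_right = 1 - t.cdf(T, df)             # right-tail P value
p_two   = 2 * t.cdf(-abs(T), df)       # two-tail P value

print(round(T, 3), round(p_left, 3), round(p_right, 3), round(p_two, 3))
```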

There are versions of the t test for comparing two means, as well. Just like for the z test, the method depends on how the samples from the two populations are drawn. For example, if the two samples are paired (if we are sampling individuals labeled with two numbers and for each individual in the sample, we observe both numbers), we may base the t test on the sample mean of the paired differences and the sample standard deviation of the paired differences. Let \(\mu_1\) and \(\mu_2\) be the means of the two populations, and let

\[ \mu = \mu_1 - \mu_2. \]

The \(T\) statistic to test the null hypothesis that \(\mu=\mu_0\) is

\[ T = \frac{(\mbox{sample mean of differences}) - \mu_0 }{(\mbox{sample standard deviation of differences})/n^{1/2}}, \]

and the appropriate curve to use to find the rejection region for the test is Student's t curve with \(n-1\) degrees of freedom, where \(n\) is the number of individuals (differences) in the sample.
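Here is a minimal sketch of the paired-sample t test on hypothetical before-and-after measurements; when \(\mu_0=0\), the result should match scipy.stats.ttest_rel applied to the two lists.

```python
from statistics import mean, stdev
from scipy.stats import t

# Hypothetical paired measurements on the same individuals (e.g., before and after).
before = [72.0, 68.5, 75.2, 70.1, 69.8, 73.4, 71.0, 74.6]
after  = [70.5, 68.0, 73.8, 70.4, 68.2, 72.1, 70.3, 73.0]
mu_0 = 0.0                              # null hypothesis: the mean difference is zero

d = [x - y for x, y in zip(after, before)]   # paired differences
n = len(d)
T = (mean(d) - mu_0) / (stdev(d) / n ** 0.5)
p_two = 2 * t.cdf(-abs(T), n - 1)       # two-tail P value, n - 1 degrees of freedom

print(round(T, 3), round(p_two, 3))
```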

Two-sample t tests for a difference of means using independent samples depend on additional assumptions, such as equality of the two population standard deviations; we shall not present such tests here.

Hypothesis Tests and Confidence Intervals

There is a deep connection between hypothesis tests about parameters and confidence intervals for parameters. If we have a procedure for constructing a level \(100\% \times (1-a)\) confidence interval for a parameter \(\mu\), then the following rule is a two-sided significance level \(a\) test of the null hypothesis that \(\mu = \mu_0\):

reject the null hypothesis if the confidence interval does not contain \(\mu_0\).

Similarly, suppose we have an hypothesis-testing procedure that lets us test the null hypothesis that \(\mu=\mu_0\) for any value of \(\mu_0\), at significance level \(a\). Define

\(A\) = (all values of \(\mu_0\) for which we would not reject the null hypothesis that \(\mu = \mu_0\)).

Then \(A\) is a \(100\% \times (1-a)\) confidence set for \(\mu\):

\[ P( A \mbox{ contains the true value of } \mu ) = 100\% \times (1-a). \]

(A confidence set is a generalization of the idea of a confidence interval: a \(1-a\) confidence set for the parameter \(\mu\) is a random set that has probability \(1-a\) of containing \(\mu\). As is the case with confidence intervals, the probability makes sense only before collecting the data.) The set \(A\) might or might not be an interval, depending on the nature of the test. If one starts with a two-tail z test or two-tail t test, one ends up with a confidence interval rather than a more general confidence set.

Confidence Intervals Using Student's t curve

The t test lets us test the hypothesis that the population mean \(\mu\) is equal to \(\mu_0\) at approximate significance level \(a\) using a random sample with replacement of size \(n\) from a population with a nearly normal distribution. If the sample size \(n\) is small, the actual significance level is likely to differ considerably from the nominal significance level. Consider a two-sided t test of the hypothesis \(\mu=\mu_0\) at significance level \(a\). If the sample mean is \(M\) and the sample standard deviation is \(s\), we would not reject the null hypothesis at significance level \(a\) if

\[ \frac{|M-\mu_0|}{s/n^{1/2}} \le t_{n-1,1-a/2}. \]

We rearrange this inequality:

\[ -t_{n-1,1-a/2} \le \frac{M-\mu_0}{s/n^{1/2}} \le t_{n-1,1-a/2} \]

\[ -t_{n-1,1-a/2} \times s/n^{1/2} \le M - \mu_0 \le t_{n-1,1-a/2} \times s/n^{1/2} \]

\[ -M - t_{n-1,1-a/2} \times s/n^{1/2} \le - \mu_0 \le -M + t_{n-1,1-a/2} \times s/n^{1/2} \]

\[ M + t_{n-1,1-a/2} \times s/n^{1/2} \ge \mu_0 \ge M - t_{n-1,1-a/2} \times s/n^{1/2} \]

\[ M - t_{n-1,1-a/2} \times s/n^{1/2} \le \mu_0 \le M + t_{n-1,1-a/2} \times s/n^{1/2}. \]

That is, we would not reject the hypothesis \(\mu = \mu_0\) provided \(\mu_0\) is in the interval

\[ [M - t_{n-1,1-a/2} \times s/n^{1/2}, M + t_{n-1,1-a/2} \times s/n^{1/2}]. \]

Therefore, that interval is a \(100\% \times (1-a)\) confidence interval for \(\mu\):

\[ P([M - t_{n-1,1-a/2} \times s/n^{1/2}, M + t_{n-1,1-a/2} \times s/n^{1/2}] \mbox{ contains } \mu) \approx 1-a. \]
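The sketch below constructs this confidence interval for a hypothetical sample, using the \(1-a/2\) quantile of Student's t curve with \(n-1\) degrees of freedom (computed with SciPy).

```python
from statistics import mean, stdev
from scipy.stats import t

# Hypothetical sample from a population assumed to be nearly normally distributed.
sample = [23.1, 19.8, 21.4, 22.7, 20.3, 24.0, 21.9, 22.2, 20.8, 23.5]
a = 0.05                                     # for an approximate 95% confidence interval

n = len(sample)
M = mean(sample)
s = stdev(sample)
t_crit = t.ppf(1 - a / 2, n - 1)             # t_{n-1, 1-a/2}

lo = M - t_crit * s / n ** 0.5
hi = M + t_crit * s / n ** 0.5
print(round(lo, 2), round(hi, 2))            # approximate 100% x (1-a) confidence interval for mu
```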

Summary

In hypothesis testing, a Z statistic is a random variable whose probability histogram is approximated well by the normal curve if the null hypothesis is correct: If the null hypothesis is true, the expected value of a Z statistic is zero, the SE of a Z statistic is approximately 1, and the probability that a Z statistic is between \(a\) and \(b\) is approximately the area under the normal curve between \(a\) and \(b\). Suppose that the random variable \(Z\) is a Z statistic. If, under the alternative hypothesis, \(E(Z) < 0\), the appropriate z test to test the null hypothesis at approximate significance level \(a\) is the left-tailed z test: Reject the null hypothesis if \(Z < z_a\), where \(z_a\) is the \(a\) quantile of the normal curve. If, under the alternative hypothesis, \(E(Z)>0\), the appropriate z test to test the null hypothesis at approximate significance level \(a\) is the right-tailed z test: Reject the null hypothesis if \(Z>z_{1-a}\). If, under the alternative hypothesis, \(E(Z)\ne 0 \) but could be greater than 0 or less than 0, the appropriate z test to test the null hypothesis at approximate significance level \(a\) is the two-tailed z test: reject the null hypothesis if \(|Z|>z_{1-a/2}\). If, under the alternative hypothesis, \(E(Z)=0\), a z test probably is not appropriate—consult a statistician. The exact significance levels of these tests differ from \(a\) by an amount that depends on how closely the normal curve approximates the probability histogram of \(Z\).

Z statistics often are constructed from other statistics by transforming approximately to standard units, which requires knowing the expected value and SE of the original statistic on the assumption that the null hypothesis is true. Let \(X\) be a test statistic; let \(E(X)\) be the expected value of \(X\) if the null hypothesis is true, and let \(se\) be approximately equal to the SE of \(X\) if the null hypothesis is true. If \(X\) is a sample sum of a large random sample with replacement, a sample mean of a large random sample with replacement, or a sum or difference of independent sample means of large samples with replacement,

\[ Z = \frac{X-E(X)}{se} \]

is a Z statistic.

Consider testing the null hypothesis that a population percentage \(p\) is equal to the value \(p_0\) on the basis of the sample percentage \(\phi\) of a random sample of size \(n\) with replacement. Under the null hypothesis, \(E(\phi)=p_0\) and

\[ SE(\phi) = \sqrt{\frac{p_0\times(1-p_0)}{n}}, \]

and if \(n\) is sufficiently large (say \(n \times p_0 > 30\) and \(n \times (1-p_0)>30\), but this depends on the desired accuracy), the normal approximation to

\[ Z = \frac{\phi-p_0}{\sqrt{(p_0 \times (1-p_0))/n}} \]

will be reasonably accurate, so \(Z\) can be used as the Z statistic in a z test of the null hypothesis \(p=p_0\).

Consider testing the null hypothesis that a population mean \(\mu\) is equal to the value \(\mu_0\), on the basis of the sample mean \(M\) of a random sample of size \(n\) with replacement. Let \(s\) denote the sample standard deviation. Under the null hypothesis, \(E(M)=\mu_0\), and if \(n\) is large,

\[ SE(M)=SD/n^{1/2} \approx s/n^{1/2}, \]

and the normal approximation to

\[ Z = \frac{M-\mu_0}{s/n^{1/2}} \]

will be reasonably accurate, so \(Z\) can be used as the Z statistic in a z test of the null hypothesis \(\mu=\mu_0\).

Consider a population of \(N\) individuals, each labeled with two numbers. The \(i\)th individual is labeled with the numbers \(c_i\) and \(t_i\), \(i=1, 2, \ldots, N\). Let \(\mu_c\) be the population mean of the \(N\) values \(\{c_1, \ldots, c_N\}\) and let \(\mu_t\) be the population mean of the \(N\) values \(\{t_1, \ldots, t_N \}\). Let \(\mu=\mu_t-\mu_c\) be the difference between the two population means. Consider testing the null hypothesis that \(\mu=\mu_0\) on the basis of a paired random sample of size \(n\) with replacement from the population: that is, a random sample of size \(n\) is drawn with replacement from the population, and for each individual \(i\) in the sample, \(c_i\) and \(t_i\) are observed. This is equivalent to testing the hypothesis that the population mean of the \(N\) values \(\{(t_1-c_1), \ldots, (t_N-c_N)\}\) is equal to \(\mu_0\), on the basis of the random sample of size \(n\) drawn with replacement from those \(N\) values. Let \(M_t\) be the sample mean of the \(n\) observed values of \(t_i\) and let \(M_c\) be the sample mean of the \(n\) observed values of \(c_i\). Let \(sd\) denote the sample standard deviation of the \(n\) observed differences \(\{(t_i-c_i)\}\). Under the null hypothesis, the expected value of \(M_t-M_c\) is \(\mu_0\), and if \(n\) is large,

\[ SE(M_t-M_c) \approx sd/n^{1/2}, \]

and the normal approximation to the probability histogram of

\[ Z = \frac{M_t-M_c-\mu_0}{sd/n^{1/2}} \]

will be reasonably accurate, so \(Z\) can be used as the Z statistic in a z test of the null hypothesis that \(\mu_t-\mu_c=\mu_0\).

Consider testing the hypothesis that the difference (\(\mu_t-\mu_c\)) between two population means, \(\mu_c\) and \(\mu_t\), is equal to \(\mu_0\), on the basis of the difference (\(M_t-M_c\)) between the sample mean \(M_c\) of a random sample of size \(n_c\) with replacement from the first population and the sample mean \(M_t\) of an independent random sample of size \(n_t\) with replacement from the second population. Let \(s_c\) denote the sample standard deviation of the sample of size \(n_c\) from the first population and let \(s_t\) denote the sample standard deviation of the sample of size \(n_t\) from the second population. If the null hypothesis is true,

\[ E(M_t-M_c)=\mu_0, \]

and if \(n_c\) and \(n_t\) are both large,

\[ SE(M_t-M_c) \approx \sqrt{s_t^2/n_t + s_c^2/n_c} \]

and the normal approximation to the probability histogram of

\[ Z = \frac{M_t-M_c-\mu_0}{\sqrt{s_t^2/n_t + s_c^2/n_c}} \]

will be reasonably accurate, so \(Z\) can be used as the Z statistic in a z test of the null hypothesis that \(\mu_t-\mu_c=\mu_0\).

A list of numbers is nearly normally distributed if the fraction of numbers between any pair of values, \(a < b\), is approximately equal to the area under the normal curve between \((a-\mu)/SD\) and \((b-\mu)/SD\), where \(\mu\) is the mean of the list and SD is the standard deviation of the list.

Student's t curve with \(d\) degrees of freedom is symmetric about 0, has a single bump centered at 0, and is broader and flatter than the normal curve. The total area under Student's t curve is 1, no matter what \(d\) is; as \(d\) increases, Student's t curve gets narrower, its peak gets higher, and it becomes closer and closer to the normal curve.

Let \(M\) be the sample mean of a random sample of size \(n\) with replacement from a population with mean \(\mu\) and a nearly normal distribution, and let \(s\) be the sample standard deviation of the random sample. For moderate values of \(n\) (\(n < 100\) or so), Student's t curve approximates the probability histogram of \((M-\mu)/(s/n^{1/2})\) better than the normal curve does, which can lead to an approximate hypothesis test about \(\mu\) that is more accurate than the z test.

Consider testing the null hypothesis that the mean \(\mu\) of a population with a nearly normal distribution is equal to \(\mu_0\) from a random sample of size \(n\) with replacement. Let

\[ T=\frac{M-\mu_0}{s/n^{1/2}}, \]

where \(M\) is the sample mean and \(s\) is the sample standard deviation. The tests that reject the null hypothesis if \(T < t_{n-1,a}\) (left-tail t test), if \(T>t_{n-1,1-a}\) (right-tail t test), or if \(|T|>t_{n-1,1-a/2}\) (two-tail t test) all have approximate significance level \(a\). How close the nominal significance level \(a\) is to the true significance level depends on the distribution of the numbers in the population, the sample size \(n\), and \(a\). The same rule of thumb for selecting whether to use a left, right, or two-tailed z test (or not to use a z test at all) works to select whether to use a left, right, or two-tailed t test: If, under the alternative hypothesis, \(E(T) < 0 \), use a left-tail test. If, under the alternative hypothesis, \(E(T) > 0 \), use a right-tail test. If, under the alternative hypothesis, \(E(T)\) could be less than zero or greater than zero, use a two-tail test. If, under the alternative hypothesis, \(E(T) = 0 \), consult an expert. Because the t test differs from the z test only when the sample size is small, and from a small sample it is not possible to tell whether the population has a nearly normal distribution, the t test should be used with caution.

A \(1-a\) confidence set for a parameter \(\mu\) is like a \(1-a\) confidence interval for a parameter \(\mu\): It is a random set of values that has probability \(1-a\) of containing the true value of \(\mu\). The difference is that the set need not be an interval.

There is a deep duality between hypothesis tests about a parameter \(\mu\) and confidence sets for \(\mu\). Given a procedure for constructing a \(1-a\) confidence set for \(\mu\), the rule "reject the null hypothesis that \(\mu=\mu_0\) if the confidence set does not contain \(\mu_0\)" is a significance level \(a\) test of the null hypothesis that \(\mu=\mu_0\). Conversely, given a family of significance level \(a\) hypothesis tests that allow one to test the hypothesis that \(\mu=\mu_0\) for any value of \(\mu_0\), the set of all values \(\mu_0\) for which the test does not reject the null hypothesis that \(\mu=\mu_0\) is a \(1-a\) confidence set for \(\mu\).
