The Multinomial Distribution and the Chi-Squared Test for Goodness of Fit

Previous chapters presented hypothesis tests in a general setting, exact and approximate hypothesis testing procedures for population percentages, and approximate tests of hypotheses about population means. All the examples of hypothesis testing so far have involved counts of outcomes that are dichotomous (categorical data with only two categories—good and bad—or quantitative data that have only two possible values—0 and 1), or have involved quantitative data. This chapter presents hypothesis tests and approximate hypothesis tests for probability models of categorical data. Along the way, it introduces joint probability distributions and the chi-square curve, which approximates the probability histogram of a random variable introduced in this chapter, the chi-squared statistic.

The Multinomial Distribution

The multinomial probability distribution is a probability model for random categorical data: If each of n independent trials can result in any of k possible types of outcome, and the probability that the outcome is of a given type is the same in every trial, the numbers of outcomes of each of the k types have a multinomial joint probability distribution. This section develops the multinomial distribution; later in the chapter we develop hypothesis tests that a given multinomial model is correct, using the observed counts of data in each of the categories.

Suppose we have an experiment that will produce categorical data: The outcome can fall in any of k categories, where k > 1 is known. Let pi be the probability that the outcome is in category i, for i = 1, 2, …, k. (We assume that the categories are disjoint—a given outcome cannot be in more than one category—and exhaustive—each datum must fall in some category. That is, each datum must be in one and only one of the k categories. It follows that p1 + p2 + … + pk = 100%.)

For example, consider rolling a fair die. The side that lands on top can be in any of six categories: 1, 2, … , 6, according to the number of spots it has. The corresponding category probabilities are

p1 = p2 = … = p6 = 1/6.

Now consider repeating the experiment n times, independently, and recording how many times each type of outcome occurs. The outcome space is a set of k counts: the number of trials that result in an outcome of type i, for i = 1, 2, …, k. Let Xi be the number of trials in which the outcome was in category i. Because there are n trials, each of which must result in one of the k possible outcomes,

X1 + X2 + … + Xk = n.

For example, consider rolling the die four times, so n = 4. X1 is the number of times the side with one spot shows in four rolls of the die; X2 is the number of times the side with two spots shows in the same four rolls of the die; etc. One possible outcome is

(X1 = 2, X2 = 1, X3 = 1, X4 = 0, X5 = 0, X6 = 0),

which means one spot showed in two rolls, two spots showed in one roll, three spots showed in one roll, and the other faces (four, five, six) did not show. The outcome

(X1 = 2, X2 = 2, X3 = 1, X4 = 0, X5 = 0, X6 = 0),

is impossible, because it would require five rolls of the die (X1+X2+X3+ X4+X5+X6 = 5).

The number of outcomes in category i is like the number of successes in n independent trials with the same probability pi of success in each trial, so Xi has a binomial distribution with parameters n and pi. The random variables {X1, X2, …, Xk} are dependent.

For example, if X1 = n, it follows that

X2= … =  Xk = 0.

Similarly,

X1 = n - ( X2 + … + Xk ).

The variables are informative with respect to each other, so they are not independent.

P( X1 = n1 and X2 = n2 and … and Xk = nk )

is not in general equal to

P(X1 = n1) × P(X2 = n2 ) × … × P(Xk = nk).

We can find

P( X1 = n1 and X2 = n2 and … and Xk = nk )

using logic similar to that we used to find the binomial distribution. The difference is that here there can be more than two categories of outcome (k can be greater than 2), while for the binomial, there were exactly two categories, "success" and "failure."

Consider the n trials in sequence. Let

n1, n2, …, nk

be nonnegative integers whose sum is n. In how many ways can the n trials result in n1 outcomes of type 1, n2 outcomes of type 2, …, and nk outcomes of type k? There are C(n, n1) ways (writing C(n, r) for the number of combinations of n things taken r at a time) to allocate the n1 outcomes of type 1 among the n trials. For each of those, there are C(n-n1, n2) ways to allocate the n2 outcomes of type 2 among the remaining n-n1 trials. For each of those, there are C(n-n1-n2, n3) ways to allocate the n3 outcomes of type 3 among the remaining n-n1-n2 trials, etc. Finally, there are only nk spaces left for the nk outcomes of type k. According to the fundamental rule of counting, the total number of ways is therefore

C(n, n1) × C(n-n1, n2) × C(n-n1-n2, n3) × … × C(n-n1-n2- … -n(k-2), n(k-1)).

There are many cancellations in this product; the expression simplifies to

n!/(n1! × n2! × … × nk!).
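
If you like to verify such identities numerically, here is a short Python sketch (the counts n1, …, nk below are hypothetical) that computes the product of binomial coefficients and the single factorial expression and confirms that they agree:

```python
from math import comb, factorial

# hypothetical counts for k = 4 categories in n = 10 trials
counts = [3, 2, 4, 1]              # n1, n2, n3, n4
n = sum(counts)                    # n = 10

# product of binomial coefficients: C(n, n1) x C(n-n1, n2) x ...
product = 1
remaining = n
for c in counts:
    product *= comb(remaining, c)
    remaining -= c

# single expression n!/(n1! x n2! x ... x nk!)
denominator = 1
for c in counts:
    denominator *= factorial(c)
multinomial_coefficient = factorial(n) // denominator

print(product, multinomial_coefficient)   # both print 12600
```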

What is the probability of each such sequence of outcomes? The trials are independent, so the chance of each sequence with n1 outcomes of type 1, n2 outcomes of type 2, … , and nk outcomes of type k is

p1^n1 × p2^n2 × … × pk^nk.

Therefore, the chance that the n trials result in n1 outcomes of type 1, n2 outcomes of type 2, … , and nk outcomes of type k is

( n!/(n1! × n2! × … × nk!) ) × p1^n1 × p2^n2 × … × pk^nk,

if n1, … , nk are nonnegative integers that sum to n. (Otherwise, the chance is zero.) This is called the multinomial distribution with parameters n and p1, … , pk.

The Multinomial Distribution

Let {X1, X2, … , Xk}, k > 1, be a set of random variables, each of which can take the values 0, 1, … , n.

Suppose there are k nonnegative numbers {p1, p2, … , pk} that sum to one, such that for every set of k nonnegative integers {n1, … , nk} whose sum is n,

P( X1 = n1 and X2 = n2 and … and Xk = nk ) = ( n!/(n1! × n2! × … × nk!) ) × p1^n1 × p2^n2 × … × pk^nk.

Then {X1, X2, … , Xk} have a multinomial joint distribution with parameters n and p1, p2, … , pk.

The parameter n is called the number of trials; the parameters p1, p2, … , pk are called the category probabilities; k is called the number of categories.

What kinds of variables have a multinomial joint distribution? The canonical example of random variables with a multinomial joint distribution are the numbers of observations in each of k categories in n independent trials, where the probability pi that the observation is in category i is the same in every trial, and the categories are disjoint and exhaustive: every observation must be in exactly one of the k categories. If the number of categories or their probabilities vary from trial to trial, if the number of trials is not fixed in advance, if the trials are dependent, if an observation can be in more than one category, or if an observation can be in none of the categories, the resulting counts do not have a multinomial joint distribution.

Note that in the special case k = 2, the multinomial probability reduces to the binomial probability

p1^n1 × p2^n2 × n!/(n1! × n2!) = p1^n1 × (1 - p1)^(n - n1) × n!/(n1! × (n - n1)!) = C(n, n1) × p1^n1 × (1 - p1)^(n - n1).
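
The reduction can be checked numerically; the following Python sketch (with a hypothetical n and p1) compares the multinomial formula for k = 2 with the binomial formula for every possible value of n1:

```python
from math import comb, factorial

n, p1 = 10, 0.3          # hypothetical number of trials and category-1 probability
p2 = 1 - p1

for n1 in range(n + 1):
    n2 = n - n1
    multinomial = factorial(n) / (factorial(n1) * factorial(n2)) * p1**n1 * p2**n2
    binomial    = comb(n, n1) * p1**n1 * (1 - p1)**(n - n1)
    assert abs(multinomial - binomial) < 1e-12   # the two formulas agree
```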

Continuing the example of rolling a fair die four times, we find

P(X1 = 2, X2 = 1, X3 = 1, X4 = 0, X5 = 0, X6 = 0)

= (1/6)^2 × (1/6)^1 × (1/6)^1 × (1/6)^0 × (1/6)^0 × (1/6)^0 × 4!/(2!×1!×1!×0!×0!×0!)

= (1/6)^4 × 24/2 = 1/108.
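
The same arithmetic can be carried out in Python; this sketch reproduces the probability 1/108 from the multinomial formula (the cross-check in the comment assumes scipy is installed):

```python
from math import factorial

# the outcome from the worked example: four rolls of a fair die
counts = [2, 1, 1, 0, 0, 0]          # (X1, X2, ..., X6)
probs  = [1/6] * 6                   # category probabilities for a fair die
n = sum(counts)                      # n = 4 trials

denom = 1
for c in counts:
    denom *= factorial(c)
coefficient = factorial(n) // denom  # 4!/(2! x 1! x 1! x 0! x 0! x 0!) = 12

probability = coefficient
for c, p in zip(counts, probs):
    probability *= p ** c            # multiply by (1/6)^2 x (1/6)^1 x ...

print(probability, 1 / 108)          # both are about 0.0092593

# optional cross-check, if scipy is installed:
# from scipy.stats import multinomial
# print(multinomial.pmf(counts, n=n, p=probs))
```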

The following exercise checks your ability to compute using the multinomial distribution.

The chi-square statistic

The chi-square statistic is a summary measure of how well the observed frequencies of categorical data match the frequencies that would be expected under the null hypothesis that a particular multinomial probability model for the data is correct.

Suppose we would like to test the null hypothesis that a set of categorical data arises from a multinomial distribution with k categories and category probabilities p1, …, pk. (For example, suppose we want to test the hypothesis that a die is fair on the basis of the numbers of times the die lands with each of its six faces showing in 100 independent rolls.) We could base a test on the differences between the observed and expected numbers of outcomes in each of the k categories. If those differences are all small, the data are consistent with the null hypothesis. If those differences are sufficiently large, either the null hypothesis is false, or an event has occurred that has small probability. How small is small enough to be acceptable? How large is large enough to be surprising?

The standard error of Xi measures how far Xi is from its expected value, on the average (it is the square-root of the expected squared deviation of Xi from its expected value). It makes sense to measure the difference between Xi and its expected value as a multiple of the standard error of Xi. (For example, Chebychev's inequality bounds the chance that Xi is many SEs from its expected value.) Dividing each discrepancy by its standard error also puts the k categories on an equal footing, which will help us combine them into a single summary measurement of how far the data are from their expected values.

Under the null hypothesis, the number of outcomes in category i has a binomial probability distribution with parameters n and pi, so the expected value of Xi, the number of outcomes in category i, is

E(Xi) = n×pi.

Note that the sum of the expected values of the k variables is

n×p1 + n×p2 + … + n×pk = n × (p1 + p2 + … + pk ) = n × 1 = n.

The standard error of Xi is

SE(Xi) = ( n × pi × (1 - pi) )½.

To put all the discrepancies on an equal footing, we can divide them by their standard errors (under the null hypothesis), which leads us to consider the standardized variables

( Xi - n×pi ) / ( n×pi×(1 - pi) )½.

Unless every discrepancy is zero, some will be positive and some will be negative, because they must sum to zero. To get an overall measure of the size of the discrepancies, we might square the normalized discrepancies to make them all positive, then add the squares. Squaring the discrepancies keeps differences of opposite signs from canceling each other—regardless of sign, it is the size of each discrepancy that matters.

This leads us to consider the summary measure

( X1 - n×p1 )^2/( n×p1×(1 - p1) ) + ( X2 - n×p2 )^2/( n×p2×(1 - p2) ) + … + ( Xk - n×pk )^2/( n×pk×(1 - pk) ).

There are theoretical reasons, beyond the scope of this book, that make it preferable to omit the factors (1 - pi) in the denominators of the terms in the sum. (If there are many categories, and none of the category probabilities is large, then (1 - pi)½ is nearly unity, and it does not matter whether we include the factors.) This leads to the summary statistic

chi-squared = ( X1 - n×p1 )^2/(n×p1) + ( X2 - n×p2 )^2/(n×p2) + … + ( Xk - n×pk )^2/(n×pk),

which also can be written

chi-squared = ( X1 - E(X1) )^2/E(X1) + ( X2 - E(X2) )^2/E(X2) + … + ( Xk - E(Xk) )^2/E(Xk).

Let oi denote the number of times an outcome in category i occurs, and let ei denote the expected number of outcomes in category i on the assumption that the null hypothesis is true. Then the chi-squared statistic is the sum of

(oi - ei)^2/ei,

over all categories i = 1, 2, …, k.
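
As an illustration, the following Python sketch computes the chi-squared statistic for a hypothetical set of observed counts in 60 rolls of a die, under the null hypothesis that the die is fair (so every expected count is 10):

```python
# hypothetical observed counts in 60 rolls of a die; the null hypothesis is that
# the die is fair, so each expected count is 60 x (1/6) = 10
observed = [8, 12, 9, 11, 14, 6]
probs    = [1/6] * 6
n        = sum(observed)

expected = [n * p for p in probs]
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_squared)    # 4.2 for these counts
```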

The following exercise checks your understanding of the development so far—your ability to compute the expected numbers of outcomes in a multinomial model, and your ability to calculate the chi-squared statistic. The exercise is dynamic: the data tend to change when you reload the page.

The sampling distribution of the chi-squared statistic

The chi-squared statistic is a summary measure of how far the observed numbers of counts in each category are from their expected values, given a multinomial probability model for the data under the null hypothesis. It would be reasonable to reject the null hypothesis if chi-squared is large. But how large is large? If the null hypothesis is true, how large does the chi-squared statistic tend to be? What threshold value x can we set for chi-squared so that, if the null hypothesis is true,

P(chi-squared > x) ≤ p?

In general, the answer depends on the number of trials, the number of categories, and the probability of each category; but we shall see that there are regularities—there is an approximation that depends only on the number of categories, and is accurate provided the expected count in every category is large.

Let us look at the sampling distribution of the chi-squared statistic empirically. The simulation tool below starts with four category probabilities: 0.1, 0.2, 0.3, 0.4. Note that the sum of these probabilities is 1. The Sample Size control is set to 5 initially. When you click the Take Sample button, the computer simulates drawing a random sample of size 5 with replacement from the four categories, and computes

chi-squared = (o1 - 5×0.1)^2/(5×0.1) + (o2 - 5×0.2)^2/(5×0.2) + (o3 - 5×0.3)^2/(5×0.3) + (o4 - 5×0.4)^2/(5×0.4),

where oi is the number of elements of the random sample that were in category i, for i = 1, …, 4. It then plots this value in a histogram in the main panel of the tool. Every time you click the button, the computer takes another random sample and appends the observed value of the chi-squared statistic to the list of values plotted in the histogram.

Click the button a few times to get a feel for what happens. Change the value of the "Take ___ samples" control to 1000. Now when you click the button, the computer will draw 1000 samples of size 5 and append the 1000 observed values of the chi-squared statistic to the list of values plotted in the histogram. Click the "Take Sample" button until you have drawn 10,000 samples of size 5.

Because of the law of large numbers, the histogram of 10,000 observed values of the chi-squared statistic is quite likely to be close to the probability histogram of the chi-squared statistic for this set of category probabilities and this sample size (ignoring differences caused by the choice of bins). The histogram starts high near zero, rises to a peak near two, then descends, but has a few "spikes" at unusually common values. It is skewed to the right. The area under the histogram to the right of 7.8 is about 2%. Increase the Sample Size control to 50, and take 10,000 samples. The histogram will look much more filled in and regular, but still will have some spikes at particularly probable values. The area under the histogram to the right of 7.8 is roughly 5%. Increase the Sample Size control to 300, and take 10,000 samples. Now the histogram will be very regular, with one mode just below 2, and skewed to the right. The area under the histogram to the right of 7.8 will be very close to 5%. Clear the histogram by clicking in the box of probabilities at the right-hand side of the figure, then clicking again anywhere else in the figure, and repeat the experiment of drawing 10,000 samples of size 300 several times to verify that the area to the right of 7.8 is always about 5%.

Replace the four category probabilities with four different probabilities, and repeat the experiment of increasing the sample size from 5 to 300, drawing 10,000 samples each time. You should find that when the sample size is small, the histogram is rough and the area to the right of 7.8 depends on the category probabilities, but when the sample size is 300, the area to the right of 7.8 is always about 5%, regardless of the category probabilities, provided none of the category probabilities is too small (not less than 0.05 or so).

Under the histogram is a drop-down menu that says No Curve when you first load the page. Select Chi-squared Curve instead of No Curve. A curve will be superposed on the histogram. This is the chi-squared curve with 3 degrees of freedom. The area under the chi-squared curve with 3 degrees of freedom to the right of 7.8 is 5%; that area will be displayed under the histogram next to the area of the highlighted part of the histogram. Highlight different ranges of values and compare the area under the histogram with the area under the curve. The two will be close. Change Sample Size back to 5, draw 10,000 samples, and compare the area under the histogram with the area under the curve for different ranges. You will find that the two tend to differ considerably.

Change the number of category probabilities and their values, and repeat the experiment for different sample sizes. When you change the number of category probabilities, the curve that is displayed is the chi-squared curve with k - 1 degrees of freedom, where k is the number of category probabilities. The tool always assumes that the number of probabilities is the number of categories, and if the probabilities you type in do not sum to 100%, it scales them so that they do, keeping their relative sizes the same.

The accuracy with which the chi-squared curve with k - 1 degrees of freedom approximates the histogram of observed values of the chi-squared statistic depends on the sample size, the number of categories, and the probability of each category. When the sample size is small, the observed histogram of sample values of the chi-squared statistic will tend to be irregular, and the corresponding chi-squared curve will not approximate the histogram very well. When the sample size is large, the observed histogram of sample values (in 10,000 samples) will be close to the chi-squared curve with k - 1 degrees of freedom, in the sense that the area under the histogram is approximately equal to the area under the curve for the same range of values. As a rule of thumb, if the expected count in every category is 10 or greater (if n × pi ≥ 10 for all i = 1, 2, …, k), the chi-squared curve will be a reasonably accurate approximation to the histogram.
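
The experiment the tool performs can also be sketched in a few lines of Python (assuming numpy is available); the simulation below draws 10,000 samples of size 300 from the category probabilities 0.1, 0.2, 0.3, 0.4 and estimates the chance that the chi-squared statistic exceeds 7.8, which should come out close to 5%:

```python
import numpy as np

rng = np.random.default_rng(0)

probs = np.array([0.1, 0.2, 0.3, 0.4])   # the category probabilities used in the text
n, reps = 300, 10_000                     # sample size and number of simulated samples

counts = rng.multinomial(n, probs, size=reps)   # reps x 4 table of observed counts
expected = n * probs
chi_sq = ((counts - expected) ** 2 / expected).sum(axis=1)

# fraction of simulated chi-squared values exceeding 7.8; this should be near 5%,
# the area to the right of 7.8 under the chi-squared curve with 3 degrees of freedom
print((chi_sq > 7.8).mean())
```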

The chi-squared curve

We have just seen that the chi-square curve is an approximation to the probability histogram of the chi-squared statistic (when the null hypothesis is true). Like Student's t curve, the chi-squared curve is actually a family of curves, one for each value of the degrees of freedom. The chi-squared curve with k - 1 degrees of freedom is a good approximation to the probability histogram of the chi-squared statistic for k categories if the null hypothesis is true and the number of trials is sufficiently large that the expected number of outcomes in each category is 10 or larger.

If we think of the chi-squared curve with d degrees of freedom as a probability histogram, the expected value of the corresponding random variable would be d (the balance point of the curve is d) and the standard error of the random variable would be ( 2d )½.
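
These two facts are easy to confirm numerically (assuming scipy is available):

```python
from scipy.stats import chi2   # assumes scipy is installed

for d in (3, 10, 50):
    mean, var = chi2.stats(d, moments="mv")
    print(d, float(mean), float(var) ** 0.5)   # mean is d; SE is (2d)**0.5
```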

The tool below displays the chi-square curve and lets you find the area under the curve over an interval of values. When you first visit this page, the figure will show the chi-square curve with 3 degrees of freedom, and the range from 7.8 to 18 will be highlighted; you can change the degrees of freedom and the highlighted range.

Experiment by changing the number of degrees of freedom. As the degrees of freedom increases, the peak moves to larger and larger values, the balance point moves to larger and larger values, and the peak gets wider, but narrower relative to the balance point. As the number of degrees of freedom grows, the curve gets more nearly symmetric.

We can define quantiles of the chi-square curve just as we did quantiles of the normal curve and Student's t-curve: For any number a between 0 and 1, the a quantile of the chi-square curve with d degrees of freedom, xd,a, is the unique value such that the area under the chi-square curve with d degrees of freedom from minus infinity up to xd,a is a.
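
If you prefer to compute areas and quantiles in software rather than with the tool, the following sketch (assuming scipy is available) finds the area to the right of 7.8 under the chi-square curve with 3 degrees of freedom and the corresponding 0.95 quantile:

```python
from scipy.stats import chi2   # assumes scipy is installed

d = 3
print(chi2.sf(7.8, d))      # area to the right of 7.8: about 0.050
print(chi2.ppf(0.95, d))    # the 0.95 quantile x_{3,0.95}: about 7.81
```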

The following exercise checks your ability to use the tool to find areas under the chi-square curve and quantiles of the chi-square curve. The exercise is dynamic: The questions will tend to change when you reload the page.

The Chi-square test for goodness of fit

At last we have the technology to solve the problem posed originally: to test the hypothesis that a set of categorical data were generated by a given multinomial probability model. Suppose that, under the null hypothesis, the data arise from n independent trials, each of which has probability p1 of resulting in an outcome in category 1, probability p2 of resulting in an outcome in category 2, … , and probability pk of resulting in an outcome of type k, where

p1 + p2 + … + pk = 100%.

Suppose further that

n × pi ≥ 10, for i = 1, 2, … , k.

Then, under the null hypothesis, the chance that the chi-square statistic exceeds x is very close to the area under the chi-square curve with k - 1 degrees of freedom above x. Therefore, if we reject the null hypothesis when the observed value of the chi-squared statistic is greater than xk-1,1-a, the chance of a Type I error (rejecting the null hypothesis when it is in fact true) will be about a.

In the example below, we return to the problem that motivated this chapter: testing whether a die is fair. The example is dynamic: the data tend to change when you reload the page, so you can see many examples of the computations.

This is the chi-square test for goodness of fit. The ingredients of the test are as follows:

Ingredients of the chi-square test for goodness of fit

The data are the observed numbers of outcomes o1, o2, … , ok in each of k disjoint, exhaustive categories, in n independent trials. The null hypothesis specifies the probability pi that each trial results in an outcome in category i, for i = 1, 2, … , k, with p1 + p2 + … + pk = 100%; the expected number of outcomes in category i is then ei = n×pi. Suppose that n×pi ≥ 10 for every category i = 1, 2, … , k.

Then, under the null hypothesis, the probability histogram of the chi-squared statistic,

chi-squared = sum of (oi - ei)^2/ei

over all categories i = 1, 2, … , k, is approximated reasonably well by the chi-squared curve with k - 1 degrees of freedom.

The chi-squared test for goodness of fit is to reject the null hypothesis if the observed value of the chi-squared statistic is greater than xk-1,1-a, the 1-a quantile of the chi-squared curve with k-1 degrees of freedom, where a is the desired significance level. Under the assumptions given above, the significance level of this test is approximately a.

The P-value of the null hypothesis is approximately equal to the area under the chi-squared curve with k - 1 degrees of freedom, to the right of the observed value of the chi-squared statistic.
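
Putting the pieces together, here is a Python sketch of the whole test for the die-fairness example (the observed counts are hypothetical; scipy is assumed to be available):

```python
from scipy.stats import chi2   # assumes scipy is installed

# hypothetical observed counts in 600 rolls of a die; null hypothesis: the die is fair
observed = [90, 110, 95, 102, 115, 88]
probs    = [1/6] * 6
n        = sum(observed)
k        = len(observed)
alpha    = 0.05

expected = [n * p for p in probs]                     # every expected count is 100 >= 10
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

p_value  = chi2.sf(chi_squared, k - 1)                # area to the right of the statistic
critical = chi2.ppf(1 - alpha, k - 1)                 # x_{k-1, 1-alpha}

print(chi_squared, p_value)
print("reject" if chi_squared > critical else "do not reject")
```

If scipy is installed, scipy.stats.chisquare(observed, f_exp=expected) computes the same statistic and P-value in a single call.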

Note that we might reject the null hypothesis in a number of different situations: the true category probabilities might differ from those specified by the null hypothesis; the multinomial model itself might not hold (for example, the trials might not be independent, or the category probabilities might change from trial to trial); or the null hypothesis might be true, and an event of small probability occurred (a Type I error). The test cannot tell us which of these scenarios holds.

The following exercises check your ability to perform the chi-squared test for goodness of fit. The exercises are dynamic: the data tend to change when you reload the page.

Summary

The multinomial distribution is a common probability model for categorical data. It is a generalization of the binomial distribution to more than two possible categories of outcome. In an independent sequence of n trials, each of which has probability p1 of resulting in an outcome in category 1, probability p2 of resulting in an outcome in category 2, … , and probability pk of resulting in an outcome in category k, with p1+p2+ … +pk=100%, the numbers X1, X2, … , Xk of outcomes in each of the k categories have a multinomial joint probability distribution: If n1+n2+ … +nk=n,

P(X1=n1 and X2=n2 and … and Xk=nk) = p1^n1 × p2^n2 × … × pk^nk × n!/(n1!×n2!× … ×nk!).

The probability is zero if n1+n2+ … +nk ≠ n. This is called the multinomial distribution with parameters n and p1, p2, … , pk. The expected number of outcomes in category i is n×pi.

The chi-squared statistic is a summary measure of the discrepancy between the observed numbers of outcomes in each category and the expected number of outcomes in each category:

chi-squared = (X1-E(X1))^2/E(X1) + (X2-E(X2))^2/E(X2) + … + (Xk-E(Xk))^2/E(Xk).

The probability distribution of the chi-squared statistic when the null hypothesis is true depends on the number n of trials and the probabilities p1, p2, … , pk. However, if the number of trials is large enough that n×pi ≥ 10 for every category i=1, 2, … , k, the chi-square curve with k-1 degrees of freedom is an accurate approximation to the probability histogram of the chi-squared statistic. The chi-square curve with d degrees of freedom is positive, has total area 100%, and has a single bump (mode). The balance point of the chi-square curve with d degrees of freedom is d. The a quantile of the chi-square curve with d degrees of freedom, denoted xd,a, is the point for which the area to the left of xd,a under the chi-square curve with d degrees of freedom is a.

The chi-squared statistic and the chi-square curve can be used to test the null hypothesis that a given multinomial model gives rise to observed categorical data as follows: Let n be the total number of observations, let k be the number of categories, and let p1, … , pk be the probabilities of the categories according to the null hypothesis. Let X1 be the observed number of outcomes in category 1, let X2 be the observed number of outcomes in category 2, etc. Let

chi-squared = (X1-n×p1)^2/(n×p1) + (X2-n×p2)^2/(n×p2) + … + (Xk-n×pk)^2/(n×pk).

If n×pi ≥ 10 for every category i=1, 2, … , k, the probability histogram of chi-squared when the null hypothesis is true can be approximated accurately by the chi-square curve with k-1 degrees of freedom, and the rule

Reject the null hypothesis if chi-squared>xk-1,1-a

is a test of the null hypothesis at approximate significance level a. The (approximate) P-value is the area to the right of chi-squared under the chi-square curve with k-1 degrees of freedom. This is called the chi-square test for goodness of fit.

Key Terms