Sampling

A central problem in statistics is to obtain information about a population, a collection of units, without examining every unit in the population, using only a sample from the population. The ability to draw conclusions from a sample is essential if it is impossible or impractical to collect information about the entire population. Moreover, in some situations, sampling can lead to more reliable estimates and inferences than attempting to measure every unit in the entire population. The sampling design, the rules for deciding which units comprise the sample, is crucial to the accuracy and reliability of the results. Poor designs result in bias, a systematic tendency for estimators of population parameters using the sample to be too high or too low. Sample surveys (samples of the opinions of people obtained by interviewers, pollsters, or questionnaires) tend to suffer from nonresponse bias: those who respond to the survey often differ from those who do not respond with respect to variables that the survey seeks to study. Even if response is complete, some sampling designs tend to be biased. The best way to keep bias to a minimum is to use random sampling, which deliberately introduces chance into the selection of the sample from the population. The magnitude of the error (i.e., the uncertainty) of extrapolating to a population from a random sample can be estimated, while the error of extrapolating from other kinds of samples is essentially impossible to estimate.

Parameters and Statistics

In our discussion so far of drawing from a box (or sampling from a population), we have known the contents of the box, and calculated the chance (exact or approximate) that the sum or average of the draws would be in some range. These are probability calculations. Now we turn to statistical estimation and inference, which work in the other direction: Starting with a sample from a box and information about how the sample was drawn, we can draw conclusions about the contents of the box (the population). The conclusions are subject to uncertainty, unless the sample is known to be the entire population.

Typically, we are interested in a numerical property of the population, called a parameter, and we base our estimates and inferences on the observed value of a quantity computed from the sample, called a statistic. These are two of the most important and fundamental problems Statistics addresses: How to estimate and make inferences about a parameter of a population, using a statistic computed from the sample.

A population is a collection of units, which could be people, things, places, times, temperatures, etc. A parameter is a numerical property of a population. For example, the population might be all the registered voters in the United States in October, 2003, and the parameter might be the fraction of those registered voters who were registered as Democrats. The parameter is what the statistician would like to know but does not; a statistic is something the statistician can calculate from things he knows (or will know once the sample is drawn), and could use to try to estimate the parameter.

A statistic that is calculated in order to estimate a parameter is called an estimator. For example, the sample mean is a common estimator of the population mean, and the sample percentage is a common estimator of the population percentage.
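The distinction between a parameter and an estimator can be sketched in a few lines of Python. The population below is made up purely for illustration (the counts and the sample size of 500 are assumptions, not data), and the seed is fixed only so that the sketch is repeatable.

```python
import random

# Hypothetical population: 1 = registered Democrat, 0 = otherwise.
# The counts are made up for illustration only.
population = [1] * 4200 + [0] * 5800

# Parameter: a numerical property of the whole population.
parameter = sum(population) / len(population)

# Statistic: computed from the sample alone. Here the sample fraction
# serves as an estimator of the population fraction.
rng = random.Random(0)                  # fixed seed, for repeatability
sample = rng.sample(population, 500)    # draw 500 units without replacement
estimate = sum(sample) / len(sample)

print(f"parameter = {parameter:.3f}, estimate = {estimate:.3f}")
```

In practice the statistician sees only `sample` and `estimate`; `parameter` is computable here only because the whole made-up population is available.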


Why Sample?

Estimating parameters from a sample instead of making measurements on an entire population is useful in many situations, including quality control, market research, predicting elections, etc. Sampling can be preferable to attempting to make measurements on an entire population, for a variety of reasons:

Sampling reduces the number of measurements that need to be made. This can
- allow more expensive measurement apparatus and better trained staff to be used, which can increase the precision and accuracy of the results
- save money, time, and resources
- make timely estimates possible
- be essential in destructive testing.
If the sample is designed well, one can quantify the uncertainty in parameter estimates based on the sample.

As an extreme example, if an automobile manufacturer wants to know how its cars fare in a crash, it cannot test all the cars it manufactures—it would have no cars left to sell. A lightbulb manufacturer who wants to know how long a certain type of bulb lasts cannot run all such bulbs until they burn out—it would have no bulbs left to sell. Using a product until it fails to see how long it lasts, crashing cars into barriers to see how crash-worthy they are, cutting through a casting to see whether it contains flaws, and similar methods for assessing product quality are called destructive testing. It is clearly desirable to be able to extrapolate to the whole population of bulbs, cars, or what have you, from a sample tested destructively.

In other cases, the cost (in time, manpower, or money) of making measurements on the whole population can be prohibitive. For example, in the U.S. Census, about one household in ten receives the long form. Whether results from a sample can be extrapolated to the population from which the sample was drawn depends critically on the design of the sample—how the sample was drawn—as we shall see in the next section.

Sample Surveys

Sample surveys attempt to determine the opinions, beliefs, behavior, or other parameters of a population of people from the responses of a sample from the population to a questionnaire or interview. Sample surveys are an important tool in Epidemiology, Social Science, and Economics, with applications ranging from estimating the prevalence of risky sexual behavior to market research. Sample surveys are subject to various biases, some of which are discussed in the following sections. One of the distinguishing characteristics of a survey is that the individuals are relied upon to report data about themselves, by answering questions. Because individuals could elect not to respond to a survey and could respond inaccurately or untruthfully, interpreting surveys requires care.

Hite Report on Women and Love, 1987

In the 1980s, Shere Hite sent out over 100,000 questionnaires to study how women feel about their relationships with men. Her findings were reported in Women and Love. The discussion of Hite's survey below is based on Sense and Nonsense of Statistical Inference: Controversy, Misuse, and Subtlety (C. Wang, 1993), which reports and discusses some of Hite's data, and contrasts her findings with contemporary findings of other researchers.

Hite's results were at odds with the outcome of previous studies of sexuality in the U.S. For example, Hite claims to have proved Kinsey (1953) wrong—the Kinsey report found that 26% of women had had extra-marital affairs; Hite found that 70% had. Hite says

Kinsey's figures were just rather low because he was conducting face to face interviews … inhibited reporting …

Hite saw a "flood of unhappiness" in the responses she received.

In contrast, other studies at close to the same time as Hite's, including the Redbook survey and a 1991 survey of a random sample of 2000 people by Patterson and Kim, found that about 31% of women had extra-marital affairs. This figure is larger than Kinsey's, which might reflect changes in behavior from the 1950s to the 1990s, but it is still far lower than the rate Hite found. Hite also says that the following is representative of women in the US:

Feminists have raised a cry against the many injustices of marriage—exploitation of women financially, physically, sexually, and emotionally. This outcry has been just and accurate.

In contrast, a 1987 Harris poll of 3000 people selected at random found that family life gives great satisfaction to both men and women: 89% said their relationship was satisfying.

What is going on here?

All these studies (by Hite, Kinsey, Redbook, Patterson and Kim, and Harris) are sample surveys. Some used face-to-face or telephone interviews; others used written questionnaires. The fraction of U.S. women in 1991 who had had extramarital affairs is a parameter of the population of U.S. women. The fraction of women in the 1991 Patterson and Kim survey who had had extramarital affairs is a statistic intended to estimate that parameter. Hite claimed that her statistics were more accurate estimators of the corresponding parameters than Kinsey's were, because her methodology was better: Using written questionnaires instead of face-to-face interviews made it more likely that respondents would answer truthfully, instead of fabricating a response out of shame, fear, or desire to please the interviewer. Hite cites the demographic breakdown of her sample as evidence that her results are accurate. The demographics of Hite's sample match the demographics of women in the United States at that time remarkably well. The tables below present some summary statistics for the two groups.

Hite Sample versus US Population: (annual income)/1000
Stratum	Study%	US%
<$2000	19	18.3
$2000-4000	12.0	13.2
$4000-6000	12.5	12.2
$6000-8000	10.0	9.7
$8000-10,000	7.0	7.4
$10,000-12,500	8.0	8.8
$12,500-15,000	5.0	6.2
$15,000-20,000	10.0	9.8
$20,000-25,000	8.0	6.4
>$25,000	8.5	8.2

Area
Stratum	Study%	US%
City/Urban	60	62
Rural	27	26
Small Town	13	12

Region
Stratum	Study%	US%
Northeast	21	22
North Central	27	26
South	31	33
West	21	19

Race
Stratum	Study%	US%
White	82.5	83.0
Black	13.0	12.0
Asian	1.8	2.0
Hispanic	1.8	1.5
Native American	0.9	1.0
Middle Eastern	0.3	0.5

The demographics match those of the general US population remarkably well. Does that mean Hite is right?

Bias in Surveys

The more plausible conclusion is that Hite's results were biased by nonresponse: The conclusions are based on the responses of women who chose to fill out the survey and mail it back. It would not be surprising if angry, unhappy, dissatisfied women were more likely to respond to Hite's questionnaires than were women in generally fulfilling relationships. (Anger is a stronger motivator than happiness, in my experience.) In contrast, it would be rather surprising if the large differences between Hite's results and those of the other surveys were due to dishonest responses to the other surveys.

Only 4,500 of the (over) 100,000 women who were sent questionnaires responded (4.5%). Roughly 95,500 of the people from whom a response was solicited did not respond to the survey: They are called nonresponders. The nonresponse rate in the Hite survey is thus (at least)

(number of nonresponders)/(responses solicited) = (100,000 − 4,500)/100,000 = 95,500/100,000 = 95.5%.

This is a very high rate of nonresponse. The 95,500 nonresponders might be quite different from the 4,500 responders with respect to the issues the survey sought to explore, even though the 4,500 responders were much like the general population of women with respect to basic demographic characteristics, such as age, ethnicity, income, and place of residence. If so, the difference between the responders and the nonresponders could bias the results. This is called nonresponse bias. Even if the 100,000 women who were sent surveys were a representative subset of all American women, the 4,500 who responded might not be. In fact, that is a plausible explanation for the differences between the results of the Hite report and the results of the other studies cited.

We can bound the extent to which nonresponse affects estimates of percentages by imagining what would have happened had all the nonresponders answered one way or the other. For example, if the nonresponders had all had extramarital affairs, the estimated rate of extramarital affairs would have been

( 70%×(4,500 responders) + 100%×(95,500 nonresponders) )/100,000 = 98.7%.

On the other hand, if none of the nonresponders had had an extramarital affair, the estimated rate of extramarital affairs would have been

( 70%×(4,500 responders) + 0%×(95,500 nonresponders) )/100,000 = 3.2%.

The nonresponse could have biased the estimate upwards by (70% − 3.2%) = 66.8%, or could have biased the estimate downwards by (98.7% − 70%) = 28.7%.
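The bounding argument above is simple arithmetic, and a short Python sketch may make it easier to reuse with other numbers. The function name and its arguments are hypothetical, introduced only for this illustration; the inputs are the Hite figures quoted in the text.

```python
def nonresponse_bounds(responders, solicited, rate_among_responders):
    """Range of rates consistent with the data, over everyone solicited.

    Lower bound: no nonresponder has the attribute.
    Upper bound: every nonresponder has the attribute.
    """
    nonresponders = solicited - responders
    with_attribute = rate_among_responders * responders
    lower = with_attribute / solicited
    upper = (with_attribute + nonresponders) / solicited
    return lower, upper

# Hite survey: 4,500 responders out of 100,000 solicited; 70% among responders.
lower, upper = nonresponse_bounds(4_500, 100_000, 0.70)
print(f"rate among all solicited lies between {lower:.2%} and {upper:.2%}")
print(f"possible upward bias:   {0.70 - lower:.2%}")
print(f"possible downward bias: {upper - 0.70:.2%}")
```

The sketch prints two decimal places; the figures in the text are the same quantities rounded to one decimal place.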


The Hite report might be evidence that women's dissatisfaction in their relationships with men (and perhaps infidelity) is fairly evenly distributed throughout our society; but, primarily because of nonresponse, the survey says very little about the prevalence of dissatisfaction and infidelity. Proportional representation in the sample of various subgroups of the population does not ensure that the sample is like the population with respect to the variables of interest. In the Hite survey, it is reasonable to conclude that those women who responded to the questionnaire were not typical of the general population of U.S. women with regard to the variables being studied, even though they were like the general population with regard to various demographic variables: the factors that influence whether a woman responded to the Hite survey probably are strongly associated with the variables the survey studied.

There are many other potential sources of bias in sample surveys. The following list summarizes some of them:

Frame bias. As discussed later, the sampling frame from which individuals are selected to participate in a survey can differ from the general population with regard to the variables studied. A classic example of this is the 1936 Literary Digest poll of 10 million people, which attempted to predict the outcome of the 1936 presidential election. The poll predicted that Landon (Republican) would beat Roosevelt (Democrat) by roughly 57% to 43%; instead, Roosevelt won. The 10,000,000 people in the poll were subscribers to Literary Digest and people with telephone service. In those days, both were relatively expensive, and the people on the list were predominantly Republican. The sample was not representative of the voting population; it was biased in favor of the Republican candidate. This is an example of a sample with built-in bias. See How to Lie with Statistics (Huff, 1993) for more examples.
Question wording. The wording of a question can influence the response enormously. For example, which of the following do you think would elicit more "pro-life" responses:
- Should a woman have control over her own body, including her reproductive system?
- Should a doctor be allowed to murder unborn children who cannot defend themselves?
Sensitive topics. It is hard to get honest answers to questions on sensitive topics, including income, sex, hygiene, and nefarious or illegal behavior, such as tax evasion or cheating on exams. Hite probably was correct that the anonymity of her survey increased the truthfulness of responses to questions about infidelity: Women who had extramarital affairs were more likely to report so than they would have been had they been asked directly by a researcher.
Interviewer bias. Interviewers can bias the results consciously or unconsciously, in a variety of ways. For example, if the interviewer has any discretion in selecting subjects to interview, the sample is likely to be biased towards persons the interviewer finds approachable or appealing. The interviewer's appearance and demeanor affect people's willingness to respond. Also, most interviewees unconsciously seek to please the interviewer. The appearance and demeanor of the interviewer influence what subjects imagine the interviewer to be thinking. A man with a military haircut wearing a business suit asking whether there should be more homeless shelters probably would get different responses than would a shabbily dressed woman with an infant in her arms.

See Huff (1993) for more discussion and more examples. Hite was concerned primarily with the third and fourth of these sources of bias: People tend to color their responses when the topic is sensitive, and when they are worried that the interviewer is judging them, or that their responses could have negative repercussions for them. There are techniques that try to reduce this bias, but it is hard to eliminate completely. Conscious or unconscious bias in selecting subjects to interview can be eliminated by selecting subjects at random.

Sampling Designs

This section introduces terminology and a taxonomy of sampling designs: strategies for drawing a sample from a population of people or other units in order to draw inferences about that population. The terminology and classification apply not only to sample surveys, but to sampling generally.

The sample size is the number of units the sample contains. Commonly, the sample is chosen not from the entire population of units, but from a subset of the population, or a different population, called a frame or sampling frame. If the value of the parameter differs for the population and the frame, this can introduce frame bias into estimates computed from the sample, as described earlier in this chapter. Throughout the remainder of this chapter, n will denote the sample size and N will denote the number of units in the frame.

Cluster Sampling

Sometimes the units are not sampled directly, one at a time; rather, clusters of units are selected. The basic element sampled is called the sampling unit. The sampling units can be the same as the population units, but it is quite common for the sampling units to be clusters of population units; then the sample is called a cluster sample.

Stratified Sampling

Sometimes the population is divided into non-overlapping groups, called strata (singular: stratum), and a sample is taken separately from each stratum. This is called stratified sampling; the sample is called a stratified sample. If the variable of interest varies within the population in a way that is associated with membership in different strata, stratified sampling can yield smaller errors than simpler sampling designs, for a given sample size. However, it is generally harder to quantify the errors rigorously than it is for a sample drawn at random directly from the entire population (e.g., a simple random sample or a random sample with replacement).
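As a rough sketch of the mechanics, stratified sampling amounts to drawing a separate sample within each stratum. The example below assumes that the sample within each stratum is a simple random sample, which is one common choice but not the only one. The frame, the stratum names, and the sample sizes are all made up for illustration.

```python
import random

def stratified_sample(strata, sizes, rng):
    """Draw a simple random sample separately from each stratum.

    strata: dict mapping stratum name -> list of units in that stratum
    sizes:  dict mapping stratum name -> number of units to draw from it
    """
    return {name: rng.sample(units, sizes[name])
            for name, units in strata.items()}

# Hypothetical frame, divided into non-overlapping strata by region.
strata = {
    "Northeast": [f"NE-{i}" for i in range(200)],
    "South":     [f"S-{i}" for i in range(300)],
    "West":      [f"W-{i}" for i in range(100)],
}
sizes = {"Northeast": 20, "South": 30, "West": 10}

sample = stratified_sample(strata, sizes, random.Random(1))
for name, units in sample.items():
    print(name, len(units))
```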

Multistage Sampling

Sometimes it is easier to draw the sample in stages. For example, to draw a sample of persons in the United States, we might proceed as follows:

Make a list of states.
Select two states; then list the counties in those states.
Select two counties from each of the states, and list the blocks in the counties.
Select three blocks from each county, and list the housing units in the blocks.
Select five housing units from each of the blocks, and list the residents of the housing units.
Select one person from each of the selected housing units.

Such an approach is called multistage sampling. It can be much more economical than trying to sample directly from the population. In this example, the multistage approach needs a list of persons in a few dwellings, a list of dwellings in a few blocks, a list of blocks in a few counties, a list of counties in a few states, and a list of states. Constructing or obtaining these lists is much easier and much less error-prone than trying to construct a list of all persons in the United States, which we would need to sample persons directly in a single stage.
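The stages listed above can be sketched as nested draws. Everything about the frame here is hypothetical: the helper `list_children`, the labels, and the numbers of counties, blocks, housing units, and residents are invented for illustration, and each stage is assumed to be a simple random draw. The point is only that a list is compiled for a unit after that unit has been selected.

```python
import random

rng = random.Random(2)   # fixed seed, for repeatability

def list_children(parent, kind, count):
    """Hypothetical listing step: enumerate the sub-units of one selected unit."""
    return [f"{parent}/{kind}{i}" for i in range(count)]

states = [f"state{i}" for i in range(50)]            # stage 0: list of states

sample = []
for state in rng.sample(states, 2):                               # two states
    for county in rng.sample(list_children(state, "county", 20), 2):
        for block in rng.sample(list_children(county, "block", 40), 3):
            for unit in rng.sample(list_children(block, "unit", 30), 5):
                residents = list_children(unit, "person", rng.randint(1, 5))
                sample.append(rng.choice(residents))  # one person per unit

print(len(sample))   # 2 states x 2 counties x 3 blocks x 5 units x 1 person = 60
```

Only 2 county lists, 4 block lists, 12 housing-unit lists, and 60 resident lists are ever built, which is the economy the text describes.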

Hybrid Sampling Designs

There are also more complicated sampling designs that combine these strategies. For example, in the description of multistage sampling in the previous section, had we taken all the residents of the selected housing units to be in the sample, instead of just one resident of each housing unit in the sample, we would have been sampling clusters of people instead of individuals. This is called multistage cluster sampling.

Ways to Draw Samples

Whether we sample individuals or clusters of individuals, sample in one stage from the frame or in several stages, sample from the frame as a whole or from strata separately, we need to pick the sampling units that comprise the sample. How the sampling units are selected is crucial in determining whether the sample is representative, and whether it is possible to quantify the uncertainty of estimates of population parameters based on the sample.

Convenience Samples

A convenience sample is a sample that consists of units that are readily accessible to the investigator or data collector—units it is convenient to examine. The data collector has complete latitude in deciding which units to include in the sample. For example, suppose I seek to estimate the fraction of University of California at Berkeley faculty who are registered Republicans. I might just start at the beginning of the campus telephone directory, and call faculty until I reach 100 people who are willing to answer my question. Alternatively, I might go to the Faculty Club at lunchtime and interview the first 100 faculty who consent. Either would be a convenience sample. A convenience sample typically is not representative of the population, and usually it is not possible to quantify the error in extrapolating from a convenience sample to the entire frame or population. For example, in the second case, if the membership of the Faculty Club is disproportionately Republican compared with the faculty at large, the sample would tend to be biased. The Hite study essentially used a sample of convenience.

Quota Samples

A quota sample is a sample picked to match the population with respect to some summary characteristics. For example, in an opinion poll, one might want the proportions of various ethnicities in a sample to match the proportions of ethnicities in the overall population. Like convenience samples, quota samples leave latitude for the data collectors to select the individuals who will comprise the sample—subject to the constraint that the summary characteristics of the sample match their target values. Generally, this is a bad idea: Unconscious biases in the interviewers' selections can have strong effects on which individuals end up in the sample, which can cause the sample to be unrepresentative of the population with respect to the variable being studied. This is called selection bias. As with a sample of convenience, usually it is not possible to quantify how closely representative of the population a quota sample is likely to be. Although the Hite study did not use quota sampling, the demographics of its sample matched the demographics of the population extremely well, but the results did not seem to be representative of the population (if the contemporaneous studies using random samples are trustworthy). This shows that matching some characteristics of the sample to characteristics of the population does not guarantee that the sample is representative of the population with regard to the variables of interest.

Systematic Samples

A systematic sample results from taking every kth unit from the frame, where k is chosen to give a sample of the desired size. For example, if there are 20,000 units in the frame and we want a sample of size 100, we would take every 200th unit to be in the sample. To take a systematic sample, the units in the frame have to be listed in some order. If the order is essentially haphazard, a systematic sample behaves much like a simple random sample, described later in this chapter. If the order of the units in the list is related to the value of the variable under study, a systematic sample can be biased. Systematic samples are cluster samples—the frame is divided into k clusters, and one of those k clusters comprises the sample. It is difficult to quantify the error that results from using a systematic sample. Systematic samples do not leave latitude for the data collector to select the units that comprise the sample, which can reduce deliberate and unconscious biases compared with convenience samples and quota samples.
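A minimal sketch of the example above, with 20,000 units and a sample of size 100, follows. The function name and the `start` argument are illustrative assumptions; the text does not say where in the list a systematic sample begins, so the starting position is left as a parameter.

```python
def systematic_sample(frame, n, start=0):
    """Take every k-th unit of the frame, where k = len(frame) // n.

    start is the 0-based position of the first unit taken; it is an
    illustrative parameter, assumed to satisfy 0 <= start < k.
    """
    k = len(frame) // n
    return frame[start::k][:n]

frame = list(range(1, 20_001))                      # units labeled 1..20,000
sample = systematic_sample(frame, 100, start=199)   # the 200th, 400th, ... units
print(len(sample), sample[:3], sample[-1])          # 100 [200, 400, 600] 20000
```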

Probability Samples

A probability sample is a sample drawn using a random mechanism to select the units from the frame to comprise the sample; that is, whether each unit in the frame is in the sample is a random event, with a specified probability. In a probability sample, one can specify ahead of time (before the sample is drawn) the chance that each unit in the frame will end up in the sample. The probability of being selected need not be the same for every unit in the frame. A statistic computed from a probability sample is a random variable, because its value depends upon which units happen to be in the sample, and those units are chosen randomly. In a probability sample, the person collecting the data has no discretion about which units to include in the sample, so deliberate and unconscious biases cannot affect the choice of units. As a result, probability samples tend to be more representative of the general population than convenience samples and quota samples are, if the probability of drawing each unit is the same. Moreover, it is possible to quantify the error of estimators computed from a probability sample, which is not possible for samples of convenience, quota samples, or systematic samples.

Simple Random Samples

A simple random sample of size n from a frame containing N units is a probability sample drawn in such a way that every subset of n of the N units in the frame is equally likely to be the sample. (Each of those subsets has probability 1/C(N, n), where C(N, n) is the number of subsets of size n that can be formed from N units.) This is like writing a unique identifier for each unit in the frame on one of N otherwise identical cards, shuffling the cards well, then dealing the top n cards. Equivalently, taking a simple random sample is like writing an identifier for each of the N units in the population on N otherwise identical tickets, putting the tickets into a box, stirring them vigorously, and drawing n of the tickets without looking; then considering the sample to consist of those units whose identifiers were on the tickets drawn. Conceptually, a simple random sample is a sample drawn without replacement as follows: In the first step, each of the N units is equally likely to be drawn. At the second step, each of the N−1 remaining units is equally likely to be drawn, etc., until, at the nth step, each of the N−n+1 remaining units is equally likely to be drawn. After n steps, we have a simple random sample of size n. In practice, simple random samples are drawn using a computer to generate pseudo-random numbers, as follows: Each unit in the population is assigned (independently) a random number between zero and one. The sample consists of those units that were assigned the n largest random numbers. If there are ties, they are broken randomly and independently (e.g., by tossing a coin). In simple random sampling, the chance that any particular unit in the frame is included in the sample is n/N.
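The computer recipe just described (assign each unit a random number, keep the n largest) can be sketched as follows. The frame of 100 labeled units is hypothetical. One simplification is assumed: ties among the random numbers are so unlikely with floating-point values that this sketch lets the sort order break them rather than tossing a coin.

```python
import random
from fractions import Fraction
from math import comb

def simple_random_sample(frame, n, rng):
    """Assign each unit an independent random number; keep the n largest."""
    keyed = [(rng.random(), i) for i in range(len(frame))]
    keyed.sort(reverse=True)              # largest random numbers first
    return [frame[i] for _, i in keyed[:n]]

frame = [f"unit{i}" for i in range(1, 101)]        # hypothetical frame, N = 100
sample = simple_random_sample(frame, 10, random.Random(3))
print(sample)

# Check of the inclusion chance: a given unit belongs to C(N-1, n-1) of the
# C(N, n) equally likely subsets, and that ratio reduces to n/N.
N, n = len(frame), 10
inclusion_chance = Fraction(comb(N - 1, n - 1), comb(N, n))
print(inclusion_chance)                            # 1/10
```

Python's standard library offers `random.sample(frame, n)`, which also draws without replacement; the version above is written out only to mirror the recipe in the text.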

Systematic Random Samples

Suppose for the moment that n, the sample size, is a divisor of N, the frame size, so that N=n×k, where k is an integer. Assume that the elements of the frame are listed in some order. If we took every kth element of the frame, we would get a systematic sample of size n, as described above. If we pick a random number K between 1 and k, uniformly, and take every kth element in the list, starting with the Kth element, we get a systematic random sample of size n. In this example, we can think of the frame as n groups of k units. The systematic sample picks the first element of each of these groups. The systematic random sample picks the Kth element of each of these groups, where K is random, but the same for every group. Systematic random samples have the following characteristics:

Every element in the frame has an equal chance of being in the sample.
Not every subset of size n of the frame has the same chance of being the sample. In particular, the sample must be one of k subsets, which are equally likely. In contrast, a simple random sample can be any of the C(N, n) possible subsets of n units from the frame (C(N, n) is the number of ways to choose n of N units), and those subsets are equally likely.

For example, suppose the frame consists of N=100 elements, listed in some order, and we seek a sample of size n=10. If we take every k=10th element, that will yield a sample of the size we desire. Suppose we start with the Kth element, where K is chosen at random and is equally likely to be 1, 2, …, or 10. Then there are ten equally likely samples, consisting of elements

The 10 equally likely samples for systematic random sampling of 10 elements from 100
sample 1	1, 11, 21, 31, 41, 51, 61, 71, 81, 91
sample 2	2, 12, 22, 32, 42, 52, 62, 72, 82, 92
sample 3	3, 13, 23, 33, 43, 53, 63, 73, 83, 93
sample 4	4, 14, 24, 34, 44, 54, 64, 74, 84, 94
sample 5	5, 15, 25, 35, 45, 55, 65, 75, 85, 95
sample 6	6, 16, 26, 36, 46, 56, 66, 76, 86, 96
sample 7	7, 17, 27, 37, 47, 57, 67, 77, 87, 97
sample 8	8, 18, 28, 38, 48, 58, 68, 78, 88, 98
sample 9	9, 19, 29, 39, 49, 59, 69, 79, 89, 99
sample 10	10, 20, 30, 40, 50, 60, 70, 80, 90, 100

Each unit is in exactly one of these possible samples, so the chance any particular element is in the sample is 1/10 = 10%. The chance that the sample consists of elements {1, 11, 21, …, 91} is 1/10. The chance that the sample consists of elements {1, 2, 3, …, 10} is zero.
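These chances can be checked by enumerating the k possible samples directly. The sketch below uses the same N = 100 and n = 10 as the example; the helper name `chance` is introduced only for this illustration.

```python
from fractions import Fraction

N, n = 100, 10
k = N // n

# The k possible samples: start at the K-th element, then take every k-th.
possible = [list(range(K, N + 1, k)) for K in range(1, k + 1)]

def chance(subset):
    """Chance that the systematic random sample equals the given set of units."""
    hits = sum(1 for s in possible if set(s) == set(subset))
    return Fraction(hits, len(possible))

# Chance that a given unit is included: the fraction of samples containing it.
inclusion = {u: Fraction(sum(u in s for s in possible), k)
             for u in range(1, N + 1)}

print(set(inclusion.values()))      # {Fraction(1, 10)}: the same for every unit
print(chance(range(1, 100, 10)))    # 1/10 for {1, 11, 21, ..., 91}
print(chance(range(1, 11)))         # 0 for {1, 2, 3, ..., 10}
```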

Systematic random sampling is better than systematic sampling, but usually is not as good as simple random sampling, and is not much easier to implement. Systematic random sampling is in effect random cluster sampling: In the previous example, there were 10 clusters in all. The probability of drawing each cluster is 1/10. Because every cluster contains the same number of units and the chance of drawing each cluster is the same, the chance each unit is in the sample is the same. (However, because only the clusters are possible samples, the chance each subset of 10 elements comprises the sample is not the same.) More generally, it is possible to draw cluster samples randomly in such a way that every unit has the same chance of being in the sample; one simply makes the probability of selecting each cluster proportional to the number of units in the cluster. However, unless each cluster contains only one unit, not every subset of n units is equally likely to be the sample, so typically cluster sampling cannot yield a simple random sample.

To illustrate the differences among sampling designs, suppose we wish to estimate the average size of undergraduate classes at the University of California, Berkeley, in the current semester. Consider seven approaches:

Take a random sample of 100 courses from the current course schedule and average their sizes.
Take a random sample of 50 students, list the courses each student is taking, and average the list.
Take a random sample of 50 instructors, list the courses they are teaching, and average the combined list.
Take separate random samples of 5 courses from each department in the university, and average the sizes of the courses in all the samples.
Take a random sample of 5 science/engineering departments and a random sample of 5 humanities/professional departments; for each department in the sample, list the sizes of all the courses. Average the sizes of the courses in the lists.
Take a random sample of 5 science/engineering departments and a random sample of 5 humanities/professional departments. For each department in the sample, take a random sample of 5 instructors. Average the sizes of the courses those instructors are teaching in the current term.
Pick a random number K between 1 and 10. Starting with the Kth course in the course schedule, list the size of every 10th class. Average the list of sizes.

The population we wish to sample is the set of courses taught in the current semester. In the first approach, the sampling frame is the same as the population, and the sampling units are the population units. Drawing a sample in this case is like drawing tickets from a box of numbered tickets, with one ticket for each course. The size of the course is written on the ticket.

In the second approach, the sampling units are clusters of courses: all courses taken by a single student. The frame is a collection of lists of course sizes, one for each enrolled student. The size of every course with at least one student enrolled is written on at least one of the lists in the frame. The size of each course has a chance of being represented in the sample, but different courses have different chances: the larger the enrollment, the larger the chance. In the second approach, drawing the sample is like drawing tickets from a box of numbered tickets, with several numbers on each ticket. There is one ticket for each student; written on the ticket are the sizes of the courses that student is enrolled in. The sizes of courses with large enrollments will appear on many more tickets than the sizes of courses with small enrollments. The average size of courses taken by students will tend to be higher than the average size of courses overall.

In the third approach, the sampling units again are lists of course sizes, one list for each instructor who is teaching a course. Drawing the sample is like drawing tickets from a box with one ticket for each instructor, with the sizes of the courses taught by the instructor written on it. The size of each course appears on only one ticket, barring ties. Each distinct class size appears on as many tickets as there are classes of that size. Taking the set of tickets as a whole, there is one number for each class. The average of the numbers on the tickets in the sample is an unbiased estimate of the average class size in the first and third approaches, but is a biased estimate for the second approach. This is another example of a sample with built-in bias; see Huff (1993) for more examples. The second and third approaches are examples of cluster sampling.
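The bias in the second approach can be seen with a toy box model. The sketch below uses made-up course sizes (not real enrollment data) and the simplifying assumption that each student takes exactly one course, so a course of size s contributes s student tickets, each showing the number s.

```python
from statistics import mean

# Hypothetical course sizes: four small seminars and one large lecture.
course_sizes = [10, 10, 10, 10, 160]

# What we want to estimate: the average over courses (one ticket per course).
average_over_courses = mean(course_sizes)

# Simplifying assumption: each student takes exactly one course, so the box for
# the second approach holds one ticket per student showing that course's size.
student_tickets = [size for size in course_sizes for _ in range(size)]
average_over_students = mean(student_tickets)

print("average over courses: ", average_over_courses)
print("average over students:", average_over_students)
```

With these made-up sizes the average over courses is 40, while the average over student tickets is 130: the large lecture appears on 160 of the 200 tickets, so a random sample of students tends to overstate the average class size.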

The fourth approach is an example of stratified sampling. The strata are department course offerings. Because the same number of courses (5) is drawn from every department, a course in a department that offers few courses has a larger chance of being in the sample than a course in a department that offers many courses. Averaging the sizes of the courses in the sample therefore typically would give a biased estimate of the average class size, unless the number of courses drawn from each department is proportional to the number of courses that department offers, or the courses are weighted in the average to compensate. Drawing the sample is like drawing from a box of numbered tickets, with one number on each ticket, but a different number of tickets for courses in different departments: the box would have fewer tickets per course for departments that offer many courses than for departments that offer few courses.
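One way to compensate is to weight each department's sample average by the number of courses the department offers. The sketch below uses two hypothetical departments with made-up course sizes; the names and numbers are assumptions chosen so that the arithmetic is easy to follow.

```python
import random
from statistics import mean

# Hypothetical strata: department name -> sizes of all its courses.
departments = {
    "large_dept": [20] * 100,  # offers many small courses
    "small_dept": [100] * 10,  # offers few large courses
}

rng = random.Random(0)
# Stratified sample: 5 courses drawn at random from each department.
samples = {dept: rng.sample(sizes, 5) for dept, sizes in departments.items()}

# The parameter of interest: the average size over all courses.
true_average = mean(size for sizes in departments.values() for size in sizes)

# Naive estimate: pool the sampled courses and average them.
unweighted = mean(size for sample in samples.values() for size in sample)

# Weighted estimate: weight each department's sample average by its number of courses.
total_courses = sum(len(sizes) for sizes in departments.values())
weighted = sum(
    len(departments[dept]) * mean(samples[dept]) for dept in departments
) / total_courses

print(true_average, unweighted, weighted)
```

Here the pooled average is 60, far above the true average of about 27.3, because the small department's large courses are overrepresented; the weighted average recovers the true value in this toy case because course sizes do not vary within a department.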

The fifth approach is an example of stratified cluster sampling: There are two strata, the humanities and professional courses (as a collection), and the science and engineering courses (as a collection). The sampling units are departments, and the clusters are lists of sizes of all courses offered by each department.

The sixth approach is an example of stratified multistage cluster sampling: There are two strata, the humanities and professional courses, and the science and engineering courses. The first stage of sampling is to select departments; the second stage is to select instructors. Instructors correspond to clusters of courses.

The seventh approach is an example of a systematic random sample.
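The seventh approach can be sketched in a few lines. The function name, the seed, and the schedule of 200 made-up class sizes are all hypothetical, introduced only for illustration.

```python
import random

def systematic_random_sample(schedule, step, rng):
    """Take every step-th entry of schedule, starting at a random one of the first step entries."""
    k = rng.randint(1, step)         # random K between 1 and step, inclusive
    return k, schedule[k - 1::step]  # the K-th entry (1-based), then every step-th after it

rng = random.Random(7)
# Hypothetical course schedule: 200 made-up class sizes.
schedule = [rng.randint(5, 300) for _ in range(200)]

k, sample = systematic_random_sample(schedule, 10, rng)
estimate = sum(sample) / len(sample)
print("K =", k, "sample size =", len(sample), "estimated average class size =", estimate)
```

The only random ingredient is the starting point K; once K is known, the whole sample is determined.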

Sampling from Hypothetical Populations

Sometimes the population itself is hypothetical or fictitious. For example, consider administering an achievement test to a random sample of 10th grade students from a particular high school, and averaging the results. We might think of this as sampling test scores to try to estimate the average achievement score of 10th graders at the school. Under closer inspection, this gets squishy: The 10th graders at the school do not all have scores; only those who were tested have scores. To tighten up the logic, we need to imagine a hypothetical population: We seek to estimate the average of the population of scores that would have existed had every 10th grader in the school been tested. This population is fictitious: it does not exist.

Here is an even more extreme example: Suppose we wish to estimate the effectiveness of an SAT preparatory course. We might imagine drawing a random sample of high-school seniors. For each senior in the sample, we toss a coin. If the coin lands heads, the student takes the preparatory course; otherwise, the student does not. We compare the average SAT scores of the students who took the course with the average of those who did not. We can think of the difference of the averages as an estimate of the effect of the preparatory course. As in the previous example, clarity recedes when we look closely at the procedure. The idea is that we are estimating what the average score of students who did not take the preparatory course would have been if they had taken the preparatory course, and we are estimating what the average scores of the students who took the preparatory course would have been had they not taken the course. Moreover, we are estimating this even for students who did not take the SAT at all!

The ideas can be tightened up by imagining a hypothetical population of tickets, one for each student. Each student's ticket has two numbers written on it: the SAT score he or she would get without taking the preparatory course, and the SAT score he or she would get having taken the preparatory course. For each student, we generate a random number with the possible values {0, 1, 2}. If the value of the random number is 0, the student's ticket is not drawn, and we do not observe any number on the student's ticket. If the value of the random number is 1, the student's ticket is drawn, and we observe the SAT score the student would get without taking the preparatory course. If the random number is 2, we observe the SAT score the student would get after taking the preparatory course.

Consider the difference between the average score observed for students for whom the random number was 2 and the average score observed for students whose random number was 1. That difference is an estimate of the difference between the average of the list of SAT scores that would have been available had all the students taken the preparatory course and the SAT, and the average of the list of SAT scores that would have been available had none of the students taken the preparatory course, but all taken the SAT. Both lists are hypothetical and counterfactual: It is impossible for both lists even to exist, because every student would have had to take the preparatory course and not to take the preparatory course. Nonetheless, we can think of that difference of averages of hypothetical lists as the average effect of taking the preparatory course. The difference of observed averages is an estimate of that difference. If every student has the same chance of being assigned 0, 1, or 2 as his or her random number, this difference of observed averages is an unbiased estimate of the difference of averages of the hypothetical lists. The assumption that different students have the same chance of taking the SAT and of taking the preparatory class is not very reasonable, however: Students who would do well on the SAT probably have a greater tendency to take the SAT than students who would do poorly—they are the students destined for college. Students who take the preparatory classes might be the particularly motivated and diligent students, who would tend to do better than average on the SAT even if they did not take the preparatory class. It is very hard to estimate the sizes of these confounding effects.
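The two-number ticket model lends itself to simulation. In the sketch below every number is invented: the score distribution, the size of the course effect, and the self-selection rule are assumptions made only to illustrate the contrast between equal-chance assignment and confounded assignment.

```python
import random
from statistics import mean

rng = random.Random(1)

# Hypothetical tickets: (score without the course, score with the course).
tickets = []
for _ in range(3000):
    without = rng.gauss(1000, 150)
    tickets.append((without, without + rng.gauss(30, 10)))

# Difference between the averages of the two hypothetical lists.
# In practice this can never be computed, because only one number per ticket is seen.
true_effect = mean(with_course - without for without, with_course in tickets)

def difference_of_observed_averages(assign):
    """assign(ticket) returns 0 (not observed), 1 (no course), or 2 (course)."""
    no_course, course = [], []
    for without, with_course in tickets:
        label = assign((without, with_course))
        if label == 1:
            no_course.append(without)
        elif label == 2:
            course.append(with_course)
    return mean(course) - mean(no_course)

# Every student has the same chance of being assigned 0, 1, or 2.
randomized = difference_of_observed_averages(lambda ticket: rng.choice([0, 1, 2]))

# Assumed self-selection: students who would score above 1000 anyway are more
# likely to take the course; the others are more likely to skip it.
def self_selected(ticket):
    weights = [1, 1, 3] if ticket[0] > 1000 else [1, 3, 1]
    return rng.choices([0, 1, 2], weights=weights)[0]

confounded = difference_of_observed_averages(self_selected)

print(true_effect, randomized, confounded)
```

Under equal-chance assignment the difference of observed averages lands near the average effect of about 30 points; under the assumed self-selection it is several times larger, even though the course is no more effective.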

Summary

A parameter is a numerical property of a population, which is a collection of units. A sample is a collection of units from a population; the number of units in the sample is called the sample size. A statistic is a number computed from a sample. Statistics typically are used to estimate parameters; such statistics are called estimators. Depending on circumstances, collecting data about a sample of units instead of an entire population of units can be the only way, the most economical way, or the most accurate way to estimate the value of a population parameter. A sample survey is a sample of opinions or other properties of people that relies on the individuals in the sample to provide the data, for example, through interviews or by filling out a questionnaire. Sample surveys tend to suffer to various extents from nonresponse bias, which is caused by systematic differences in the variables studied between the individuals who participate willingly in the survey and those who elect not to participate. If the nonresponse rate, the fraction of people who refuse to participate, is large, the nonresponse bias can be large, and the results of the survey should not be trusted. Even people who participate willingly in a survey might not answer questions about sensitive subjects truthfully. There are methods, such as randomized response, that encourage truthful reporting by guaranteeing anonymity.

Samples can be drawn from populations in various ways. Rarely is the sample drawn from the entire population; usually, the sample is drawn from a frame of units that are accessible or that can be listed. Sometimes, instead of drawing units one at a time, clusters of units are drawn from the frame; this is called cluster sampling. The fundamental unit of sampling is called the sampling unit, which can be an individual unit of the frame or a cluster of units. Sometimes, instead of drawing sampling units from the frame as a whole, the frame is partitioned into strata and samples are drawn separately from each stratum; this is called stratified sampling. The samples need not be drawn in one step: The frame can be divided hierarchically and the sample drawn in stages. Such multistage sampling is particularly useful when it is difficult or impractical to list the entire frame, but easier to list the elements at each level of the hierarchy. Systematic sampling consists of listing the units in the frame in some order, then taking every kth unit in the list. Systematic sampling is a special case of cluster sampling. Cluster sampling, stratified sampling, and multistage sampling can be combined in various ways to give hybrid sampling designs, including multistage stratified cluster sampling. Ultimately, which sampling units comprise the sample needs to be decided in some way.

A convenience sample selects sampling units on the basis of their accessibility to the investigator or data collector. The data collector has complete latitude to select the units that comprise the sample. Convenience samples tend not to be representative of the frame, and the extent to which they fail to be representative is hard to quantify. A quota sample selects sampling units in a way that guarantees that some summary statistics of the sample match the corresponding parameters of the population or the frame, but within that constraint, leaves considerable latitude in the selection of sampling units. Quota samples also tend to be unrepresentative: Matching summary statistics does not ensure that the sample is like the population or frame with respect to the variables studied. It is hard to quantify the extent to which a quota sample fails to be representative of the population or frame.

Probability samples are samples drawn using a random mechanism for which it is possible to specify the chance that any particular unit will be in the sample. Probability samples do not allow the investigator or data collector any choice in the units to include. Eliminating the freedom to choose tends to reduce conscious and unconscious biases in the data collection. Two common ways to draw probability samples are simple random sampling and systematic random sampling.

Simple random sampling is random sampling without replacement: Every subset of n units from the frame is equally likely to be the sample. In simple random sampling, the chance that any unit in the frame is in the sample is the same. In simple random sampling where the frame is identical to the population, it is possible to quantify the probable error in an estimate of a population parameter computed from the sample.
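For a small frame, the definition can be verified by listing every possible sample. The sketch below uses a hypothetical frame of 5 units and samples of size 2; in practice, a function such as `random.sample` from Python's standard library draws a sample without replacement.

```python
from fractions import Fraction
from itertools import combinations

frame = ["a", "b", "c", "d", "e"]  # hypothetical frame of N = 5 units
n = 2                              # sample size

# In simple random sampling, each of these subsets is equally likely to be the sample.
possible_samples = list(combinations(frame, n))

# Chance that each unit is in the sample: the fraction of subsets that contain it.
chance_in_sample = {
    unit: Fraction(sum(unit in s for s in possible_samples), len(possible_samples))
    for unit in frame
}

print(len(possible_samples), "possible samples")
print(chance_in_sample)
```

Each unit appears in 4 of the 10 possible samples, so its chance of being in the sample is 2/5, which is n/N.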

Systematic random sampling consists of taking every kth element of the frame to be in the sample, starting at a random place in the list of units in the frame. Systematic random sampling can behave much like simple random sampling, but tends not to be as representative, especially if there is structure in the listing of the frame. Systematic random samples can be designed so that the chance that each unit in the frame is in the sample is the same, but not so that the chance that each subset of n units in the frame comprises the sample is the same. For that reason, simple random sampling is better than systematic random sampling.

Some experiments that compare groups of subjects who receive different treatments can be thought of as sampling from hypothetical populations.

Key Terms

average
bias
cluster sample
convenience sample
estimator
frame
independent
multistage sampling
nonresponse bias
nonresponse rate
parameter
population
population mean
population parameter
probability
probability sample
quota sampling
random sample
random variable
sample mean
sample size
sample survey
sampling frame
sampling percentage
sampling unit
selection bias
simple random sample
statistic
stratified sample
stratum/strata
systematic sample
unbiased
unit