Suppose you want to know how many fish there are in a (small) pond. You could try to catch all of them, but you could never be completely sure that you had. There is a method called capture-recapture that estimates the number of fish in a different way, using simple random sampling twice.
Here is the idea. You go to random places in the pond and catch 10 fish (without replacement), tag them, and release them. Some time later, after the tagged fish have had a chance to mix with those that were not caught, you go to another set of random places in the pond, catch 10 fish (without replacement), and record the number of those fish that are tagged. The crucial assumption is that the two catches behave like simple random samples of the fish in the pond.
In effect, the fish are like tickets in a box, but tickets that mix themselves up by swimming around. At the second stage, one thinks of the fish that were tagged at the first stage as tickets labeled "1," and the fish that were not tagged as tickets labeled "0."
Suppose there really are N = 50 fish in the pond. At the second stage, the number of previously tagged fish among those caught would have a hypergeometric distribution with parameters N = 50 (the number of fish in the pond), n = 10 (the number of fish caught the second time), and G = 10 (the number of tagged fish in the pond). We actually know n and G. Once we catch the fish the second time, we will know g, the number of tagged fish in the second sample. How might we estimate N?
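One can tabulate these chances directly; here is a short sketch using scipy's hypergeometric distribution, with the numbers from the example above (the variable names are ours):

```python
# Sketch: chance of catching g tagged fish at the second stage,
# assuming N = 50 fish in the pond, G = 10 tagged, n = 10 caught.
from scipy.stats import hypergeom

N, G, n = 50, 10, 10           # pond size, tagged fish, second catch size
for g in range(n + 1):
    # hypergeom.pmf(g, N, G, n): P(g tagged fish among the n caught)
    print(f"P(g = {g:2d}) = {hypergeom.pmf(g, N, G, n):.4f}")
```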
We can make an analogy with the rule in Chapter 11 for deciding which box we were drawing from in the "Let's Make a Deal" problem. There we considered drawing with replacement from one of two possible boxes. (One of the boxes had one ticket labeled "1" and one ticket labeled "0;" the other had two tickets labeled "1" and one ticket labeled "0.") Here we might be drawing from any one of an infinite number of boxes. The number of tickets labeled "1" in each possible box is 10, the number of tagged fish. The number of tickets labeled "0" is at least the number of untagged fish caught at the second stage, but could be any larger number. In Chapter 11, we decided we were drawing from whichever of the two boxes made the data more likely; that is, we decided we were drawing from Box 1 if
P(drawing observed number of tickets labeled "1" from Box 1)
was greater than
P(drawing observed number of tickets labeled "1" from Box 2).
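To make the rule concrete, here is a small numerical sketch of the two-box comparison; the number of draws and the observed count are hypothetical illustrations, not data from Chapter 11:

```python
# Sketch: the two-box decision rule, drawing with replacement.
# Box 1 holds tickets {1, 0}: chance of a "1" is 1/2.
# Box 2 holds tickets {1, 1, 0}: chance of a "1" is 2/3.
from scipy.stats import binom

n_draws, k_ones = 6, 5          # hypothetical data: 5 ones in 6 draws
p_box1 = binom.pmf(k_ones, n_draws, 1/2)
p_box2 = binom.pmf(k_ones, n_draws, 2/3)
choice = "Box 1" if p_box1 > p_box2 else "Box 2"
print(f"P(data|Box 1) = {p_box1:.4f}, P(data|Box 2) = {p_box2:.4f} -> {choice}")
```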
Here, we have an infinite number of possible boxes, but we can use the same rule: decide we are drawing from whichever box would make drawing the observed number of tickets labeled "1" most probable. The resulting estimate is called the maximum likelihood estimate of the number of tickets in the box.
In Chapter 11, we were drawing with replacement; here we are drawing without replacement, but the principle still makes sense. The chance of drawing g tickets labeled "1" in a simple random sample of n tickets from a box with N tickets in all, of which G are labeled "1" and N-G are labeled "0" is given by the hypergeometric distribution:
$$
P(\text{draw } g \text{ tickets labeled ``1''}) = \frac{{}_{G}C_{g} \times {}_{N-G}C_{n-g}}{{}_{N}C_{n}}.
$$
A reasonable rule for estimating N (deciding which box we are drawing from) would be to pick N so that the probability of observing what we in fact observe is as large as possible. The result is called the maximum likelihood estimate of N. Maximum likelihood estimates are used very widely in Statistics, but they are not the only estimates in wide use.
We can work out the maximum likelihood estimate explicitly: expanding the combinations in terms of factorials gives
$$
\begin{aligned}
P(\text{draw } g \text{ tickets labeled ``1''}) &= \frac{G!\,(N-G)!\,(N-n)!\,n!}{g!\,(G-g)!\,(n-g)!\,(N-G-n+g)!\,N!} \\
&= \text{constant} \times \frac{(N-G)!\,(N-n)!}{(N-G-n+g)!\,N!} \\
&= \text{constant} \times \frac{(N-G)(N-G-1)(N-G-2)\cdots(N-G-n+g+1)}{N(N-1)(N-2)\cdots(N-n+1)}.
\end{aligned}
$$
This is largest when N = n×G/g (suitably rounded to an integer … ).
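One can check this numerically: the following brute-force sketch (our own, in the same notation) evaluates the hypergeometric likelihood at every plausible N and compares the maximizer to n×G/g:

```python
# Sketch: find the N that maximizes the hypergeometric likelihood by brute force.
# The smallest possible N is G + (n - g): every tagged fish plus every
# untagged fish caught at the second stage.
from scipy.stats import hypergeom

G, n, g = 10, 10, 3                  # tagged, second catch size, tagged in catch
candidates = range(G + n - g, 501)   # 500 is an arbitrary upper bound
N_hat = max(candidates, key=lambda N: hypergeom.pmf(g, N, G, n))
print(N_hat, n * G / g)              # 33 and 33.33...: the argmax is n*G/g, rounded down
```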
The U.S. Census, mandated by the Constitution, tries to enumerate all inhabitants of the United States every ten years. State and sub-state counts matter for apportioning the House of Representatives, allocating Federal funds, congressional redistricting, urban planning, and so forth.
The Census makes two kinds of errors: gross omissions (GOs) and erroneous enumerations (EEs). A GO is a failure to count a person in the block containing his or her usual place of residence as of Census Day; an EE results from counting a non-existent person, counting someone in the wrong block, or counting someone twice. Generally, GOs slightly exceed EEs; the difference is a net undercount, which is uneven both demographically and geographically. For example, the 1990 census missed blacks about four times as often as non-blacks.
The Census Bureau tried to adjust the 1980 and 1990 censuses using "dual system estimation" (DSE). DSE involves taking a sample of blocks after the census, enumerating the residents of those blocks (the Post-Enumeration Survey, or PES), trying to match PES records to census records, inferring the undercount within demographic groups, and extrapolating to all blocks in the country, on the assumption that undercount rates are constant across geography within those demographic groups. The keys to DSE are matching PES records to census records accurately and the constancy assumption, not counting better. In 1980, the Bureau did not adjust, because missing PES data made the uncertainty of the adjustment too large. In 1990, the Bureau sought to adjust, but Secretary of Commerce Mosbacher overruled the Bureau, finding its technical justification inadequate. The Clinton Administration plans to adjust the 2000 census. The 2000 version of DSE, the sampling and estimation procedure, is called Accuracy and Coverage Evaluation (ACE). The sampling design for ACE is described in more detail below, but here is an overview.
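The arithmetic at the heart of DSE is the same capture-recapture idea: within a demographic group, the census plays the role of the first catch and the PES the second. Here is a minimal sketch of that arithmetic with hypothetical counts; the Bureau's actual procedure involves many additional steps and corrections:

```python
# Minimal sketch of the dual-system arithmetic for one demographic group.
# All counts are hypothetical; the real procedure has many refinements
# (e.g., removing EEs before matching).
census_count = 9_000   # census records for the group in the PES blocks
pes_count = 1_000      # people the PES found in those blocks
matches = 900          # PES records that match a census record

# Capture-recapture logic: matches/pes_count estimates the fraction of
# the group the census counted, so the group size is estimated by
N_hat = census_count * pes_count / matches
print(f"dual-system estimate of group size: {N_hat:.0f}")  # 10000 here
```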
The population from which the Bureau of the Census ideally would like to sample is the population of U.S. residents as of Census Day. Because there is no list of U.S. residents, that cannot be done directly. What the Census Bureau does, in sketch, is to take a random sample of blocks of different types (urban, rural, Indian Reservation, … ), then try to list all the housing units in those blocks, then try to list all the occupants of those housing units. This is a multistage stratified cluster sample. The frame for sampling blocks is a well-defined list of blocks within the U.S., but the frame for sampling individuals is not well defined: it involves hypothetical counterfactuals, such as the lists of people that would have been compiled had blocks not in the sample in fact been in the sample. If the list of blocks omits any blocks where people live, the implicit frame for people does not contain the entire population.
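In code, the block-sampling stage of such a design might look like the following sketch; the strata, block counts, and sample sizes are invented for illustration and are not the Bureau's:

```python
# Sketch of the first stage of a multistage stratified cluster sample of
# blocks; the strata, block lists, and sample sizes are all hypothetical.
import random

random.seed(0)
strata = {
    "urban": [f"urban-{i}" for i in range(4000)],
    "rural": [f"rural-{i}" for i in range(2500)],
    "reservation": [f"reservation-{i}" for i in range(500)],
}
sizes = {"urban": 40, "rural": 25, "reservation": 5}

# Stage 1: a simple random sample of blocks within each stratum.
sampled = {s: random.sample(blocks, sizes[s]) for s, blocks in strata.items()}

# Later stages would list the housing units in each sampled block, then
# the occupants of each housing unit; only sampled blocks ever get listed,
# which is why the frame for individuals is only implicit.
for s, blocks in sampled.items():
    print(s, len(blocks), blocks[:2])
```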