Philip B. Stark, 1 March 2010 (last updated 27 March 2010)
Post-election audits count the votes in randomly selected batches of ballots by hand. They compare the result of the hand counts to the final-but-for-the-audit results (the apparent results) for those same batches. The information an audit provides about the rate of vote-tabulation errors—and hence about the accuracy of the vote-tabulation system and the electoral outcome—depends on the sizes of the batches from which the audit sample is drawn. Smaller batches give more information for the same amount of hand counting.
When the apparent outcome of an election is indeed correct, the amount of hand counting required to confirm it statistically is roughly proportional to the size of the auditable batches of ballots. For instance, if batches all consist of 1,000 ballots, the number of ballots that need to be counted by hand to confirm the electoral outcome is about 1,000 times larger than if each of those batches were subdivided into 1,000 batches consisting of a single ballot each. This note gives two heuristic explanations for the relative efficiency of smaller batches: estimating the number of coconut jelly beans in 25lbs of assorted jelly beans and estimating the amount of salt in 1,200 ounces of soup stock.
Post-election audits serve many purposes, including quality control and improvement, fraud deterrence and detection, and verifying electoral outcomes. The goal of a risk-limiting post-election audit is to ensure statistically that the final, official electoral outcome of a contest is correct; that is, a probabilistic assurance that the official outcome is the same outcome that a full hand count of the audit trail would show. (The electoral outcome that a full hand count of the audit trail would show is, by definition, the correct outcome.)
I use the term apparent outcome to mean the electoral outcome that will become official unless the audit intervenes. It is the electoral outcome—the set of winners—after all the votes have been tabulated, including votes cast in person, votes cast by mail, and votes cast provisionally. If the audit gives strong statistical evidence that the apparent outcome is right, the hand counting can stop and the apparent outcome is reported as the final official outcome. If not, the counting continues until either there is strong statistical evidence that the apparent outcome is right, or until the entire audit trail has been counted by hand. If the entire trail is counted by hand, the correct electoral outcome is then known and can be reported as the final official outcome.
An auditing procedure is risk-limiting if it has a known minimum chance of progressing to a full hand count whenever the outcome is wrong, no matter what caused the outcome to be wrong. The risk is the maximum chance that the audit does not progress to a full hand count when the apparent outcome is in fact wrong. (The maximum is over all ways in which the outcome might be wrong, including the possibility of programming errors, voter errors, and deliberate fraud.) If a risk-limiting audit stops without a full hand count, then either the electoral outcome is in fact correct, or something very unlikely occurred—something that has a chance no larger than the risk limit.
If the apparent outcome is wrong, the true margin of some apparent winner over some apparent loser is actually negative, even though the apparent margin is positive. That could occur, for example, if enough ballots were interpreted by the vote tabulation system as overvotes or votes for an apparent winner when a hand inspection would show that they were cast for an apparent loser.
Of course, there might be offsetting differences that deflated the apparent margin compared to the margin that a full hand count would show. For instance, the original tabulation might find an undervote when a hand inspection of the same ballot would show a vote for an apparent winner. For the apparent outcome to be wrong, the differences that inflated the apparent margin net of the differences that deflated the apparent margin must be bigger than the apparent margin. (In contests with more than two candidates, there's some additional bookkeeping, but the idea is the same.)
A risk-limiting audit assesses whether the true margin is positive; that is, whether the apparent winners are the real winners. If the hand count of the audit trail in the batches selected at random for audit gives strong statistical evidence that the true margin is positive, that constitutes evidence that the apparent winner really won. Then the audit can stop. If not, hand counting continues. Eventually, either there is strong statistical evidence that the true margin is positive, or there has been a full hand count, which reveals the true electoral outcome.
The strength of the evidence that hand counting a random sample of ballots gives about the true margin depends on many things, including the apparent margin, the discrepancies between the hand count and the apparent subtotals for the batches in the sample, the number of ballots in the sample, the sizes of the batches from which the sample is drawn, the way the random sample is drawn (the sampling design), and other variables. This note concentrates on the effect of the sizes of the batches.
Post-election audits are generally based on drawing random samples of batches of ballots (or auditable records, such as voter-verifiable paper audit trails, VVPATs). Before drawing the audit sample, all the ballots in the contest are divided into batches, and vote subtotals for each batch are determined and reported (or, if not reported, committed to irrevocably; see below). The way batches are defined matters: The amount of evidence the hand count gives depends not only on how many ballots are counted but also on how those ballots were selected.
For instance, suppose that 50,000 ballots were cast in a contest, 500 ballots in each of 100 precincts. Consider three ways of drawing 500 ballots to tally by hand:

1. Draw one of the 100 precincts at random and hand count all 500 ballots cast in that precinct.
2. Divide each precinct into 10 batches of 50 ballots each, then draw 10 of the resulting 1,000 batches at random without replacement and hand count those 500 ballots.
3. Draw 500 individual ballots at random without replacement from all 50,000 (a simple random sample).
The first approach gives much less reliable information about the vote tabulation errors in the contest as a whole than the last approach does. The second is intermediate. I hope that the two food examples below help explain why. For more information about sampling, see the relevant chapters of SticiGui.
In current election audits, a batch typically consists of all the ballots cast in a given precinct. That is convenient because vote tabulation systems are designed to report subtotals for precincts. However, that means batches can contain 1,000 ballots or more. Audits could be more effective with far less hand counting if the ballots were divided into smaller batches. Ideally, each ballot is a batch of its own, as in the third approach above; this is called single-ballot auditing. Current vote tabulation systems cannot report their interpretation of individual ballots, which is a prerequisite for single-ballot auditing. (However, see Stark, P.B., 2009. Efficient post-election audits of multiple contests: 2009 California tests. Refereed paper presented at the 2009 Conference on Empirical Legal Studies. http://ssrn.com/abstract=1443314.)
There are 100 4oz bags of various flavors of jelly beans—25lbs in all. Some bags have assorted flavors, some only a single flavor. All jelly beans weigh essentially the same. I love coconut jelly beans, and want to estimate how many there are among the 25lbs. Consider two approaches:

1. Pour all 100 bags into a large container, mix the jelly beans thoroughly, scoop out 4oz without looking, count the coconut jelly beans in the scoop, and multiply by 100.
2. Pick one of the 100 bags at random, count the coconut jelly beans in that bag, and multiply by 100.
Both estimates are statistically unbiased, but the first has much lower variability. (Statistical bias is explained below; also see SticiGui.) Mixing disperses the coconut jelly beans pretty evenly throughout the pot. The sample is likely to contain coconut jelly beans in roughly the same proportion as the 100 bags do overall, so multiplying the number in the sample by 100 gives a reasonably reliable estimate of the total.
In contrast, a bag selected at random could easily contain only coconut jelly beans (if any of the bags has only coconut) or no coconut jelly beans (if any of the bags has none). Since the bags can have quite different fractions of coconut jelly beans, a 4oz bag selected the second way is quite likely to contain coconut jelly beans in a proportion quite different from the overall proportion of coconut jelly beans, so multiplying that number by 100 could easily be far from the total number of coconut jelly beans among the 100 bags.
Even though both procedures sample 4oz of jelly beans "at random," they do not give equally reliable estimates of the total number of coconut jelly beans among the 25lbs. For the first approach, we can get a reliable estimate of the number. But the second method is unreliable. To get a reliable estimate using the second approach—that is, counting the coconut jelly beans in randomly selected bags—we would need to look at quite a few bags (quite a few clusters, see below), not just one. It's more efficient to mix the beans before selecting 4oz. Then 4oz suffices to get a reasonably reliable estimate.
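The gap in reliability between the two approaches can be quantified. The following is a minimal sketch in Python; the specific counts are invented for illustration (assume a 4oz bag holds about 100 jelly beans, ten bags are all coconut, and the rest contain none), and the variable names are my own:

```python
from statistics import pvariance

# Invented numbers: 100 bags of 100 beans each (about 4oz per bag);
# ten bags are all coconut, the rest contain none, so there are
# 1,000 coconut jelly beans in all.
bag_counts = [100] * 10 + [0] * 90
N, K, n = 10_000, 1_000, 100   # beans in all, coconut beans, scoop size

# Approach 1: mix everything, scoop out n beans, multiply the coconut
# count by 100.  The count is hypergeometric, so the estimate's
# standard error is 100 times the hypergeometric standard deviation.
var_count = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
se_mixed = 100 * var_count ** 0.5

# Approach 2: pick one bag at random and multiply its count by 100.
se_bag = 100 * pvariance(bag_counts) ** 0.5

print(se_mixed)   # ≈ 299: estimates cluster near the true total of 1,000
print(se_bag)     # 3000.0: estimates are either 0 or 10,000
```

Both estimators are centered on the true total of 1,000 coconut jelly beans, but under this allocation the bag-at-a-time estimate is ten times more variable than the mixed-scoop estimate.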
We have 100 12-ounce cans of stock, of a variety of brands, styles, and types: chicken, beef, vegetable, low salt, regular, etc. We want to know how much salt there is in all 1,200 ounces of stock as a whole. The assay to measure salt ruins the portion of the stock that is tested: The more you test, the less there is to eat. Consider two approaches:

1. Pour all 100 cans into a cauldron, stir well, draw a single tablespoon (1/2 ounce), assay the salt in the tablespoon, and multiply by 2,400.
2. Pick one of the 100 cans at random, assay the salt in that can, and multiply by 100.
Both estimates are statistically unbiased. However, the first estimate has much lower variability: That single tablespoon is extremely likely to have some stock from all 100 cans. The salt is likely to be spread out quite evenly through all the stock in the cauldron; it is very unlikely that the tablespoon will consist almost entirely of salt or almost entirely of water. Rather, the tablespoon is likely to contain salt in roughly the same concentration as the 100 cans do on the whole.
In contrast, a can selected the second way can be quite likely to contain salt in a concentration quite different from the 1,200 ounces of stock as a whole, unless all the cans have nearly identical concentrations of salt.
For the first approach, we can get a reliable estimate of the total salt from a single tablespoon (1/2 ounce) of stock. But for the second approach, even 12 ounces of stock is not enough to get a reliable estimate. (A tablespoon from the selected can would suffice to determine the salt in that can accurately, if the contents of the can have been stirred well. But even the exact amount of salt in the can does not give a reliable estimate of the salt in the 100 cans as a whole, because there is no mixing across cans.) The first approach gives a more reliable result at lower "cost": It spoils less stock.
To get a reliable estimate using the second approach—that is, without mixing the stock from different cans together—we would need to assay quite a few cans selected at random (quite a few clusters, see below). A single can is not enough, even though it contains 24 tablespoons of stock—far more than we need in the first approach. Sampling many randomly selected cans would amount to mixing the stock after the fact, but even then, it isn't mixed as well as it is in the first approach. It's more efficient and cheaper to mix the stock before selecting the sample. Then a single tablespoon suffices to get an accurate estimate.
A vote-tabulation error that overstated the apparent margin is like a coconut jelly bean or a fixed quantity of salt. A precinct or other audit batch is like a bag of jelly beans or a can of stock. Drawing the audit sample is like selecting a bag or a 4oz scoop of jelly beans or a tablespoon or can of stock. (An important difference between auditing and these food examples is that in elections there can be differences that decrease the apparent margin: What matters for verifying election outcomes is the net increase of the apparent margin over the margin that a full hand count would show, after differences that decreased the apparent margin are subtracted. Both differences that increase the apparent margin and differences that decrease the apparent margin matter for quality control and fraud detection.) Counting ballots by hand has a cost: The more you have to count, the greater the cost. Hence, you want to count as few ballots as possible as long as you can still determine whether the electoral outcome is correct. Similarly, counting jelly beans or assaying the salt in the soup has a cost.
In these two food examples, the first approach is like single-ballot or small-batch auditing. All the ballots are mixed together well, and we draw enough ballots to get a good idea of the net inflation of the margin on average across geography, voting methods, etc. Mixing the stock or the jelly beans is like mixing the ballots in a huge vat, then reaching in and selecting some at random. The resulting sample of ballots is more likely to show about the same net rate of differences that increased the apparent margin as there are in the contest as a whole, compared to the net rate of differences in a sample that consists of the same number of ballots drawn as whole precincts.
In both examples, the second approach is like auditing using precincts or other large batches of ballots. There could be large concentrations of differences that increased the apparent margin in a small number of batches, because there is no "mixing" across batches. A single batch drawn using the second approach doesn't tell us much about the overall rate of differences that inflated the margin in the contest, no matter how large the batch is (within reason). This approach is like rubber-banding big bundles of ballots together, mixing the collection of bundles, and selecting bundles at random, with no movement of ballots across bundles. To compensate for the lack of mixing across batches of ballots, we need to look at a lot of batches, just as we need to count the coconut jelly beans in many bags or assay many cans of soup if we don't mix their contents together across clusters before drawing the sample. The first method is much more efficient.
Suppose we have 50,000 ballots in all, 500 ballots cast in each of 100 precincts. Among them, 1,000 (i.e., 2%) have been interpreted incorrectly by the tabulator; the tabulator counted them for the apparent, unconfirmed winner but a manual count would show them to be for the apparent loser. The other 49,000 ballots were tallied correctly.
Suppose we will draw a random sample of 500 ballots to count by hand. We have the ability to check the machine subtotal for those 500 ballots against a manual subtotal. If any of the misinterpreted ballots is among the 500, the machine and hand counts will not match, so we will know that at least one ballot has been misinterpreted. We will consider the three sampling schemes mentioned above, all of which select 500 ballots: drawing a precinct at random, drawing 10 batches of 50 ballots at random without replacement, and drawing 500 individual ballots at random without replacement (a simple random sample).
The next two subsections address two questions:

1. What is the chance that the sample contains at least one misinterpreted ballot, so that the machine and hand counts do not match?
2. How likely is the percentage of misinterpreted ballots in the sample to be at least half the percentage of misinterpreted ballots in the contest as a whole?
We will calculate the chance that the machine and hand counts do not match for three ways of drawing the sample: (i) drawing a single batch of 500 at random, that is, drawing a precinct's worth of ballots, (ii) dividing the precincts into 10 batches of 50 ballots each and drawing 10 batches at random without replacement from the resulting 1,000 batches, and (iii) drawing a simple random sample of 500 ballots.
For method (iii), the chance does not depend on how the misinterpreted ballots are spread across precincts: It is about 99.996%, no matter what. But for methods (i) and (ii), the chance does depend on how many incorrectly interpreted ballots there are in each batch. To illustrate that dependence, we will calculate the chance for several ways of spreading the misinterpreted ballots across batches. For simplicity, assume that when a precinct is divided into 10 batches, the number of misinterpreted ballots in each of those 10 batches is the same. For instance, if the precinct has 20 misinterpreted ballots, each of the 10 batches has 2 misinterpreted ballots.
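For readers who want to check the figures, these detection chances are hypergeometric tail probabilities. The following is a minimal sketch in Python; the function name is mine:

```python
from math import comb

def p_detect(population, bad, sample):
    """Chance that a sample drawn without replacement contains at
    least one of the `bad` units (a hypergeometric tail probability)."""
    good = population - bad
    if good < sample:
        return 1.0
    return 1 - comb(good, sample) / comb(population, sample)

# (iii) simple random sample: 500 of 50,000 ballots, 1,000 misinterpreted.
print(round(p_detect(50_000, 1_000, 500), 5))   # 0.99996, however the errors are spread

# (i) one precinct of 500: 100 precincts, errors concentrated in 2 of them.
print(p_detect(100, 2, 1))                      # 0.02

# (ii) 10 batches of 50: 1,000 batches; the 2 bad precincts = 20 bad batches.
print(round(p_detect(1_000, 20, 10), 3))        # 0.184
```

For methods (i) and (ii), the population and bad counts are batches rather than ballots, since any batch containing a misinterpreted ballot produces a mismatch.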
|how the 1,000 misinterpreted ballots are spread|randomly selected precinct of 500|10 randomly selected batches of 50|simple random sample|
|10 in every precinct|100%|100%|99.996%|
|10 in 98 precincts, 20 in 1 precinct|99%|≈100%|99.996%|
|20 in 50 precincts|50%|99.9%|99.996%|
|250 in 4 precincts|4%|33.6%|99.996%|
|500 in 2 precincts|2%|18.4%|99.996%|
Phrased differently, the "confidence" (a slight abuse of the statistical term) that no more than 2% of the 50,000 ballots were misinterpreted if none of the ballots in the sample were misinterpreted is as follows:
|sampling method|randomly selected precinct of 500|10 randomly selected batches of 50|simple random sample|
|"confidence"|2%|18.4%|99.996%|
Even though 500 randomly selected ballots are counted by hand in every case, the probability of finding a misinterpreted ballot varies enormously. In the case most favorable to precinct-based sampling, hand counting a single randomly selected precinct is guaranteed to find a misinterpreted ballot (10, in fact). But the chance falls quickly as the misinterpreted ballots are concentrated into fewer precincts. In the case least favorable to precinct-based sampling, the chance is only 2% for a randomly selected precinct and 18.4% for 10 randomly selected batches of 50—but remains 99.996% for simple random sampling. The smaller the batches, the greater the minimum chance the sample will show that at least one ballot was misinterpreted.
If the misinterpretations were caused by equipment failure in the precinct, that might be expected to concentrate errors in only a few precincts. If the misinterpretations occurred because pollworkers accidentally provided voters with pens of the wrong color ink to mark the ballots, that might also be expected to concentrate errors in only a few precincts. If a fraudster were trying to manipulate the outcome, he or she might target the ballots in only a few precincts, either to avoid detection or for logistical simplicity. In these three hypotheticals, if the sample is drawn by selecting an entire precinct it could easily be squeaky clean. But with the same counting effort, the chance of finding at least one error if the 500 ballots are drawn as a simple random sample remains extremely high, 99.996%, whether the misinterpreted ballots are concentrated in only a few precincts or spread throughout all 100.
While efficient risk-limiting auditing methods use the data in more complicated ways than simply asking "is there any error at all in the sample?," the amount of information the sample carries about the total number of errors depends strongly on how the sample is drawn. The smaller the clusters are, the harder it is to hide error—even though there are ways of scattering errors that make it easy for all three sampling methods to find at least one error.
The percentage of misinterpreted ballots in the sample is not necessarily a reliable or accurate estimate of the percentage of misinterpreted ballots in the contest. The previous subsection shows that the chance of finding even a single misinterpreted ballot can be quite low when the percentage of misinterpreted ballots in the contest is 2%. When the sample doesn't find any error, the percentage of misinterpreted ballots in the sample is zero: The percentage in the sample underestimates the percentage in the contest by 100%.
But even when the sample does find some misinterpreted ballots, the percentage of such ballots in the sample can be much lower than the percentage in the contest as a whole. When that happens, we might conclude erroneously that the outcome of the contest is right when in fact it is wrong.
How likely is the percentage of misinterpreted ballots in the sample to be at least half the percentage of misinterpreted ballots in the contest as a whole? That is, what is the chance that the percentage of misinterpreted ballots in the sample is at least 1% when the percentage in the contest as a whole is 2%? The following table gives the answer for the same set of scenarios.
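For the simple-random-sample case, this chance is again a hypergeometric tail probability, now for five or more misinterpreted ballots in the sample. The following is a minimal sketch in Python; the function name is mine:

```python
from math import comb

def p_at_least(population, bad, sample, k):
    """Hypergeometric chance that a simple random sample of `sample`
    units contains at least k of the `bad` units."""
    total = comb(population, sample)
    hits = sum(comb(bad, j) * comb(population - bad, sample - j)
               for j in range(k, min(bad, sample) + 1))
    return hits / total

# Chance that a simple random sample of 500 of the 50,000 ballots
# contains at least 5 misinterpreted ballots (1% of the sample) when
# 1,000 ballots (2%) are misinterpreted:
print(round(p_at_least(50_000, 1_000, 500, 5), 3))   # 0.972
```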
|how the 1,000 misinterpreted ballots are spread|randomly selected precinct of 500|10 randomly selected batches of 50|simple random sample|
|10 in every precinct|100%|100%|97.2%|
|10 in 98 precincts, 20 in 1 precinct|99%|≈100%|97.2%|
|20 in 50 precincts|50%|62.4%|97.2%|
|250 in 4 precincts|4%|5.7%|97.2%|
|500 in 2 precincts|2%|18.4%|97.2%|
Phrased differently, the "confidence" (a slight abuse of the statistical term) that no more than 2% of the 50,000 ballots were misinterpreted if 1% of the ballots in the sample were misinterpreted is as follows:
|sampling method|randomly selected precinct of 500|10 randomly selected batches of 50|simple random sample|
|"confidence"|2%|18.4%|97.2%|
As before, even though 500 randomly selected ballots are counted by hand in every case, the probabilities vary widely. In the case most favorable to precinct-based sampling, hand counting a single randomly selected precinct is guaranteed to reveal that at least 1% of the ballots were misinterpreted (in fact, it will show that 2% were). But the chance falls quickly as the misinterpreted ballots are concentrated into fewer precincts. In the case least favorable to precinct-based sampling, the chance is only 2% for a randomly selected precinct and 18.4% for 10 randomly selected batches of 50—but remains 97.2% for simple random sampling. Using smaller batches increases the chance that the percentage of misinterpreted ballots in the sample will be close to the percentage of misinterpreted ballots in the contest as a whole. Smaller batches yield more reliable estimates.
We start with some statistical terminology. There is a population we want to study. Population is a term of art. It need not consist of people. In election auditing, the population is the collection of ballots cast in the contest (or the auditable records corresponding to the ballots, such as VVPATs). In the food examples above, one population is 1,200 ounces of soup stock; the other is 25lbs of jelly beans.
For illustration, suppose that in the contest in question, 500 ballots were cast in each of 100 precincts, 50,000 ballots in all. The population will consist of these 50,000 ballots.
There is some property of the population we are interested in. That property is called a parameter. In the case of election auditing, the parameter is the net inflation of the apparent margin compared to the margin a full hand count would show. In the food examples, one parameter is the total number of coconut jelly beans; the other is the total amount of salt in the stock.
We want to learn about the parameter without examining every member of the population. Instead, we will look at a subset of the population, called a sample. Samples can be drawn in countless ways. We will consider two: simple random samples and random cluster samples.
A simple random sample is one in which every subset (of a predetermined size) of the population is equally likely to be drawn. Simple random samples are drawn "without replacement." For instance, a simple random sample of 4oz of jelly beans is one in which every 4oz is equally likely to be drawn. Such a sample can be drawn by mixing the jelly beans together really well, then reaching in without looking and scooping out 4oz. A simple random sample of one tablespoon (0.5 ounces) of soup stock can be drawn by putting all the stock in a big cauldron, stirring it well, then dipping in a tablespoon.
A simple random sample of 500 ballots can be drawn from a set of 50,000 ballots by putting the ballots in a huge basket, stirring them really well, and drawing 500 without looking. (That turns out to be a terrible way to try to draw a simple random sample, because it's really hard to stir ballots. A better way is to put the ballots in some order, make a list of 50,000 random numbers, and take the sample to be the ballots corresponding to the 500 largest random numbers. For instance, if the 17th random number is the biggest, the 17th ballot would be in the sample.)
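The random-number scheme just described can be sketched in a few lines of Python; the function name and the use of a seeded generator are my own choices for illustration:

```python
import random

def simple_random_sample(num_ballots, sample_size, seed=None):
    """Assign each ballot position a random number and take the ballots
    with the `sample_size` largest numbers, as described above."""
    rng = random.Random(seed)
    keyed = [(rng.random(), i) for i in range(num_ballots)]
    keyed.sort(reverse=True)            # largest random numbers first
    return sorted(i for _, i in keyed[:sample_size])

sample = simple_random_sample(50_000, 500, seed=1)
print(len(sample), len(set(sample)))   # 500 distinct ballot positions
```

In a real audit the random numbers would come from a public, verifiable source (for example, dice rolled in public) rather than a software generator.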
A cluster sample is one in which the population is partitioned into non-overlapping groups, called clusters; then a cluster (or a predetermined number of clusters) is drawn at random. A cluster sample of 4oz of jelly beans can be drawn by dividing the beans into 100 4oz bags, then picking one or more of those bags at random. A cluster sample of 12oz of soup stock can be drawn by dividing the soup into 100 12oz cans, then picking one or more cans at random. A cluster sample of 500 ballots could be drawn by picking one of the 100 precincts at random, and taking the sample to be the 500 ballots cast in that precinct. A simple random sample is the same as a cluster sample using clusters of size one.
If we estimate the total salt by stirring all the stock together, drawing a tablespoon at random, and multiplying the salt in the tablespoon by 2,400, that estimate is likely to be off by some amount. If we estimate the total salt in the stock by selecting a can at random and multiplying the amount of salt in the can by 100, the estimate is also likely to be off by some amount. The amount by which the estimate is off is called sampling error. The sampling error will tend to be much smaller in the first case, where the cans are mixed together before the sample is drawn, even though the sample is much smaller (there is 1/24 as much stock in a 0.5oz tablespoon as in a 12oz can).
Similarly, if we estimate the net inflation of the margin in the contest to be 100 times the net inflation of the margin in a sample of 500 ballots, that estimate is likely to be off by some amount, the sampling error, owing to the luck of the draw. The sampling error will tend to be smaller if the 500 ballots are a simple random sample than if they are a cluster sample consisting of all the ballots in a single precinct selected at random.
Sampling error tends to be smaller on average for simple random samples than for cluster samples. (There are exceptions, depending on how the clusters are formed. If the clusters are themselves random samples from the population and a single cluster is drawn, there's no difference between a cluster sample and a simple random sample. If the clusters are constructed so that they exactly match the population, the sampling error will be smaller for a cluster sample than for a simple random sample.) So, for instance, 100 times the net inflation of the margin in a simple random sample of 500 ballots typically tends to be closer to the total net inflation of the margin for the contest than 100 times the net inflation in a cluster sample of 500 ballots will tend to be.
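A small simulation illustrates the difference in sampling error. The setup below is invented for illustration: it uses the worst-case allocation from the tables above, with all 1,000 misinterpreted ballots concentrated in 2 of the 100 precincts, and for simplicity treats each misinterpreted ballot as contributing one unit of margin inflation:

```python
import random
import statistics

rng = random.Random(0)

# Worst-case allocation from the tables above: 100 precincts of 500
# ballots, with all 1,000 misinterpreted ballots in just 2 precincts.
precincts = [[1] * 500] * 2 + [[0] * 500] * 98   # 1 = misinterpreted
ballots = [b for p in precincts for b in p]

def cluster_estimate():
    """Draw one precinct at random; scale its error count up by 100."""
    return sum(rng.choice(precincts)) * 100

def srs_estimate():
    """Draw a simple random sample of 500 ballots; scale up by 100."""
    return sum(rng.sample(ballots, 500)) * 100

cluster = [cluster_estimate() for _ in range(2_000)]
srs = [srs_estimate() for _ in range(2_000)]

# Both estimators are unbiased -- the averages sit near the true 1,000 --
# but each cluster estimate is either 0 or 50,000, while the simple
# random sample estimates stay within a few hundred of the truth.
print(statistics.mean(cluster), statistics.stdev(cluster))
print(statistics.mean(srs), statistics.stdev(srs))
```

The spread of the cluster estimates is more than an order of magnitude larger than the spread of the simple-random-sample estimates, even though both hand count 500 ballots per draw.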
In estimating parameters from samples, in addition to sampling error there is generally statistical bias, also called systematic error or non-sampling error. In the examples given here, the statistical bias is zero. That is because in these examples, every member of the population has the same chance of being selected for the sample: the expected value of the sample mean is then the population mean (see SticiGui for more explanation). If we repeated the procedure over and over, selecting one can or one tablespoon of stock at random, determining the amount of salt in that sample, and multiplying the result by 100 or 2,400, the average of those results would tend to get closer and closer to the total amount of salt in the 100 cans. However, the individual estimates based on the tablespoon drawn from well stirred stock would tend to be much closer to the truth than the individual estimates based on drawing a can at random would tend to be.
The same is true for election auditing: Drawing a simple random sample of 500 individual ballots and multiplying the net inflation of the margin in that sample by 100 will give a number that tends to be much closer to the net inflation of the margin in the whole contest than drawing a precinct of 500 ballots and multiplying the net inflation of the margin in that precinct by 100.
These examples contrast a cluster sample with a simple random sample, which is the extreme case of a cluster sample: clusters of size one. Intermediate cluster sizes give results with intermediate reliability. The inefficiency of cluster samples comes from the fact that the random sampling doesn't "mix" across the boundaries of clusters. The smaller the clusters are, the less that matters.
The smaller the clusters, the closer a cluster sample is to a simple random sample containing the same fraction of the population. For instance, imagine dividing the 50,000 ballots into 1,000 clusters of 50 ballots each instead of 100 clusters of 500 ballots each. Suppose 10 50-ballot clusters were drawn at random, and the net inflation of the margin for those 500 ballots were multiplied by 100. The result would be a more reliable estimate of the total net inflation of the margin in the contest than we would get from a single cluster of 500 ballots. But it still would be a less reliable estimate than we would get from a simple random sample of 500 ballots.
Reducing cluster size gives more information about the difference between the apparent margin and the margin a full hand count would show, for the same counting effort.
To my knowledge, there has been only one risk-limiting audit using clusters of size one, that is, a single-ballot risk-limiting audit. It was conducted in Yolo County, California, in November 2009. See Stark, P.B., 2009. Efficient post-election audits of multiple contests: 2009 California tests. Refereed paper presented at the 2009 Conference on Empirical Legal Studies. (preprint: http://ssrn.com/abstract=1443314) The biggest obstacle to conducting single-ballot audits or small-batch audits is the design of current commercial vote tabulation systems. They do not provide a record of the machine interpretation of individual ballots or small batches of ballots suitable for auditing. If vote tabulation systems were designed with single-ballot or small-batch auditing in mind, the cost savings to jurisdictions that wish to perform risk-limiting audits would be enormous.
With single-ballot audits, special precautions need to be taken to ensure voter privacy and discourage the buying and selling of votes. There are many ways this can be accomplished. For instance, rather than publish the interpretation of individual ballots, the interpretation could be committed to by transmitting a digitally signed file to the Secretary of State; only precinct-level subtotals would be reported to the public. Alternatively, cryptographic commitments might be used. Or the reporting system could dissociate votes in different contests on the same ballot, giving each (contest, ballot) pair a randomly generated but unique identifier, so that the physical ballot can be retrieved and checked, but no one without access to the physical ballots would know the pattern of votes on an individual ballot.
Acknowledgments. I am grateful to Mark Lindeman, Joseph Lorenzo Hall, Mike Higgins, and John McCarthy for encouragement and helpful comments.
© 2010 P.B. Stark. Last modified 27 March 2010.