Philip B. Stark, 19 October 2009. (Last edited 20 January 2015.)
This document sketches some of my thoughts about election auditing. My technical papers on the topic are cited below.
Here, I address the prerequisites for audits, what good audit legislation might look like, logistical barriers to efficient audits, structural changes to election procedures and software required to support efficient audits, and next steps to make efficient risk-limiting audits widely practical. I conclude with a summary, references on risk-limiting audits, and references on bugs in Excel's statistical routines.
Post-election audits can serve a variety of roles, including process monitoring, quality improvement, fraud deterrence, and bolstering public confidence. I am most interested in using auditing to check whether election outcomes are correct—to limit the risk that an incorrect preliminary outcome is certified.
Financial auditors distinguish between compliance audits and materiality audits. Compliance audits determine how well policies and procedures have been followed. Materiality audits determine whether errors amount to much money. In election auditing, compliance audits are important for ensuring the accuracy of elections … but I know of no jurisdiction that routinely performs compliance audits. The appropriate notion of materiality for election audits is whether errors—whatever their source— caused the wrong candidate to appear to win. There is no jurisdiction that performs routine materiality audits, but election integrity advocates, legislators, elections officials, and scientists are now expressing interest, and there have been six tests of materiality audits in California. In a materiality audit, it does not matter what caused the errors nor whether the errors can be "explained." What matters is whether they changed the electoral outcome.
Some audits, such as California's "1%" audit, don't quite fit in either category. California's 1% audit is closer to manufacturing process monitoring than to a financial audit. It is not a compliance audit because it does not determine whether correct procedures were followed. Compliance audits should answer questions such as: Was chain of custody maintained and documented properly? Were voted ballots or election equipment ever taken home by a pollworker? Were all ballots accounted for? Were all security seals intact? Was the required minimum number of people present whenever voted ballots were handled?
While California's 1% audit does measure the accuracy of the counts, the precision of the accuracy measurement varies by contest, and generally is too low to determine whether the reported outcome of a contest is in doubt—especially if the contest is small or has a small margin. Hence, the California 1% audit is not quite a materiality audit. The rest of this document is about using audits to check electoral outcomes: materiality audits.
The notion of a batch of ballots is crucial for post-election auditing. In this document, a batch of ballots is a group of ballots or voter-verified paper records (i) for which subtotals of the votes are reported (or committed to) by the Vote Tabulation System, and (ii) that can be retrieved and counted manually. For instance, a batch might consist of the ballots cast in person in a precinct, or a subset of the ballots cast in person in a precinct, if the vote subtotals for the subset are reported and if there is a way to separate those ballots from the rest for hand counting. Alternatively, a batch might consist of a "deck" of vote-by-mail (VBM) ballots containing ballots from several precincts run through a scanner as a group. In some auditing schemes, a single ballot is a batch; such single ballot auditing methods require special equipment such as printer-scanners that can print a unique identifier on a ballot while scanning the ballot. This document focuses on auditing methods that can be "bolted on" to any voting system that produces an audit trail.
Until 2007, most technical work on post-election auditing focused on the following question: Suppose we are drawing a random sample of batches of ballots to audit and that the outcome of the contest is wrong. How many batches do we need to examine to ensure that there is a high chance that the sample will reveal one or more counting errors?
Audits of voter-marked paper ballots almost always find errors at the rate of a modest fraction of a percent, owing largely to the fact that a human tallier can interpret voter intent better than an optical scanner can—especially because voters do not always follow the instructions for marking ballots. (If the margin is small, this background rate of "benign" error could have changed the apparent outcome.) Because it is easy to find errors, the auditing problem is not how big a sample to take to find error. The problem is whether to confirm the outcome in light of the error the audit finds.
Hence, a more important question for election auditing is this: Given the way the sample was drawn and the errors found in the sample, how strong is the evidence that the outcome is correct? If the evidence is weak, counting should continue, either until the evidence is strong or until all the ballots have been counted by hand. That approach makes it possible to limit the risk of certifying an election outcome that disagrees with the outcome a full hand count would show. The outcome that a full hand count would show is generally the legal touchstone: By definition, it is the "correct" outcome.
Post-election audits require a number of things: vote subtotals reported at the batch level in a usable form, a complete and accurate audit trail, secure custody of that audit trail, accurate hand counting, and a sound method for selecting batches at random.
The next sections discuss each of these requirements.
Election management systems and vote tabulation systems, which I shall refer to collectively as EMSs, currently do not support reporting batch-level results in a machine-readable, structured format useful for auditing. Generally, they do store that information internally, and can print reports containing the requisite information. But editing or transcribing the batch-level results to design and conduct audits is time-consuming and error prone. Indeed, it was the largest burden in some of the risk-limiting audits we conducted in California in 2008.
Small batch size is crucial to the efficiency of audits. EMSs should be able to store and report vote subtotals for arbitrarily small batches. Moreover, state laws should avoid requiring large audit batches. For instance, California's 1% audit requires ballots cast in a precinct and vote-by-mail (VBM) ballots for that precinct to be audited together. If the law permitted them to be audited separately, it would cut batch size in half, which would increase audit efficiency markedly. This issue is discussed further below.
There is a tradeoff between batch size and voter privacy, however. If a group of voters can be matched to a group of ballots that record similar voting patterns, one can determine confidently how those voters voted. That does not depend only on the size of the reporting batches: For instance, if a batch of 1,000 ballots records 100% for one candidate, we know exactly how all 1,000 voters in that batch voted. Indeed, if a contest has a unanimous winner, we know how everyone voted. Conversely, if a batch of two ballots records 50% for one candidate and 50% for another, we do not know how either of the two voters in that batch voted. What matters for privacy is not just the batch size, it is also the variability of the votes within the batch.
I am not aware of any quantitative rule that governs the size of reporting batches to maintain voter privacy. A principled rule might set a threshold on the extra amount of information the batches give about how individuals voted. For instance, batches might be kept large enough that the error in predicting an individual's vote from the batch subtotals is at least 50% of the error in predicting an individual's vote from the contest totals alone. That amounts to aggregating batches so that the margin in a combined batch is not too different from the overall margin. A simpler rule that requires aggregating batches so that reported subtotals include at least 25 voters is more practical, although it will occasionally compromise voter privacy.
Another "solution" to the privacy problem is to separate the issues of public reporting and auditing. Audit batches can be smaller than reporting batches, provided there is a mechanism to ensure that the auditable subtotals are committed to indelibly before the audit begins. This might involve putting the subtotals in escrow, sending them to the Secretary of State, or signing them digitally and publishing them in encrypted form.
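The simpler 25-voter rule above could be implemented in many ways; here is a minimal sketch using a greedy merge of adjacent reporting batches. The batch labels and counts are hypothetical, and a real implementation would likely also consider vote variability within the merged groups, as discussed above.

```python
# Greedily merge adjacent reporting batches until each reported
# aggregate holds at least 25 ballots (the simpler privacy rule
# discussed above). Batch labels and counts are hypothetical.

MIN_BALLOTS = 25

def aggregate_for_privacy(batches, min_ballots=MIN_BALLOTS):
    """batches: list of (label, ballot_count). Returns merged groups."""
    groups, current, count = [], [], 0
    for label, n in batches:
        current.append(label)
        count += n
        if count >= min_ballots:
            groups.append((current, count))
            current, count = [], 0
    if current:  # fold any undersized remainder into the last group
        if groups:
            labels, n = groups.pop()
            groups.append((labels + current, n + count))
        else:
            groups.append((current, count))
    return groups

batches = [("deck-A", 12), ("deck-B", 18), ("deck-C", 30), ("deck-D", 8)]
for labels, n in aggregate_for_privacy(batches):
    print(labels, n)
```

Note that the audit itself could still use the original small batches, with the merged groups used only for public reporting, per the escrow mechanism described next.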
An audit can be no better than the audit trail it examines. At the moment, that means voting systems need to have a voter-verified paper trail. There is wide agreement that voter-marked paper ballots are the most secure, reliable, transparent, and auditable technology currently available—and I concur. As the California Top-to-Bottom Review (TTBR) found, the voter-verified paper audit trail (VVPAT) produced by a direct-recording electronic (DRE) voting machine is volatile—the process by which it is printed does not produce as durable a record as ink on paper. (Entire spools can be spoiled.) And VVPAT printers tend to jam. Moreover, it is much harder to determine whether VVPAT spools have been truncated or gone missing, because the amount of paper used by a voter varies with the number of times the voter changes his or her mind.
For any voting system, every contest in every election should be checked to confirm that the number of votes reported for a precinct is not larger than the number of registered voters in that precinct. Indeed, given typically low voter turnout, precincts in which turnout substantially exceeds historical values should get special scrutiny.
Physical security of the audit trail is obviously crucial. The audit trail needs to be stored in a secure location. Chain of custody needs to be tracked meticulously. Seals and signatures should be checked and breaches must be reported to the SoS and the public. Several people should be present whenever a seal is removed or replaced. And so on.
With voter-marked paper ballots, there are simple measures that can help ensure that the audit trail is complete and accurate. For instance, there can be "ballot accounting" or "ballot reconciliation" to check that the number of ballots sent to a precinct equals the number returned from the precinct voted, spoiled, and unvoted.
Hand counting is subject to error. Generally, if a hand count of a batch disagrees with the preliminary results for that batch, the hand count is repeated until it is clear that the discrepancy is real and not the result of error in the hand count. I have never seen written instructions that specified how many times to count by hand, when to conclude that the discrepancy is real, and so on. I'm sure some jurisdictions have such instructions, but I would guess that these vary considerably from county to county and state to state.
Hand counts are generally conducted by teams of two, three, or four people. In four-person teams, the usual approach is for one person to read the votes aloud, a second person to look over the reader's shoulder to make sure he or she read correctly, and two talliers to record the vote separately, so their tallies can be checked against each other.
Some jurisdictions sort the ballots by how the votes were cast, then count stacks of ballots (sort-and-stack). Some cut VVPAT rolls apart before counting; some keep the rolls intact. Some count one contest at a time on each ballot; some count all contests simultaneously. I am not aware of any scientific studies of the efficiency and accuracy of competing methods for hand counting. Anecdotal evidence suggests that sort-and-stack is less accurate than simply calling out the votes sequentially.
Selecting batches at random is essential to efficient auditing. As described below, methods for selecting random batches should involve a mechanical source of randomness, such as rolling dice. Many sampling schemes have been proposed, including simple random sampling, stratified simple random sampling, NEGEXP sampling, sampling with probability proportional to an error bound (PPEB), and combinations of random sampling and "targeted" sampling.
For all those approaches, it is possible to assess the evidence that the outcome is correct in light of the errors the audit finds. (See, for instance, this paper.) If the goal is to ensure that whenever the outcome is wrong, there is a large chance of a full hand count to set the record straight, it does not matter how large the initial sample is: A sensible procedure will examine the sample, assess the evidence that the outcome is correct, and require more counting if that evidence is not sufficiently strong. The evidence could be weak because the sample is small or because the margin is small or because the audit found many errors that favored the apparent winner. What is crucial to controlling the risk of certifying a wrong outcome is when to stop counting. The rule must ensure that the chance of stopping before a full hand count is small if the apparent outcome is wrong. See the citations to my work below.
Currently, legislation and discussions of auditing are bogged down in arguments about how large a sample to draw initially. I think this is quite counterproductive. It makes more sense to focus attention on rules for stopping the count.
Sound statistical rules for stopping short of a full hand count depend on the reported votes in each auditable batch, on how the sample is drawn, on the sample size, on the observed discrepancies in the sample, and on technical details including the choice of statistical tests. Some sampling schemes, such as PPEB, make it easier to devise tests that take into account the distribution of error in the sample and allow errors that hurt the apparent winner to strengthen the evidence that the outcome is correct. Other sampling schemes make it difficult to use anything but the largest (weighted) error observed in the sample to assess the evidence. This is an area of rapid progress. The most efficient method so far is based on PPEB and the Kaplan-Markov P-value. See the references below.
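As a concrete illustration of the Kaplan-Markov approach, here is a minimal sketch of the P-value calculation as I understand it from the references below. For a PPEB sample drawn with replacement, each sampled batch contributes a "taint" (its observed overstatement divided by its a priori error bound), and U denotes the total error bound summed over all batches. The sample sizes and taints below are hypothetical.

```python
# Sketch of the Kaplan-Markov P-value for a PPEB sample drawn with
# replacement. taint = observed overstatement / a-priori error bound
# for the sampled batch; U = total error bound over all batches.
# The numbers below are hypothetical.

def kaplan_markov_p(taints, U):
    """Upper bound on the P-value of the hypothesis 'the outcome is wrong'."""
    p = 1.0
    for t in taints:
        if t >= 1:          # a fully tainted batch: no confirmation possible
            return 1.0
        p *= (1 - 1/U) / (1 - t)
    return min(p, 1.0)

# e.g., 40 draws: 38 with no error, two with small positive taint
taints = [0.0] * 38 + [0.02, 0.01]
p = kaplan_markov_p(taints, U=50.0)
print("P-value:", round(p, 4))

if p <= 0.10:               # a 10% risk limit
    print("stop: outcome confirmed at the 10% risk limit")
else:
    print("evidence too weak: expand the sample or count everything")
```

Notice that errors favoring the apparent loser (negative taint) make each factor smaller, strengthening the evidence—one of the advantages of PPEB-based methods mentioned above.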
I think it is a bad idea to legislate particular audit methods, because that can lock in a method that does not work as claimed, or that is superseded by a more efficient method. Instead, I think audit laws should be as simple as possible and enunciate principles, rather than procedures. For instance, something like the following seems reasonable:
(i) Every statewide contest shall be audited to ensure that, if the preliminary outcome of the contest is incorrect, there is at least a C% chance that the audit will correct the outcome.
(ii) Contests that are not statewide but for which more than X registered voters are eligible to vote shall be audited to ensure that, if the preliminary outcome of the contest is incorrect, there is at least a D% chance that the audit will correct the outcome.
(iii) A random sample of Y% of the contests not audited under (i) or (ii) shall be audited to ensure that, if the preliminary outcome of the contest is incorrect, there is at least an E% chance that the audit will correct the outcome. The contests to be audited under this provision shall not be selected until after the preliminary results for all contests have been published.
(iv) At least Z% of the ballots cast in contests not audited under (i)–(iii) shall be selected at random and counted by hand. Elections officials shall report the strength of the evidence that the outcomes of those contests are correct. Strength of evidence shall be defined to be the smallest chance that the ballots would contain as little error as they were found to contain, if the outcome were in fact incorrect. The meaning of "little" shall be defined in regulation.
It would be up to lawmakers to choose C, D, E, X, Y, and Z. It would be a matter of regulation to specify how to achieve the limits on risk in (i)–(iii) and the measure of "little" and the calculation of the probability in (iv). Regulation could list one or more methods that are deemed by the Secretary of State to be acceptable, and principles and procedures for evaluating and gaining approval for other methods.
California AB 2023 (Saldana) calls for a pilot of risk-limiting audits in 2011. It is the first bill to get risk-limiting language right: It requires the audit to have a large chance of resulting in a full hand count if the electoral outcome is wrong, in which case, the hand count determines the official outcome.
As of this writing, there are methods to limit risk for every sampling scheme that I have seen proposed for post-election audits: simple random samples, stratified simple random samples, sampling with probability proportional to a bound on the error in each batch (PPEB), and sampling with probability related to the negative exponential of a bound on the error in each batch (NEGEXP). Moreover, there are techniques to incorporate "targeted," deterministic sampling into the risk calculations. See the citations below.
It should be noted that ensuring a high probability of correcting a wrong outcome is not the only sensible thing to do. For instance, ensuring that at most S% of certified outcomes are incorrect seems quite reasonable, and could well require less counting. However, there is as yet no method to accomplish the latter, which is tied to the idea of the False Discovery Rate.
Legislation should require disclosure of all algorithms and the source code for all software used in an audit—even commercial, off-the-shelf (COTS) software—so that the public can verify that the calculations were correct. The fact that software is commercial is no guarantee that it does what it's supposed to do: See the references below on known bugs in Excel.
Legislation should also require prompt publication of batch-level results, prior to the selection of audit samples. If stratified sampling is permitted, results should be published for all batches in a stratum before the sample is selected from that stratum. Batch-level results should be available in one easy-to-find location, for instance, the Secretary of State's website.
If batches are extremely small—for instance, single ballots—it might make more sense to report subtotals for somewhat larger groups of ballots, and have a mechanism that allows elections officials to commit indelibly to the subtotals for the small batches. For instance, the jurisdiction could put a digitally signed file in escrow with the Secretary of State; the auditors would compare hand counts to that reference copy to determine whether the original count has errors.
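One simple way to implement such a commitment—sketched here with hypothetical file contents—is to publish a cryptographic hash of the subtotal file before the audit, keep the file itself in escrow, and let anyone verify later that the escrowed file matches the published digest. A real deployment would presumably also sign the digest, as suggested above.

```python
# Sketch of an indelible commitment to batch subtotals: publish a
# cryptographic hash before the audit; later, anyone can re-hash the
# escrowed file and confirm it was not altered. The file contents and
# batch labels are hypothetical; a real system would also digitally
# sign the digest.
import hashlib

subtotals = (b"precinct-12,deck-A,Candidate X,41,Candidate Y,37\n"
             b"precinct-12,deck-B,Candidate X,52,Candidate Y,29\n")

commitment = hashlib.sha256(subtotals).hexdigest()
print("published commitment:", commitment)

# During the audit, the escrowed file is re-hashed and compared:
assert hashlib.sha256(subtotals).hexdigest() == commitment
print("escrowed subtotals match the published commitment")
```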
And legislation or regulation should require "sanity checks" such as (i) verifying that the reported vote total for a precinct does not exceed the number of registered voters in that precinct, (ii) (for paper ballots) verifying that the number of ballots returned voted, spoiled, and unvoted equals the number sent to the precinct, and (iii) verifying that the vote subtotals reported for the batches sum to the totals for each contest. Discrepancies should be posted in a central location, such as the Secretary of State's website.
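The three sanity checks are mechanical once batch-level data are available in structured form. Here is a minimal sketch with hypothetical per-batch records; the field names are invented for illustration.

```python
# Sketch of the three sanity checks (i)-(iii) above, run over
# hypothetical per-batch records: registered voters, ballot
# accounting figures, and reported subtotals for one contest.
batches = [
    {"registered": 400, "sent": 400, "voted": 310, "spoiled": 4,
     "unvoted": 86, "votes": {"X": 180, "Y": 120}},
    {"registered": 250, "sent": 250, "voted": 200, "spoiled": 2,
     "unvoted": 48, "votes": {"X": 90, "Y": 105}},
]
reported_totals = {"X": 270, "Y": 225}

problems = []
for i, b in enumerate(batches):
    # (i) reported votes cannot exceed registered voters
    if sum(b["votes"].values()) > b["registered"]:
        problems.append(f"batch {i}: more votes than registered voters")
    # (ii) ballots returned (voted + spoiled + unvoted) must equal ballots sent
    if b["voted"] + b["spoiled"] + b["unvoted"] != b["sent"]:
        problems.append(f"batch {i}: ballot accounting does not balance")

# (iii) batch subtotals must sum to the reported contest totals
for cand in reported_totals:
    if sum(b["votes"][cand] for b in batches) != reported_totals[cand]:
        problems.append(f"contest totals for {cand} do not match subtotals")

print(problems or "all checks pass")
```

Any nonempty list of problems would be posted centrally, per the recommendation above.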
There are very real limits to the amount and nature of work that elections officials can add to the burden of a canvass. Expecting elections officials to perform statistical computations is perhaps unreasonable, but expecting officials to report election results and audit results at the batch level would be reasonable if EMSs supported structured data output. EMSs currently are not designed to generate the reports needed to design and perform audits. As discussed below, trivial changes to EMSs could make auditing substantially easier and more efficient by exporting structured, machine-readable data at the level of arbitrarily small batches.
A number of logistical and legal questions need to be considered.
One political barrier that continues to surprise me is the insistence of some election integrity advocates and politicians on putting detailed methods into law, for instance, by mandating sample sizes in complicated tables and (mis)using statistical jargon. I think that makes bad law and bad audits.
Election management needs to change in a couple of simple ways for routine audits to be performed efficiently and economically. If these changes are brought about—through legislation, regulation, or market pressure—the rest of the auditing problem could be solved by "bolt-on" audit procedures. The latest and best audit procedures could be used as soon as the changes happen. Hence, I believe that the following structural changes should start now. The changes regard data export from election management systems, batch sizes, and data reporting.
Any audit requires batch-level data on the number of votes cast and how they were cast. As mentioned, EMSs generally do not support exporting batch-level data in a structured, machine-readable format. That needs to be addressed.
EMSs are generally built on databases that can be queried using SQL. It would be next to trivial to write SQL queries to export the data needed for audits—but it requires vendor cooperation. If states would demand that vendors provide this functionality, routine audits would be much easier and more efficient. However, such changes might require recertification of the EMS. On the other hand, query tools that could be used with a replicated database would suffice; whether that would be acceptable requires a legal determination of whether the audit is part of the canvass.
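To make the "next to trivial" claim concrete, here is a sketch using an in-memory SQLite database. The table schema and contest names are invented; a real EMS database would differ, but the export query itself would be comparably short.

```python
# Hypothetical illustration of how short the audit-export query could
# be. The schema and data are invented; a real EMS database would
# differ, but the query would be comparably simple.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE batch_results (
        batch_id TEXT, contest TEXT, candidate TEXT, votes INTEGER);
    INSERT INTO batch_results VALUES
        ('P12-A', 'Mayor', 'X', 41), ('P12-A', 'Mayor', 'Y', 37),
        ('P12-B', 'Mayor', 'X', 52), ('P12-B', 'Mayor', 'Y', 29);
""")

# The audit export: one machine-readable row per (batch, candidate).
rows = con.execute("""
    SELECT batch_id, candidate, SUM(votes)
    FROM batch_results
    WHERE contest = 'Mayor'
    GROUP BY batch_id, candidate
    ORDER BY batch_id, candidate
""").fetchall()

for row in rows:
    print(",".join(map(str, row)))
```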
The plumbing should be changed to make it easy to collect batch-level election data and batch-level audit data in a central location, such as the Secretary of State's website. This could be facilitated by adopting a standard structured data format for election data, such as OASIS/EML.
Generally, the smaller the auditable batches, the less counting is needed to confirm the outcome (at a given risk limit) if the outcome is correct. See Stark, P.B., 2010. Why small audit batches are more efficient: two heuristic explanations (http://statistics.berkeley.edu/~stark/Preprints/smallBatchHeuristics10.htm) for heuristic explanations of the effect of reducing batch size. Modest reductions in batch size can cut auditing burden enormously. For instance, simulations using data from Marin County, CA, show that reducing batch size from precincts (with VBM and in-person votes combined, as currently required by California law) to a maximum of 100 ballots could reduce risk by a factor of 10, counting the same number of ballots in all. California's 1% audit could actually confirm many outcomes if the batch size were reduced from entire precincts of combined in-person and VBM ballots to smaller batches, such as 25–100 ballots.
How can we reduce batch sizes? For DREs with VVPATS, we could audit machine results instead of precinct results; EMSs generally can track results separately by machine. If the VVPAT rolls corresponding to a given machine can be identified, those subtotals can be audited. Regardless of the voting technology, we can keep votes cast in person separate from votes cast by mail to create more, smaller batches.
For jurisdictions that use central-count optical scan systems (CCOS), ballots can be divided into small "decks" of no more than 100 ballots before counting. The jurisdiction would need to keep decks separate so that they can be counted by hand if the deck is selected for audit, and the EMSs would need to be able to report subtotals by deck. This approach will be tested in Marin County, CA, in November 2009.
For jurisdictions that use precinct-count optical scan systems (PCOS) here is an idea for reducing batch sizes without increasing the burden on pollworkers. EMSs can track votes by ballot style. Artificially increasing the number of ballot styles by marking groups of no more than 100 ballots with a barcode to identify them as a batch would allow existing software to track ballots in smaller batches (and potentially to report subtotals for those batches). It would not be necessary for jurisdictions to account for each ballot pseudo-style sent to a precinct separately; the difference between the styles is solely so that the EMS can tally subtotals for each batch and so that—if the batch is selected for audit—the ballots that comprise the batch can be identified.
Here is a sketch of how the idea would work. Consider a precinct with 800 registered voters, of whom 300 request absentee ballots. The jurisdiction would print three batches of 100 VBM ballots that are identical except for a barcode and a letter (A–C). The jurisdiction would print (up to) five batches of ballots to be used in the precinct, identical except for a barcode and a letter (D–H). When the ballots are scanned—whether using CCOS or PCOS—the barcode would make it possible for the EMS to subtotal batches of no more than 100 ballots. The data plumbing changes proposed elsewhere in this document would make it possible for the EMS to export those subtotals in a useful format.
If the audit selects one of the batches from a precinct to count by hand, the ballots in that precinct are then sorted manually (using the letter code) or with an automated sorter (using the barcode) to separate the batch that is to be counted by hand. There would be no need to sort and separate batches in precincts in which no batches are to be audited.
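The batch arithmetic in this sketch is simple enough to pin down in a few lines. The function below reproduces the example above (800 registered voters, 300 VBM requests, 100-ballot batches, letters assigned in order); the function name and letter-labeling scheme are illustrative, not part of any existing system.

```python
# Batch arithmetic for the pseudo-ballot-style sketch above: 100-ballot
# batches for a precinct with 800 registered voters, 300 of whom
# request VBM ballots. The labeling scheme is illustrative only.
from math import ceil
from string import ascii_uppercase

def pseudo_style_batches(registered, vbm_requests, batch_size=100):
    n_vbm = ceil(vbm_requests / batch_size)
    n_poll = ceil((registered - vbm_requests) / batch_size)
    letters = iter(ascii_uppercase)
    vbm = [next(letters) for _ in range(n_vbm)]
    poll = [next(letters) for _ in range(n_poll)]
    return vbm, poll

vbm, poll = pseudo_style_batches(800, 300)
print("VBM batches:", vbm)             # ['A', 'B', 'C']
print("in-precinct batches:", poll)    # ['D', 'E', 'F', 'G', 'H']
```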
Using small batches raises privacy concerns. Steps should be taken to ensure that pollworkers do not use the proliferation of ballot pseudo-styles to determine how particular subgroups of voters voted, for instance, by giving all senior citizens ballots of one style. Omitting the human-readable letter would improve voter privacy, since—on the assumption that people do not routinely read barcodes—neither voters nor pollworkers nor elections officials would know which pseudo-ballot-style a given voter received. To increase voter privacy, the ballot pseudo-styles could be shuffled before handing ballots to voters, so that there is no way to know which batch contains a given voter's ballot.
Reducing batch sizes can reduce the burden of hand counting by a factor of hundreds. It could increase costs in other ways, though. For instance, the physical batches of ballots need to be retrievable, which entails some organizational costs. Using ballot pseudo-styles would increase printing costs, modestly, I think—but I hope to assess this quantitatively. Reducing the size of the "remainders" would reduce some waste: For instance, if a precinct has 515 registered voters, one might print 75-ballot batches, so that the remainder is 515−450 = 65, rather than 515−500 = 15. Proliferating ballot pseudo-styles could greatly increase the cost of logic and accuracy testing if every ballot pseudo-style needs to be tested individually. But perhaps logic and accuracy testing could be done with a mix of ballot pseudo-styles for each precinct (say, half a dozen of each pseudo-style, shuffled together), instead of using separate testing for each ballot pseudo-style. There is clearly room for clever solutions. Longer term, voting systems should be designed to facilitate creating and reporting small batches.
Jurisdictions should be required to publish batch-level subtotals for all contests promptly and before auditing begins—preferably at a central location for the entire state, such as the Secretary of State's website. (Alternatively, the batch-level subtotals need to be committed to indelibly before the audit starts, as discussed above.) The report should include the number of registered voters in the batch, the number of ballots cast in the batch, and the number of votes cast for each candidate in each contest. Jurisdictions should also be required to publish audit procedures and audit results: the number of votes found for each candidate in each audited batch in each audited contest.
As noted above, jurisdictions should also be required to publish algorithms and source code for any software used in an audit.
It is crucial that the batches be selected at random. It is crucial that the random selection have a real, observable random input, such as rolls of 10-sided dice. However, it is impractical to roll dice for each batch to be selected. Instead, it makes sense to roll 10-sided dice a moderate number of times (e.g., 10–15) and use the result as a seed in a high-quality, open-source pseudo-random number generator (PRNG). Using an open-source PRNG enables the public to input the seed that was generated by dice rolls and confirm that the correct precincts were audited. Using a PRNG also makes it possible to have one public "drawing" from which any size sample can be generated by continuing the sequence that the PRNG produces.
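A minimal sketch of such a public drawing follows. Python's built-in Mersenne Twister stands in here for whatever open-source PRNG the rules specify; the dice rolls, batch labels, and sample sizes are hypothetical.

```python
# Sketch of a dice-seeded, publicly reproducible drawing: 15 rolls of a
# 10-sided die (announced publicly) form the seed for an open PRNG, and
# anyone can re-run the selection to confirm which batches were chosen.
# Python's Mersenne Twister stands in for whatever open-source PRNG the
# rules specify; rolls and batch labels are hypothetical.
import random

dice_rolls = [4, 0, 7, 2, 9, 1, 5, 5, 3, 8, 0, 6, 2, 7, 1]
seed = int("".join(map(str, dice_rolls)))

batch_ids = [f"batch-{i:03d}" for i in range(200)]

rng = random.Random(seed)
sample = rng.sample(batch_ids, 10)      # initial sample of 10 batches
print("initial sample:", sample)

# If the evidence is weak, the drawing continues from the same stream;
# no new public ceremony is needed to escalate the audit:
escalation = rng.sample([b for b in batch_ids if b not in sample], 5)
print("escalation sample:", escalation)
```

Anyone with the published seed and batch list can reproduce both draws exactly, which is the point of requiring an open-source PRNG.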
Using Excel to select random batches should be prohibited. The PRNG in Excel is known to be faulty, and it does not permit the user to specify a seed.
I expect that we will have very efficient audit methods applicable to a wide range of practices within about two years—probably before the data plumbing and batch-size issues can be worked out. The theoretical bottlenecks right now are in applying the most efficient sampling scheme (PPEB) to situations where logistical concerns mandate stratification (contests that cross jurisdictional boundaries). Drawing a stratified PPEB sample is not a problem, but analyzing the data from such a sample to control risk is an unsolved problem.
Experiments to determine the rate of manual counting errors using different counting strategies could be very helpful. Are 4-person teams more accurate than 3-person teams? Is sorting and stacking more accurate and faster than calling out each ballot? Is it more accurate to count one contest at a time, or to count every contest on a ballot at once? Which is faster? How do these findings depend on ballot design?
It would be helpful to have systematic data collection on discrepancies found by audits, the apparent causes of those discrepancies, and other variables such as the technology used to count votes, the ballot design, and voter instructions. Those data should be reported nationally in a central repository, perhaps on the US EAC's website. Such data would be useful not only for improving ballot design, voter instructions, and voting technology, but also for making informed decisions about when to require automatic recounts. For instance, if it were determined that CCOS misinterprets voter intent about 0.1% of the time for voter marked ballots that ask the voter to connect the candidate to the office by drawing a line, then if the reported margin in a contest using that voting system were 0.1% or below, it might make sense to skip the audit and simply count the entire contest by hand.
Post-election auditing has made enormous strides in the last few years. The paradigm has shifted from detecting error to confirming outcomes. It is now possible to determine the strength of the evidence that an outcome is correct using a wide variety of sampling schemes. Legislation and regulation are changing, but so far the only real success (in my opinion) is California AB 2023, which is still under consideration. I would consider the new laws that have passed to be failures because they actually preclude doing a good job. Legislation that mandates particular sampling schemes, sample sizes, etc., is counterproductive. I think the place to focus immediate attention is on legislation and regulation to get the plumbing in place that will enable effective, efficient audits. The methods to use the data will be available by the time the data are there to be used.
Ash, A., S. Pierson, and P.B. Stark, 2009. Thinking outside the urn: Statisticians make their marks on U.S. Ballots. Amstat News, 384. 37–40. (reprint: http://www.amstat.org/outreach/pdfs/SP_ANJun09.pdf)
Hall, J.L., L.W. Miratrix, P.B. Stark, M. Briones, E. Ginnold, F. Oakley, M. Peaden, G. Pellerin, T. Stanionis and T. Webber, 2009. Implementing Risk-Limiting Audits in California, USENIX EVT/WOTE. (preprint: http://arxiv.org/abs/0905.4691)
Miratrix, L.W., and P.B. Stark, 2009. Election Audits using a Trinomial Bound. IEEE Transactions on Information Forensics and Security. Accepted. http://statistics.berkeley.edu/~stark/Preprints/trinomial09.pdf
Stark, P.B., 2009. Efficient post-election audits of multiple contests: 2009 California tests. Refereed paper presented at the 2009 Conference on Empirical Legal Studies. (preprint: http://ssrn.com/abstract=1443314)
Stark, P.B., 2009. Risk-limiting post-election audits: P-values from common probability inequalities. IEEE Transactions on Information Forensics and Security. Accepted. http://statistics.berkeley.edu/~stark/Preprints/pvalues09.pdf
Stark, P.B., 2009. CAST: Canvass Audits by Sampling and Testing. IEEE Transactions on Information Forensics and Security: Special Issue on Electronic Voting. Accepted. http://statistics.berkeley.edu/~stark/Preprints/cast09.pdf
Stark, P.B., 2008. A Sharper Discrepancy Measure for Post-Election Audits, The Annals of Applied Statistics, 2, 982–985. http://arxiv.org/abs/0811.1697
For issues with and horror stories about spreadsheets more generally (not just bugs), see the European Spreadsheet Risks Interest Group. The user interface (UI) of spreadsheets invites errors, then makes those errors hard to find, in part because the UI conflates input, code, output, and presentation. That conflation also makes unit testing spreadsheets difficult.
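By contrast, ordinary code separates the logic from the data, so a calculation can be tested in isolation. A minimal sketch of what a unit test buys you (the vote-total example is mine):

```python
def contest_margin(votes: dict) -> float:
    """Margin between the top two vote-getters, as a fraction of votes cast."""
    totals = sorted(votes.values(), reverse=True)
    return (totals[0] - totals[1]) / sum(totals)

# The logic is pinned down independently of any particular spreadsheet cell,
# so a formula error cannot hide behind formatting or a pasted-over value.
assert abs(contest_margin({"A": 510, "B": 490}) - 0.02) < 1e-12
```

In a spreadsheet, the equivalent computation lives in cells that also hold the data and the presentation, so there is no natural place to put such a check.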
This bibliography singles out Excel, but Excel is not the only commercial, off-the-shelf computational software with bugs. I have run into bugs in both MATLAB and SAS that produced seriously erroneous numerical results. But Excel is very widely used, and this literature is at hand.
McCullough, B.D., and D.A. Heiser, 2008. On the accuracy of statistical procedures in Microsoft Excel 2007. Computational Statistics and Data Analysis, 52(10), 4570–4578.
Excerpt: Excel 2007, like its predecessors, fails a standard set of intermediate-level accuracy tests in three areas: statistical distributions, random number generation, and estimation. Additional errors in specific Excel procedures are discussed. Microsoft's continuing inability to correctly fix errors is discussed. No statistical procedure in Excel should be used until Microsoft documents that the procedure is correct; it is not safe to assume that Microsoft Excel's statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.
If users could set the seeds, it would be an easy matter to compute successive values of the WH RNG and thus ascertain whether Excel is correctly generating WH RNGs. We pointedly note that Microsoft programmers obviously have the ability to set the seeds and to verify the output from the RNG; for some reason they did not do so. Given Microsoft's previous failure to implement correctly the WH RNG, that the Microsoft programmers did not take this easy and obvious opportunity to check their code for the patch is absolutely astounding.
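For concreteness, the Wichmann–Hill generator (algorithm AS 183) that Excel claims to implement is just three small linear congruential generators combined, so with the seeds exposed, anyone can reproduce the stream and check an implementation value by value. A sketch of the published algorithm (this is AS 183 itself, not Microsoft's code):

```python
def wichmann_hill(s1: int, s2: int, s3: int):
    """Generator yielding the AS 183 (Wichmann-Hill) pseudo-random stream
    for the given three seeds. Each output is a float in [0, 1)."""
    while True:
        s1 = (171 * s1) % 30269
        s2 = (172 * s2) % 30307
        s3 = (170 * s3) % 30323
        yield (s1 / 30269 + s2 / 30307 + s3 / 30323) % 1.0

# With known seeds the stream is fully reproducible, which is exactly what
# makes verification of a claimed implementation straightforward.
stream = wichmann_hill(1, 2, 3)
first_values = [next(stream) for _ in range(5)]
```

This is the point of the excerpt: had Microsoft's programmers simply run a few seeds through the reference algorithm and compared, the bug would have been obvious.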
McCullough, B.D., 2008. Microsoft's 'Not the Wichmann-Hill' random number generator. Computational Statistics and Data Analysis, 52(10), 4587–4593.
McCullough, B.D., and B. Wilson, 2005. On the accuracy of statistical procedures in Microsoft Excel 2003. Computational Statistics and Data Analysis, 49(4), 1244–1252.
Knüsel, L., 2005. On the accuracy of statistical distributions in Microsoft Excel 2003. Computational Statistics and Data Analysis, 48(3), 445–449.
I thank Mark Lindeman for encouraging me to write this and for helpful comments. I thank Jennie Bretschneider, Elaine Ginnold, Aviva Shimelman, and Mitch Trachtenberg for helpful comments. Opinions and errors are, of course, my own.
P.B. Stark, 3 April 2010