Does Treatment Have an Effect?

The problem of determining whether a treatment has an effect is ubiquitous in science, engineering, social science, economics, business, and many other fields. Treatment is meant generically: It could be a magnetic field, a metallic coating, welfare, decreasing the marginal income tax rate, a drug, a fertilizer, or an advertising campaign. Effect could be to bend light, to increase the durability of a part, to decrease the crime rate, to increase savings, to relieve headaches, to increase crop yields, or to increase sales. To evaluate whether a treatment has an effect, it is crucial to compare the outcome when treatment is applied (the outcome for the treatment group) with the outcome when treatment is withheld (the outcome for the control group), in situations that are as alike as possible but for the treatment. This is called the method of comparison. How individuals come to be in the treatment group is important, too: This distinguishes experiments from observational studies.

The most reliable way to determine whether a treatment has an effect is to compare the outcome for the treatment group with the outcome for a control group, using a random mechanism to allocate individuals between the treatment group and control group. This is called a controlled randomized experiment. If the individuals are people, taking precautions to ensure that they do not know whether they are in the treatment group or the control group can reduce confounding—this is called blinding. If evaluating the outcome involves subjective judgment, it is better if the evaluators do not know which individuals are in the treatment group and which are in the control group. When combined with blinding, this is called double-blinding.

The Method of Comparison

The method of comparison is arguably the most important idea in science: To determine whether a treatment has an effect, compare what happens with and without the treatment.

For example, this class is taught using Internet-based materials. How might we assess whether this way of teaching is effective? Suppose we simply look at the final exams of students who take the class. They will answer some questions correctly. Does that show that the materials are effective? Obviously not: The students might have known the answers to those questions before taking the course; they might have learned the answers from some source other than the online material; and they might have gotten the answers right by lucky guessing.

To determine whether students learn anything from the class (the treatment), we need to compare the state of their knowledge after taking the class with something else. The "something else" could be students who have not taken the class, but it would be even better to compare the students with themselves before they took the class: We can control for what the students know when they enter the class by administering a pretest at the beginning of the term, so that we can compare the final exam scores with the pretest scores to measure (in some way that we shall leave vague) the increase in their knowledge of Statistics. This is an example of the method of comparison: Compare student performance on an exam before and after the treatment, which in this case is taking the Web-based class. In the method of comparison, one compares the outcome with treatment to the outcome without treatment.

The method of comparison is one of the most important and fundamental empirical techniques, because it can (in principle) isolate the effect of one factor—the treatment. In this example, using the method of comparison helps us separate the effect of the instructional materials from what people happened to know before coming into the class.

More interesting and perhaps more important than whether students learn anything at all from an online class is how online instruction succeeds compared with traditional instruction. To isolate the difference attributable to the method of instruction, we need to compare how much students learn in a traditional statistics class with how much students learn in a Web-based class. This is an example of the method of comparison applied to two treatments: One group receives traditional lecture instruction, and the other group receives online instruction.

The method of comparison is a general and flexible tool for evaluating whether a treatment is effective, be it a drug, fertilizer, teaching technique, surgical procedure, quality control measure, car wax, or advertising campaign. The basic idea is to compare what happens with and without the treatment, to isolate the effect of the treatment. Often only some of the individuals are treated, and the outcome for them is compared with the outcome for individuals who are not treated. The group that receives treatment is called the treatment group, and the group that does not is called the control group. Sometimes, as in the pretest/final-exam comparison, a single group's responses with and without treatment are compared (the individuals are their own controls). Sometimes, as in comparing traditional instruction with online instruction, there are two or more treatment groups, and the responses of different groups are compared. The group that takes the traditional course probably would be considered the control group in that example, because that is the default method of teaching. These are variations of the method of comparison. Without controls, we have no idea whether the treatment is the cause of any observed response. With controls, it can be possible to determine whether treatment has an effect.

Confounding

Even with the method of comparison, differences between the control group and the treatment group other than the treatment can be responsible for observed differences in outcome between the two groups. This is called confounding. Confounding can hide a real effect, or can produce the spurious appearance of a treatment effect when the real cause is a difference between the treatment and control groups other than the treatment. Individuals' responses to treatment differ, as do individuals' responses in the absence of treatment. Some causes of those differences might be known, but many are not. If the treatment group predominantly contains individuals who would do well (or who would do poorly) whether or not they received treatment, we cannot separate the effect of the assignment from the effect of the treatment: Confounding still could be responsible for an apparent treatment effect, or could obscure a real treatment effect.

Example (from Wang, 1993). The Trenton Times, quoting the New York Times, reported on February 10, 1988, that the college entry test "coaching industry is playing on parental anxiety," arguing that coaching does not improve test scores. The data on which they base this conclusion were gleaned from questionnaires sent to 1,409 Harvard University freshmen in the fall of 1987. Of those surveyed, 69% said they had received no coaching, and 14% said they had received coaching (what happened to the other 17%?). The survey compared the verbal and mathematical SAT scores of the coached and uncoached students.

Because the scores of the students who were coached were, on average, lower than those of the students who were not coached, the author argued that coaching does not help.

What else might be going on?

If the students who sought coaching were weaker on the whole than those who did not (after all, why did they seek help?), we would expect the students who sought coaching to do worse than those who did not, unless coaching were so effective that it more than wiped out the natural difference. The propensity to seek coaching is confounded with the effect of coaching, if any.

Is coaching helpful? One cannot say from these data.

Historical Controls

Sometimes a treatment group is compared to individuals from some other epoch who did not receive the treatment. For example, one might compare the clinical outcomes of patients who undergo a new surgical procedure to treat a diagnosed condition with the clinical outcomes of patients diagnosed with that condition before the surgical procedure was available. However, with historical controls, the control and treatment groups tend to differ in more ways than just the treatment, and confounding tends to be a problem.

Longitudinal and Cross-Sectional Comparisons

Time is often considered to be a treatment, for example, in studying the effect of aging. There are two common strategies to study the effect of time: compare individuals of different ages at a single moment in time, and follow individuals over time as they age. The first is called a cross-sectional comparison or a cross-sectional study; the second is called a longitudinal comparison. Cross-sectional comparisons are more prone to suffer from confounding. Some examples follow.

Freedman et al. (1997) present data from the Health and Nutrition Examination Survey of 1976–1980 (HANES). The Public Health Service examined a cross section of Americans whose ages ranged from 1 to 74. HANES data on average height versus average age, in age groups spanning about 10 years, show that heights decrease consistently from about age 20, when the average height of men is about 70 inches, to age 70, when the average height of men is about 68 inches. Similarly, men's weights seem to be lowest for those in their twenties, peak around age 40–50, then decrease. Does this mean that as men age they get shorter and fatter?

There is a similar example in Huff (1993): The angle between women's feet is larger for older women (at least, at the time the book was written) than for younger women. Does this mean that as women age, their feet turn out?

Both the HANES study described by Freedman et al. and the example described by Huff are cross-sectional comparisons: The people compared are a cross section of the population at a particular time. They differ from each other in many more ways than just their ages. They grew up in different times, when eating and exercise habits were different, when different amounts of hormones and antibiotics were fed to animals destined for human consumption, etc. In fact, what the data reflect is a secular trend: The HANES data show that at maturity, people are taller and heavier than they used to be. In Huff's example, the cause of the apparent age effect is that women used to be encouraged to walk with their feet turned out: It was considered more elegant. Those women grew up to become a significant fraction of the sample of older women. Their feet didn't turn out more as they aged, and the feet of the younger women in the sample probably won't either. Both of these examples illustrate a secular trend confounding with age.

One cannot draw reliable conclusions about the effect of age using cross-sectional comparisons, because of confounding. Longitudinal comparisons—where the investigators follow subjects over time, comparing each subject to himself or herself at different ages—provide more persuasive evidence for the effects of age than do cross-sectional studies. In longitudinal comparisons many possible confounding factors cancel in comparing each individual with himself. However, longitudinal comparisons are more difficult, more expensive, and more time consuming than cross-sectional comparisons: The investigator must keep track of the subjects over time, keep in touch with them, maintain records, and wait for years or decades to pass to collect the data and publish the results. Attrition tends to be high, and patience tends to be low.

Simpson's Paradox

A classic example of confounding is Simpson's Paradox: what is true for the parts is not necessarily true for the whole. Freedman et al. (1997) give an example of prima facie gender bias in graduate admissions to the University of California at Berkeley (UCB). In 1973, 8,442 men and 4,321 women applied to graduate programs at UCB. About 44% of the men and 35% of the women were admitted. This looks like women might have been discriminated against, if applicants of both genders were equally qualified.

The table below gives a breakdown of the applicants to the six largest departments, denoted A–F, which together account for more than a third of all applicants.

1973 U.C. Berkeley Graduate Applications and Admissions, 6 Largest Majors

                  Men                      Women
Major     Applied   % Admitted     Applied   % Admitted
A             825        62            108        82
B             560        63             25        68
C             325        37            593        34
D             417        33            375        35
E             191        28            393        24
F             373         6            341         7

In most departments, women are admitted at a higher rate than men (C & E are the exceptions). How can women be admitted at a lower rate overall, if they are admitted at a higher rate in almost every department?

A larger fraction of women than men apply to departments with low admission rates. Differences in the admission rates of departments show up as an apparent difference in the admission rates for different genders. The effect of gender is confounded with a difference in the admission rates in different departments.
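To make the reversal concrete, here is a minimal sketch in Python (the data are taken from the table above; the variable names are ours, introduced only for illustration):

    # Applications to the six largest departments:
    # (men applied, % of men admitted, women applied, % of women admitted)
    data = {
        "A": (825, 62, 108, 82),
        "B": (560, 63, 25, 68),
        "C": (325, 37, 593, 34),
        "D": (417, 33, 375, 35),
        "E": (191, 28, 393, 24),
        "F": (373, 6, 341, 7),
    }

    # Within each department, compare the admission rates directly.
    for dept, (m_app, m_pct, w_app, w_pct) in data.items():
        print(f"{dept}: men {m_pct}%, women {w_pct}%")

    # Pool the six departments: total admitted divided by total applied.
    men_admitted = sum(m_app * m_pct / 100 for m_app, m_pct, _, _ in data.values())
    men_applied = sum(m_app for m_app, _, _, _ in data.values())
    women_admitted = sum(w_app * w_pct / 100 for _, _, w_app, w_pct in data.values())
    women_applied = sum(w_app for _, _, w_app, _ in data.values())
    print(f"pooled: men {100 * men_admitted / men_applied:.1f}%, "
          f"women {100 * women_admitted / women_applied:.1f}%")

The pooled rates come out to about 44.5% for men and about 30.3% for women, even though women have the higher admission rate in four of the six departments: most of the women applied to the departments that admit few applicants of either gender.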

What would be a more sensible measure of the gender-specific admissions rate than the overall rate? One possibility is a weighted average, where we assign greater weight to the departments with more applicants, as in the following table.

Weighted Mean Graduate Admission Rates for Men and Women, 6 Largest Majors, U.C. Berkeley, 1973

Major                 Applicant   % Men      # Appl.     % Women    # Appl.
                      pool        admitted   × % men     admitted   × % women
A                        933         62        57,846       82        76,506
B                        585         63        36,855       68        39,780
C                        918         37        33,966       34        31,212
D                        792         33        26,136       35        27,720
E                        584         28        16,352       24        14,016
F                        714          6         4,284        7         4,998
Total A–F              4,526                  175,439                194,232
Total/(Total appl.)        1                    38.76                  42.91

The weighted average gives a more accurate impression of the relative acceptance rates: women were admitted at a higher rate by most departments.

In computing a weighted average, the last step is always to divide by the sum of the weights. In an unweighted average, every observation implicitly gets the same weight, one; in that case, dividing by the sum of the weights is dividing by the number of things averaged, which leads to the usual mean.
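As a short sketch of the computation in Python (the weights here are the applicant pools from the table above; the helper name weighted_mean is ours):

    def weighted_mean(values, weights):
        """Sum of weight times value over all observations, divided by the sum of the weights."""
        return sum(w * v for v, w in zip(values, weights)) / sum(weights)

    # Applicant pools and admission rates (%) for departments A-F, from the table above.
    pools = [933, 585, 918, 792, 584, 714]
    men_rates = [62, 63, 37, 33, 28, 6]
    women_rates = [82, 68, 34, 35, 24, 7]

    print(weighted_mean(men_rates, pools))    # about 38.76
    print(weighted_mean(women_rates, pools))  # about 42.91

With every weight equal to one, weighted_mean reduces to the ordinary mean, as the paragraph above notes.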


Experiments and Observational Studies

To prevent confounding, the treatment and control groups should be alike in every regard that can affect the outcome, except the treatment. Then, differences between the outcomes for the treatment group and for the control group can be ascribed to the effect of the treatment, rather than to other variables that differ for the two groups. As a practical matter, it can be hard to ensure that the two groups are alike: Often nature, history, or the individuals themselves divide the treatment group from the control group. Moreover, sets of subjects usually do not come in matched pairs, one to assign to treatment and one to control—although identical twins are very popular medical subjects!

If the investigator gets to decide who will receive the treatment, the investigation is called an experiment. An experiment need not use the method of comparison to isolate the effect of treatment using controls, but good ones do. Some experiments merely select a collection of subjects, treat all of them, and report what happens. Experiments that use the method of comparison are called controlled experiments. Experiments that do not are called uncontrolled experiments. Inferences about the effect of treatment on the basis of uncontrolled experiments are suspect.

If the investigator merely observes whether subjects are treated, rather than selecting which subjects will be treated, the investigation is called an observational study. Observational studies also can be controlled or uncontrolled; better observational studies use controls. Generally, inferences from observational studies are less reliable than inferences from experiments—but see the section about John Snow's study of the mode of communication of cholera later in this chapter for an example of an observational study as compelling as the best controlled experiments.

Investigations of the effect of diet and other behavioral variables on human subjects tend to be observational studies rather than experiments, because it is hard to get people to do what you ask them to—and even harder to get them to keep doing it for months or years. For example, it is hard to find subjects who would be willing to smoke for 20 years or to refrain from smoking for 20 years, according to whether they were assigned to treatment or to control. The impossibility of an experiment to study the effect of gender or of height is even clearer: an investigator cannot intervene and change the gender or height of a subject. In the Simpson's paradox example earlier in the chapter, the investigators could not decide which applicants would be male and which female, nor could they decide which department each applicant would apply to—so the investigation is an observational study. Observational studies are prone to suffer from confounding.

The key difference between an experiment and an observational study is whether the researcher chooses who will receive the treatment. In an observational study, the researcher merely observes what happens to the treatment group, or compares those observations with observations of the control group, but the assignment to treatment or control is up to nature, historical accident, or the subjects themselves. "Choosing" is different from "classifying:" an investigator studying the effects of smoking on health can classify individuals as smokers or nonsmokers, but cannot choose who will smoke and who will not. That is why the controlled experiments on smoking are done with animals, and the human evidence is from observational studies.

Assessing Online Instruction

Let us continue our discussion of assessing the effectiveness of online instruction compared with a traditional Statistics class. How should we divide the students between the treatment group, who will use the online materials, and the control group, who will use traditional materials? Suppose we allow the students to choose whether to take a traditional course or a Web-based course. This would be an observational study. Students who choose the Web-based version might tend to be more technically inclined than those who choose the traditional course, and more technically inclined students might tend to do better in Statistics, regardless of the mode of instruction. This is an example of confounding from self-selection—the subjects decide whether they are to be in the treatment or control group, and factors that influence their decision also tend to influence the outcome.

We might instead try to balance the allocation of students between the two groups, for example, by having them fill out a questionnaire about their mathematical and computer background, interests, and skills, and deliberately trying to have a similar number from each level of background, interest, and skill in the treatment group and the control group. That is likely to be an improvement over self-selection, but much like a quota sample, it will tend to be biased by confounding with variables we have not thought of, cannot measure, or simply cannot balance within the population of students available to us for the experiment—but that are associated with the outcome we are studying.

However, if subjects are assigned at random to either treatment or control (for example, by tossing a coin), there is a tendency for differences other than the treatment to average out: Differences among the subjects, including those differences that can affect the outcome, tend to be distributed "fairly" between the treatment and control groups, whether or not we know what those differences are. Randomized assignment thus is the preferred way to assign subjects to the treatment or control group. An experiment in which chance is deliberately introduced into the assignment of subjects to treatment and control (to mitigate possible biases, conscious or unconscious, that might otherwise affect the outcome) is called a randomized controlled experiment.
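The balancing tendency of randomization is easy to see in a simulation. Here is a minimal sketch in Python; the numbers of subjects and the covariate are invented purely for illustration:

    import random

    random.seed(1)  # fixed seed so the illustration is reproducible

    # 100 subjects; 40 have some trait that affects the outcome, 60 do not.
    subjects = [1] * 40 + [0] * 60

    counts = []
    for _ in range(1000):  # repeat the random assignment many times
        random.shuffle(subjects)
        treatment = subjects[:50]  # first 50 after shuffling form the treatment group
        counts.append(sum(treatment))

    # The trait tends to split about evenly (about 20 per group of 50),
    # whether or not the experimenter knows the trait exists.
    print(sum(counts) / len(counts))  # close to 20
    print(min(counts), max(counts))   # occasional imbalance, rarely extreme

The same averaging-out applies simultaneously to every other characteristic of the subjects, known or unknown, which is what makes randomized assignment so powerful.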

Professor Jerald G. Schutte of the Sociology Department at California State University, Northridge, used a randomized, controlled experiment to study the effectiveness of online Statistics instruction (see http://www.csun.edu/sociology/virexp.htm). Professor Schutte compared the performance of 33 students, 17 of whom were taught introductory Statistics for Sociology in a traditional lecture class, and 16 of whom were taught in a "virtual class" consisting of online instructional material and software for the students to have interactive real-time chats online. (The online materials were unrelated to this text.) Prof. Schutte found that the students in the virtual class did better, on the average. He attributed the difference to the fact that students in the virtual class spent more time studying the material, and that they used tools for collaborative learning.

There is an apocryphal story of testing a seasickness preventative, in which the captain of a ship was given a quantity of pills to evaluate during a cruise. Afterwards, he reported that the preventative worked perfectly: No one who took it got sick, but many who did not take it did get sick. It turned out that the captain had given the preventative only to the crew, because their health was more important to the functioning of the ship than was that of the passengers. For obvious reasons, that experiment does not prove anything about the effectiveness of the preventative: The control and treatment groups differed in a way that confounds with the treatment effect.

If the captain instead had decided randomly who would get the pill, with every individual having the same chance of being chosen, an approximately proportionate number of crew members would have been in the treatment and control groups. Differences between individuals that affect their propensity for seasickness (such as having spent much of their lives at sea) would tend to be distributed evenly across the treatment and control groups, leaving the pill as the primary difference between the groups. Then we might be able to draw conclusions about the effectiveness of the preventative.

The Placebo Effect

Especially when dealing with human subjects, the mere belief that one is receiving treatment can affect the outcome. (For example, it is well known that taking a sugar pill—a placebo with no pharmacological benefit—with the belief that it is a pain reliever actually reduces subjective pain. This is called the placebo effect.) For this reason, experiments often involve giving a placebo to the control group, so that the only difference between the control and treatment groups is the contents of the "remedy" they are given, not whether they are given a "remedy." This is the essence of a blind experiment: The subjects do not know whether they are in the treatment or control group.

The conscious and unconscious hopes and expectations of the people who evaluate the subjects can also bias the results, especially when evaluating the outcome involves some subjective judgment, as in assessing whether a patient's condition has improved. (Most researchers do care about the outcome of their experiments!) For this reason, it is desirable that the person or persons assessing the subjects not know to which group the subjects belong. (Of course, someone has to keep track of which subjects comprise the treatment group and which comprise the control group in order ultimately to analyze the results.) When neither the subjects nor the evaluator(s) knows which group the subjects are in, the experiment is said to be double-blind. The best experiments on human subjects are randomized, controlled, and double-blind.

Determining the Effect of Treatment

The method of comparison isolates the effect of a treatment by comparing the outcome for a group that receives treatment (the treatment group) to the outcome for a group that does not (the control group).

To reduce confounding and other biases, the treatment group and control group should be as similar as possible in all respects except the treatment.

The best way to ensure that the groups are similar is to use randomization to assign individuals to the treatment group or the control group; that yields a controlled, randomized experiment.

Randomized assignment mixes up other factors that might affect the outcome, in a way that tends to be fair to the treatment and the control.

With randomized assignment, the effects of other factors tend to average out between the treatment and control groups, which reduces confounding.

With human subjects, if a subject believes she or he is receiving treatment, that in itself can affect the outcome. For that reason, it is important to use a placebo to ensure that the difference between the control and treatment groups is just the treatment, not the treatment plus the belief that one is receiving treatment.

When there is any subjective element to assessing the outcome, it is best if the person making the assessment does not know which subjects are in the control group and which are in the treatment group.

John Snow's study of the mode of communication of cholera: a natural experiment

John Snow's study of how cholera is communicated is an extremely rare example of an observational study with as much persuasive power as a randomized controlled experiment—as if Nature had run a controlled randomized experiment on Snow's behalf. The account in this section is adapted from a paper by David Freedman.

John Snow was a nineteenth-century physician in London, England. In 1855, decades before the germ theory of disease was accepted, Snow showed that cholera is caused by an infectious organism that lives in water. His argument had many facets: an apparent time lag between infection and symptoms, explained by the time it takes the organism to reproduce in the human body; propagation of the disease along trade routes; the fact that sailors visiting ports where there was cholera did not get sick until they came in contact with locals; and his identification of the first and second cases in the 1848 London cholera epidemic (the first was a seaman named John Harnold, who had just come from Hamburg, Germany, where there was a cholera outbreak; the second was the person who stayed in the room Harnold had used, after Harnold died).

Snow found apartment buildings where many people had died, adjacent to apartment buildings where few or none had died; their water suppliers differed. Following an outbreak of cholera in 1854, Snow made a map of the residences of victims. They were concentrated near a public water pump: the Broad Street pump in Soho. A few buildings in the area were relatively unaffected by cholera; it turned out that their water supplies were different (a brewery and a poorhouse, both of which had their own pumps). Snow showed that most of the cholera victims in other parts of London had drunk from the Broad Street pump.

At the time, there were several water companies in London. The companies drew their water from different parts of the Thames river, and they treated the water differently. Snow found that cholera was more prevalent in buildings served by water companies that drew their water from dirty parts of the river, with the exception of one company, which purified its water effectively. One of the water companies (Lambeth) started drawing its water further upstream in 1852 (water is cleaner upstream of the city, because refuse and sewage are dumped into the river as it flows through the city). This allowed Snow to compare the rates of cholera in the 1853–1854 epidemic (in which about 2,800 people died, more than 500 in a single 10-day period) with earlier epidemics, when the Lambeth company, along with one of its competitors, the Southwark and Vauxhall company, drew its water further downstream. It turned out that which buildings were served by which water company was largely accidental: other than water supplier, there was not much difference that could account for the differences in the rate of cholera. Snow wrote:

Although the above facts shown in the table above afford very strong evidence of the powerful influence which the drinking of water containing the sewage of a town exerts over the spread of cholera, when that disease is present, yet the question does not end here; for the intermixing of the water supply of the Southwark and Vauxhall Company with that of the Lambeth Company, over an extensive part of London, admitted of the subject being sifted in such a way as to yield the most incontrovertible proof on one side or the other. In the subdistricts enumerated in the above table as being supplied by both Companies, the mixing of the supply is of the most intimate kind. The pipes of each company go down all the streets, and into nearly all the courts and alleys. A few houses are supplied by one Company and a few by the other, according to the decision of the owner or occupier at that time when the Water Companies were in active competition. In many cases a single house has a supply different from that on either side. Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies. Now it must be evident that, if the diminution of cholera, in the districts partly supplied with improved water, depended on this supply, the houses receiving it would be the houses enjoying the whole benefit of the diminutions of the malady, whilst the houses supplied by the water from Battersea Fields would suffer the same mortality as they would if the improved supply did not exist at all. As there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded, it is obvious that no experiment could have been devised which would more thoroughly test the effect of water supply on the progress of cholera than this, which circumstances placed ready made before the observer.
The experiment, too, was on the grandest scale. No fewer than three hundred thousand people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into groups without their choice, and in most cases, without their knowledge; one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients; the other group having water quite free from such impurity.
To turn this grand experiment to account, all that was required was to learn the supply of water to each individual house where a fatal attack of cholera might occur.

That is, nature essentially performed a controlled randomized experiment, in which the control and treatment groups differed in their water supply, but not in other factors that might have affected the outcome. The assignment to treatment and control was not exactly random, but was effectively random, depending on accidents of history. This is called a natural experiment. Moreover, there was no possibility of self-selection, because individuals (at that time) could not choose their water supplier, nor where the water supplier drew its water: it was the Lambeth water company that changed its intake, and its customers were stuck with the result, good or bad. The "experiment" was essentially blind, because even those who knew their water supplier or knew that Lambeth changed its source were unlikely to have suspected a link between water quality and cholera. Snow's findings are in the table below.

Cholera deaths by water source, London epidemic of 1853–1854 (Snow's Table IX; reproduced from Freedman, 1999)

Water supplier              Houses    Deaths from cholera    Deaths per 10,000 houses
Southwark and Vauxhall      40,046          1,263                      315
Lambeth                     26,107             98                       37
Rest of London             256,423          1,422                       59

The rate of deaths from cholera among persons whose water came from the Southwark and Vauxhall company is roughly nine times larger than it is for persons whose water came from the Lambeth company. Other than the difference in water supplier, the buildings are alike with respect to demographics of the occupants, construction, etc. The sample was large (over 300,000 people), and it was representative of London as a whole. This is an example of applied statistics at its best.

"No causation without manipulation." (P. Holland)

One should be skeptical of observational studies that claim to infer a causal relationship between one variable and another.

Unless one variable is deliberately manipulated (unless an experiment is performed), and unless the method of comparison is used (unless the experiment is a controlled experiment), causal inferences are rarely warranted.

Moreover, confounding is the rule—not the exception—unless the assignment to treatment and control is randomized.

Dowsers claim to be able to find water and minerals underground, or hidden or missing objects, typically using a forked stick called a dowsing rod. According to a 1997 article in Swift, a publication of the James Randi Educational Foundation, there were about 10,000 active dowsers in Germany alone, generating an estimated $50 million in annual revenue.

In 1989–1990, a scientific test of dowsing was performed in Kassel, Germany, under the auspices of the German skeptics' society Gesellschaft zur wissenschaftlichen Untersuchung von Parawissenschaften (GWUP; Society for the Scientific Examination of Parasciences) by three scientists: Robert König, PhD (Justus-Liebig Universität, Gießen), Jürgen Moll, PhD (Vice-managing-director of GWUP), and Amardeo Sarma, Ing. (Scientific Collaborator at the Forschungsinstitut der Deutschen Bundespost Telekom, and Managing Director of GWUP since 1987). The test is described in the January 1991 issue of Skeptiker, a publication of the GWUP (R. König, J. Moll, and A. Sarma, 1991. Wünschelruten-Test in Kassel, Skeptiker, pp. 4–10; part of the article is translated into English in the article in Swift cited above). The experiment was funded by the Hessischer Rundfunk TV network, which videotaped the experiment for a planned special on dowsing; the tests were performed on its property near its headquarters in Kassel, Germany. James Randi provided the prize money ($10,000) in the event that a participant could demonstrate successful dowsing (as defined by the rules of the test), and helped design the experiment.

The experiment began in 1989, with a press release announcing the experiment and the prize. The announcement sparked a number of newspaper articles, which led about 100 dowsers from a variety of European countries to respond to the GWUP. Based on what the responding dowsers claimed to be able to do, the scientists designed two experiments: one to test the ability of dowsers to determine whether water is flowing in a buried pipe (the water test), and one to test the ability of dowsers to sense hidden objects made of various materials (the box test).

The Water Test

The water test was designed to determine if the dowsers could detect whether water was flowing in a buried pipe. The experimental apparatus was a 130ft long network of pipes buried 20in deep, a valve, a pump, and two tanks. The diameter of the pipe was 2.5in. The ground above the pipe was marked with bright plastic tape, so the dowsers knew exactly where to look. One leg of one of the pipes ran through a 26ft by 20ft tent, where the dowsers were during the test. Using the valve, the experimenters could control whether 400 gallons of water flowed from the source tank to the receiving tank through a pipe buried under the tent, or through another pipe that did not go under the tent. After each trial, the pump moved the water back to the source tank. The network of pipes was laid out roughly as sketched below.

[Figure: Water flows under or around the tent, according to the (random) valve setting. Two supply pipes run from the source tank to the receiving tank, one under the tent and one around it; return pipes carry the water back.]
At the start of each trial, the people operating the apparatus would draw a ping-pong ball from a bag. A symbol on the ball determined which way they would set the valve in a given trial—which pipe the water would flow through. The bag, the control room, etc., were not visible to the dowsers.

Why two paths for the water?

The reason for having two paths for the water was to control for the noises, vibrations, and other cues to the fact that water was flowing at all. That isolated the experimental condition or "treatment" as much as possible: the difference between treatment and control was whether water was flowing through the pipe under the tent, not whether water was flowing at all.


The Box Test

In the box test, each dowser could choose one of the following materials to try to locate:

iron, coal, gold, silver, copper, magnet.

In each trial, an object composed of the material the dowser chose was hidden in one of 10 opaque plastic boxes placed in a row on a bench. Which box it was hidden in was determined by drawing a ping-pong ball from a bag, as in the water test. The dowser was to determine which of the 10 boxes contained the object. The other boxes were empty. This test took place indoors for all but one of the subjects.
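If a dowser merely guesses, the chance of a hit in each trial is 1 in 10; this can be computed exactly (see the Analysis section below) or checked with a quick simulation. A minimal sketch in Python (the helper name box_test_hits is ours; the trial counts mirror the design of the test):

    import random

    random.seed(0)  # fixed seed so the illustration is reproducible

    def box_test_hits(n_trials=10, n_boxes=10):
        """Hits in n_trials when the hidden box and the guess are both chosen at random."""
        hits = 0
        for _ in range(n_trials):
            hidden = random.randrange(n_boxes)  # box picked by the ping-pong ball
            guess = random.randrange(n_boxes)   # a dowser guessing blindly
            hits += (guess == hidden)
        return hits

    results = [box_test_hits() for _ in range(10000)]
    print(sum(results) / (10 * len(results)))           # overall hit rate, close to 0.10
    print(sum(r >= 8 for r in results) / len(results))  # passing by luck: essentially never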

The Rules

To participate in the experiment, each dowser had to sign the following declarations:

Before the test:

1) I declare that I have been given sufficient information about the tests by the GWUP and by James Randi both verbally and in writing. In pre-trial runs, I had the opportunity to adjust myself to the conditions, and I feel physically and psychically able to succeed in the test under the given circumstances.

After the test:

2) I declare that the tests were conducted impeccably. The test conditions and the schedule have in no way impeded me during the tests.

To win the prize, a dowser had to get 25 correct answers out of 30 trials in the water experiment (83.3%), or 8 hits out of 10 trials in the box experiment (80%). All the dowsers claimed that they could "hit" in 90–100% of the cases, so by the rules of the test, they did not need to perform as well as they claimed they could to win the prize.

The Participants

Twenty-one dowsers said they would participate in the water test, but only 20 showed up. One of the 20 who showed up said the environment had too much radiation, so he could not possibly work under the circumstances. Thus 19 dowsers participated in the water test.

Fourteen dowsers agreed to participate in the box test; one of them (the same one who backed out of the water test because of the radiation) refused to work under the standard conditions of the experiment. That dowser did take the box test, but in a way that differed from everyone else: for that person, the test was conducted outside and involved 20 trials rather than 10.


The Results

The results were unremarkable and unspectacular.

Water Test

Protocol problems. Among all the trials in the water test, there were four mistakes in the valve setting (cases where the valve was not set in the direction the ping-pong ball specified). Three of those errors were discovered during the test; the fourth was discovered from the videotape of the test. In all four of those cases, the valve was set to "out" when it should have been set to "in."

Dowsers' success rates. Individual dowsers "hit" between 11 and 20 times out of 30. On average, they were right about 52% of the time (298 hits in the 570 trials).

The following is a frequency table of the dowsers' successes.

Frequency of successes in the Kassel water test

Successes (out of 30)   No. of dowsers
11                             1
14                             5
15                             5
16                             3
17                             1
18                             1
19                             1
20                             2
Total                         19

Box Test

Recall that one of the participants had 20 trials rather than 10. That dowser never "hit."

Among the 13 other participants in the box test, there were 14 hits in all (10.8%). Including the specially treated dowser decreases the hit rate to 9.3%. See the following table.

Frequency of successes in the Kassel box test

Successes (out of 10)   No. of dowsers
0                              5
1                              3
2                              6
Total                         14


None of the dowsers succeeded on either the water test or the box test according to the terms of the test. The prize was not awarded.

Analysis

We would like to know whether the results support the conclusion that dowsing works. The natural null hypothesis is that dowsing does not work. In that case, in the water test, the number of times a dowser guesses correctly which of the pipes the water is flowing through would be like the number of tickets labeled "1" one gets in 30 draws with replacement from a box of two tickets, one labeled "1" and one labeled "0." That has a binomial distribution with parameters n=30 and p=50%. Similarly, under the null hypothesis, the number of times a dowser succeeds in locating the hidden object is like the number of tickets labeled "1" one gets in 10 draws with replacement from a box of ten tickets of which one is labeled "1" and nine are labeled "0." That has a binomial distribution with parameters n=10 and p=10%.

Implicit in the rules is that the null hypothesis will be rejected if any of the dowsers gets the right answer 25 or more times out of 30 in the water test, or 8 times or more out of 10 in the box test.

We start by examining each dowser separately in each of the two tests. Because the drawing of the ping-pong balls presumably is independent from trial to trial and from dowser to dowser, under the null hypothesis the results for all the dowsers are independent. That independence makes the experiment straightforward to analyze.

Under the null hypothesis, what is the chance that a particular dowser correctly identifies which pipe the water is flowing through 25 or more times out of 30? The probability is so small that it would be indistinguishable from zero in a probability histogram of the binomial distribution.

The chance is 0.016%. Thus the significance level of the water test for a single dowser is 0.016%.

What about power? We need a specific alternative hypothesis to calculate the power. The dowsers all claimed to be able to "hit" in 90% to 100% of the cases, so let us take as the alternative hypothesis that the trials are independent with chance 90% of success in each trial. Then the number of successes is like the number of tickets labeled "1" in 30 draws with replacement from a box of ten tickets, nine labeled "1" and one labeled "0," which has a binomial distribution with parameters n=30 and p=90%. The chance of drawing 25 or more tickets labeled "1" in 30 independent draws from such a box is about 92.7%. Thus the power against the alternative hypothesis that dowsing increases to 90% the chance of successfully determining which pipe the water is flowing through is about 92.7%.

If the null hypothesis is true, the number of times in ten trials that a dowser correctly identifies which of the ten boxes contains the hidden substance has a binomial distribution with parameters n = 10 and p = 10%. The significance level of the test of a single dowser is the chance of 8 or more successes in 10 independent trials with probability 10% of success in each trial, which is about 0.00004%. That is the significance level of the box test of a single dowser. The power against the alternative hypothesis that the dowser's chance of correctly identifying the correct box is 90% is the chance of 8 or more successes in 10 independent trials with probability 90% of success in each trial, which is about 93%.
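These tail probabilities are easy to verify directly. A minimal sketch in Python, using only the standard library (the helper name binom_tail is ours, not part of the original analysis):

    from math import comb

    def binom_tail(n, p, k):
        """P(X >= k) for X binomial with n trials and success probability p."""
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    # Water test: pass with 25 or more hits in 30 trials.
    print(binom_tail(30, 0.5, 25))  # significance level: about 0.00016, i.e., 0.016%
    print(binom_tail(30, 0.9, 25))  # power against p = 90%: about 0.927

    # Box test: pass with 8 or more hits in 10 trials.
    print(binom_tail(10, 0.1, 8))   # significance level: about 3.7e-7, i.e., 0.00004%
    print(binom_tail(10, 0.9, 8))   # power against p = 90%: about 0.93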

We shall ignore the "special" dowser who refused to do the water test and took 20 trials in the box test. The chance that if dowsing does not work (i.e., if the null hypothesis is true), one or more of the dowsers would pass the water test and/or the box test is (using the independence of the tests)

100% − chance none passes = 100% − (99.984%)^19 × (99.99996%)^13

≈ 0.3%.

Thus even though the chance is about 0.016% that a given dowser with no ability will pass the water test, and about 0.00004% that a given dowser with no ability will pass the box test, the chance that if no dowser had any ability Randi still would have to pay the prize was about 0.3%, roughly 1 in 300: many times larger. This is an example of multiplicity in hypothesis testing: the chance of one or more Type I errors in a large number of tests can be much larger than the chance of a Type I error in each test.

Under the alternative hypothesis that dowsing increases the chance of "hitting" to 90% for every dowser (all are equally skillful) and that the results of different dowsers are independent, the power of the combined water and box tests (of 19 and 13 subjects, respectively) is

100% − chance none passes = 100% − (7.3%)^19 × (7%)^13

= 100% (to more than 30 decimal places).
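The combined probabilities can be checked the same way. A sketch in Python, self-contained by repeating our binom_tail helper:

    from math import comb

    def binom_tail(n, p, k):
        """P(X >= k) for X binomial with n trials and success probability p."""
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    a_water = binom_tail(30, 0.5, 25)  # chance a no-ability dowser passes the water test
    a_box = binom_tail(10, 0.1, 8)     # chance a no-ability dowser passes the box test

    # Chance that at least one of 19 water-test and 13 box-test dowsers passes by luck:
    print(1 - (1 - a_water)**19 * (1 - a_box)**13)  # about 0.003, i.e., 0.3%

    p_water = binom_tail(30, 0.9, 25)  # power of the water test for a single dowser
    p_box = binom_tail(10, 0.9, 8)     # power of the box test for a single dowser

    # Power of the combined tests if every dowser hits with chance 90%:
    print(1 - (1 - p_water)**19 * (1 - p_box)**13)  # prints 1.0 in floating point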

Summary

Determining whether a treatment has an effect is a ubiquitous problem in science. The best way to determine whether a treatment has an effect is to use the method of comparison: to compare the outcome for subjects who are treated (the treatment group) with the outcome for the control subjects who are not treated (the control group). For such a comparison to isolate the effect of the treatment, it is crucial that the treatment and control groups be as similar as possible but for the treatment. Effects of differences between the treatment and the control group other than the treatment cannot be distinguished from effects of the treatment: Differences between the treatment and control groups can confound with the treatment.

Comparisons using historical controls, subjects from the past who did not receive treatment, tend to suffer from confounding. The effect of age tends to be confounded with other variables in cross-sectional comparisons, which seek to study the effect of age by comparing individuals of different ages at the same time. Cross-sectional comparisons tend to suffer from confounding. Longitudinal comparisons study the effects of age by comparing individuals with themselves at different ages. Longitudinal comparisons are less prone to confounding than cross-sectional comparisons are, but they are more expensive and more time consuming, because the investigators must track the subjects over time. Simpson's Paradox is an illustration of confounding: What is true for the parts need not be true for the whole.

If the assignment of individuals to treatment is left to the subjects or to nature—the defining characteristic of an observational study—confounding tends to be a problem. However, in many situations there is no alternative, and observational studies sometimes provide compelling evidence that treatment has an effect, as John Snow's study of cholera shows. (In Snow's study, Nature mixed subjects between treatment and control in a way that mimicked a randomized, blind experiment: a natural experiment.) Even if the assignment to treatment is deliberate on the part of the investigator—the defining characteristic of an experiment—confounding can be a problem. Confounding caused by individual differences can be reduced by using the method of comparison with randomization to assign subjects to treatment or control. Randomized assignment tends to balance differences between the treatment and control groups, so that the overall difference between the outcomes for the two groups can be attributed more reliably to the treatment itself.

In experiments on human subjects, using a placebo can prevent the subjects from knowing whether they are in the treatment or the control group. This is important because the mere belief that one is being treated has an effect, called the placebo effect. Using a placebo makes the control and treatment groups alike with respect to the placebo effect. An experiment that uses a placebo to prevent subjects from knowing whether they are in the treatment or control group is called a blind experiment. If assessing the outcome for individual subjects involves an element of judgment, confounding tends to be reduced if the person assessing the outcome does not know which subjects are in the treatment group and which are in the control group. An experiment in which the subjects do not know whether they are in the treatment or control group, and in which the assessors do not know which subjects are in the treatment group and which are in the control group, is called a double-blind experiment. The best method for determining whether a treatment has an effect on human subjects is a randomized, controlled, double-blind experiment.

Key Terms