class: blueBack

## Causal Inference from Data

### Philip B. Stark

#### Department of Statistics, University of California, Berkeley

#### http://www.stat.berkeley.edu/~stark @philipbstark

### .white[Emerging Science for Environmental Health Decisions]

### Workshop on Advances in Causal Understanding of Human Health Risk-Based Decision Making

### National Academies of Sciences, Engineering, and Medicine

#### 6–7 March 2017, Washington, DC

---

## Abstract

What do we mean by "causality?" When are we justified in drawing causal conclusions from data? How much does experimental design matter? Are there algorithms that can draw reliable causal inferences from observational data?

---

## Types of Causation

+ Necessary

--

+ Sufficient

--

+ Contributory
  - probabilistic

---

## Causation and Hypothetical Counterfactuals—Hume

.framed.blue.center[But for the fact that X happened, Y would not have happened.]

--

+ Compare two scenarios, with and without the putative "cause" X.

--

+ How can you tell what would have happened, but for X?

---

## Probabilistic version

.framed.blue.center[But for the fact that X happened, Y would have been less likely to happen.]

--

+ Again, compare two scenarios, but much harder; repetition/replication implicit

--

+ `\( P \{ \mbox{X causes Y} \} \)` means something quite different

---

## Quantities of interest

1. if all subjects were assigned to control, what would average response be?

--

2. if all subjects were assigned to treatment, what would average response be?

--

3. (2) - (1)

---

## Randomized controlled trials

+ Gold standard for causal inference

--

+ Can rigorously quantify chance of error

--

+ Random `\(\ne\)` haphazard

--

+ With randomization, confounders tend to balance (approximately); reliable statistical inferences possible

---

## Neyman model for causal inference, binary treatment

Group of subjects, `\(j\)`th represented by a "ticket" with two numbers:

--

+ response if assigned to control: `\(c_j\)`

--

+ response if assigned to treatment: `\(t_j\)`

--

Assignment reveals exactly one of those responses.

--

+ Numbers on tickets fixed before assignment.

--

+ No assumption about distribution of the responses for any treatment.

--

+ No assumption about the nature of treatment effect (e.g., additive).

---

## Implicit: non-interference assumption

My response depends only on which treatment I get, and not on which treatment you get.

--

Can be unrealistic (e.g., vaccines for communicable diseases)

---

## Unbiased estimates

`\(\mathcal{C}\)`: indices of subjects assigned to control

`\(\mathcal{T}\)`: indices of subjects assigned to treatment

--

+ `\(\bar{C} \equiv \frac{1}{|\mathcal{C}|} \sum_{j \in \mathcal{C}} c_j\)` is unbiased estimate of mean response if all subjects assigned to control.

--

+ `\(\bar{T} \equiv \frac{1}{|\mathcal{T}|} \sum_{j \in \mathcal{T}} t_j\)` is unbiased estimate of mean response if all subjects assigned to treatment.

--

+ `\(\bar{T} - \bar{C}\)` is unbiased estimate of the difference (_ITT estimator_)

---

## Regression estimates

$$ Y_j = c + I_j \times \Delta + \epsilon_j, \;\;\;j=1, \ldots, N$$

with `\(I_j = 1\)` if subject `\(j\)` is treated and `\(I_j=0\)` if not

--

`\(c\)` is average response to control

--

`\(\Delta\)` is average increment to response from treatment

--

"Random errors" `\(\{\epsilon_j\}\)` assumed to be iid, zero-mean random variables, independent of `\(\{I_j\}\)`

--

Violates the actual design!

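---

## Sketch: the Neyman ticket model in code

A minimal numerical sketch (not from the talk; the potential outcomes below are invented) of the Neyman ticket model and the ITT estimator: each subject's pair `\((c_j, t_j)\)` is fixed before assignment, randomization reveals one entry per subject, and `\(\bar{T} - \bar{C}\)` estimates the average effect.

```python
# Sketch of the Neyman "ticket" model with invented potential outcomes.
# Each subject j has fixed (c_j, t_j); random assignment reveals exactly one of them.
import numpy as np

rng = np.random.default_rng(1)

N = 1000
c = rng.normal(50, 10, size=N)      # c_j: response if assigned to control (fixed "ticket" entries)
t = c + rng.normal(3, 5, size=N)    # t_j: response if assigned to treatment (effect need not be additive or constant)
true_avg_effect = (t - c).mean()    # quantity (3) on the "Quantities of interest" slide

# Randomize half the subjects to treatment, half to control.
treated = np.zeros(N, dtype=bool)
treated[rng.choice(N, size=N // 2, replace=False)] = True

T_bar = t[treated].mean()           # unbiased for mean response if all subjects were treated
C_bar = c[~treated].mean()          # unbiased for mean response if all subjects were controls

print(f"true average effect = {true_avg_effect:.2f}, ITT estimate = {T_bar - C_bar:.2f}")
```
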
---

## Inference

"Strong" null hypothesis:

+ Subject by subject, treatment makes no difference whatsoever

--

"Weak" null hypothesis:

+ On average (over the treated? over the population?), treatment makes no difference

---

## Example: Effect of treatment in a randomized controlled experiment

+ 11 pairs of rats, each pair from the same litter

+ Randomly—e.g., by coin tosses—put one of each pair into "enriched" environment; other sib gets "normal" environment.

+ After 65 days, measure cortical mass (mg)

| condition  | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
| ---------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| treatment  | 689 | 656 | 668 | 660 | 679 | 663 | 664 | 647 | 694 | 633 | 653 |
| control    | 657 | 623 | 652 | 654 | 658 | 646 | 600 | 640 | 605 | 635 | 642 |
| difference | 32  | 33  | 16  | 6   | 21  | 17  | 64  | 7   | 89  | -2  | 11  |

+ How should we analyze the data?

Cartoon of Rosenzweig et al. 1972, Bennett et al. 1969.

---

## Informal Hypotheses

+ Null hypothesis: treatment has "no effect"

+ Alternative hypothesis: enrichment increases cortical mass

+ Suggests 1-sided test for an increase

---

## Test contenders

+ 2-sample Student `\(t\)`-test:
$$ \frac{\mbox{mean(treatment) - mean(control)}} {\mbox{pooled estimate of SD of difference of means}} $$

--

+ 1-sample Student `\(t\)`-test on the differences:
$$ \frac{\mbox{mean(differences)}}{\mbox{SD(differences)}/\sqrt{10}} $$
Better, since littermates are presumably more homogeneous.

--

+ Permutation test using `\(t\)`-statistic of differences: same statistic, different way to calculate `\(P\)`-value

Even better?

---

## Strong null hypothesis

+ Treatment has no effect whatsoever—as if cortical mass were assigned to each rat before the randomization

Then equally likely that the rat with the heavier cortex will be assigned to treatment or to control, independently across littermate pairs

Gives `\(2^{11} = 2,048\)` equally likely possibilities:

| difference | `\(\pm\)`32 | `\(\pm\)`33 | `\(\pm\)`16 | `\(\pm\)`6 | `\(\pm\)`21 | `\(\pm\)`17 | `\(\pm\)`64 | `\(\pm\)`7 | `\(\pm\)`89 | `\(\pm\)`2 | `\(\pm\)`11 |
|---|---|---|---|---|---|---|---|---|---|---|---|

E.g., just as likely to observe original differences as

| difference | -32 | -33 | -16 | -6 | -21 | -17 | -64 | -7 | -89 | -2 | -11 |
|---|---|---|---|---|---|---|---|---|---|---|---|

---

## Weak null hypothesis

+ On average across pairs, treatment makes no difference

---

## Alternatives

+ Individual's response depends only on that individual's assignment

+ Special cases: shift, scale, etc.

+ Interactions/Interference: my response could depend on whether you are assigned to treatment or control

---

## Assumptions of the tests

+ 2-sample `\(t\)`-test:
  - masses are iid sample from normal distribution, same unknown variance, same unknown mean.
  - tests weak null hypothesis (plus normality, independence, non-interference, etc.)

--

+ 1-sample `\(t\)`-test on the differences:
  - mass differences are iid sample from normal distribution, unknown variance, zero mean.
  - tests weak null hypothesis (plus normality, independence, non-interference, etc.)

--

Both are .blue["cargo-cult" statistics]: mechanically plugging numbers into formulae that aren't connected to the experiment; `\(P\)`-values meaningless

--

+ Permutation test:
  - Randomization fair, independent across pairs
  - tests strong null hypothesis
  - assumptions true by fiat

---

## Student `\(t\)`-test calculations

`\(P\)`-value for 1-sided `\(t\)`-test: 0.0044

+ Why do cortical weights have normal distribution?

--

+ Why is variance of the difference between treatment and control the same for different litters?

--

+ Treatment and control are _dependent_ because assigning a rat to treatment excludes it from the control group, and vice versa.

--

+ Does the `\(P\)`-value depend on assuming differences are iid sample from a normal distribution? If we reject the null, is that because there is a treatment effect, or because the other assumptions are wrong?

---

## Permutation `\(t\)`-test calculations

+ Could enumerate all `\( 2^{11} = 2,048\)` equally likely possibilities, calculate `\(t\)`-statistic for each.

+ `\(P = \frac{\mbox{number of possibilities with } t \ge 3.093}{2,048}\)`

+ For more pairs, impractical to enumerate, but can simulate: `\(P \approx 0.0011 < 0.0044\)`

(The calculation is sketched in code a few slides below.)

---

## Still more tests, for other alternatives

Tests so far are sensitive to _shifts_—the alternative hypothesis is that treatment increases response (cortical mass).

There are also nonparametric tests that are sensitive to other treatment effects, e.g., treatment increases variability of the response.

And there are tests for whether treatment has any effect at all on the distribution of the responses.

Can design test statistic to be sensitive to any change that interests you, then use permutation distribution to get `\(P\)`-value (and simulation to approximate that `\(P\)`-value).

---

## Back to Rosenzweig et al.

+ Actually had 3 treatments: enriched, standard, deprived.

+ Randomized 3 rats per litter into the 3 treatments, independently across `\(n\)` litters.

---

## Test contenders

`\(n\)` litters, `\(s\)` treatments (sibs/litter)

+ ANOVA—the `\(F\)`-test:
$$ F = \frac{\mbox{BSS}/(s-1)}{\mbox{WSS}/(n-s)} $$

+ Regression with dummy variable for treatment

+ Permutation `\(F\)`-test

---

## Strong null hypothesis

+ Treatment has no effect whatsoever—as if cortical mass were assigned to each rat before the randomization

+ Then equally likely that each littermate is assigned to each treatment, independently across litters

+ `\(3! = 6\)` assignments of each triple to treatments. `\(6^n\)` equally likely assignments across all litters.

For 11 litters, 362,797,056 possibilities

---

## Weak null hypothesis

+ Average cortical weights for all three treatment groups are equal. On average across triples, treatment makes no difference.

---

## Assumptions of the tests

+ `\(F\)`-test:
  - masses are iid sample from normal distribution, same unknown variance, same unknown mean for all litters and treatments.
  - tests weak null hypothesis.

--

+ Regression with dummy variables:
  - effect is additive: same 'delta' for all
  - regression model is response schedule
  - errors iid, zero-mean

--

+ Permutation `\(F\)`-test:
  - randomization was as advertised: fair, independent across triples
  - tests strong null hypothesis.

---

## `\(F\)`-test assumptions—reasonable?

+ Why do cortical weights have normal distribution for each litter and for each treatment?

--

+ Why is the variance of cortical weights the same for different litters?

--

+ Why is the variance of cortical weights the same for different treatments?

---

## Is `\(F\)` a good statistic for this alternative?

+ `\(F\)` sensitive to differences among the mean responses for each treatment group, no matter what pattern the differences have.

But the treatments and the responses can be ordered: we hypothesize that more stimulation produces greater cortical mass.

--

deprived `\(\Longrightarrow\)` normal `\(\Longrightarrow\)` enriched

low mass `\(\Longrightarrow\)` medium mass `\(\Longrightarrow\)` high mass

--

+ Can we use that to make a more sensitive test?

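---

## Sketch: the permutation calculation in code

A minimal sketch (not the original analysis code) of the exact permutation calculation from the "Permutation `\(t\)`-test calculations" slide: under the strong null, each littermate difference is equally likely to have either sign, independently across pairs, so all `\(2^{11} = 2,048\)` sign assignments can be enumerated. The exact `\(P\)`-value should come out near the simulated `\(\approx 0.0011\)` above. (The SD convention in the `\(t\)`-statistic does not affect the permutation `\(P\)`-value: under sign flips, `\(t\)` is a monotone function of the mean difference.)

```python
# Exact permutation test for the paired rat data under the strong null hypothesis.
from itertools import product

import numpy as np

diffs = np.array([32, 33, 16, 6, 21, 17, 64, 7, 89, -2, 11])  # treatment - control, by litter

def t_stat(d):
    """One-sample t statistic for the differences."""
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

observed = t_stat(diffs)

# Enumerate all 2^11 = 2,048 equally likely sign assignments and count how often
# the statistic is at least as large as the observed one.
hits = sum(
    t_stat(np.array(signs) * diffs) >= observed
    for signs in product([1, -1], repeat=len(diffs))
)
print(f"exact permutation P-value = {hits / 2**len(diffs):.4f}")
```
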
---

## A test against an ordered alternative

+ Within each litter triple, count pairs of responses that are "in order." Sum across litters.

E.g., if one triple had cortical masses

| condition | mass |
| :-------- | ---: |
| deprived  | 640  |
| normal    | 660  |
| enriched  | 650  |

that would contribute 2 to the sum: `\(660 \ge 640\)`, `\(650 \ge 640\)`, but `\(650 < 660\)`.

--

Each litter triple contributes between 0 and 3 to the overall sum.

Null distribution for the test based on the permutation distribution: 6 equally likely assignments per litter, independent across litters.

---

## A different test against an ordered alternative

Within each litter triple, add differences that are "in order." Sum over triples.

E.g., if one triple had cortical masses

.center[

| condition | mass |
| :-------- | ---: |
| deprived  | 640  |
| normal    | 660  |
| enriched  | 650  |

]

that would contribute 30 to the sum: `\(660 - 640 = 20\)` and `\(650 - 640 = 10\)`, but `\(650 < 660\)`, so that pair contributes zero.

--

Each litter triple contributes between 0 and `\(2 \times \mbox{range}\)` to the sum.

Null distribution for the test based on the permutation distribution: 6 equally likely assignments per litter, independent across litters.

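---

## Sketch: the ordered-alternative statistic in code

A minimal sketch (illustrative only—the masses below are invented, not the Rosenzweig/Bennett data) of the "pairs in order" statistic and its permutation null: 6 equally likely orderings per litter, independent across litters.

```python
# "Pairs in order" statistic for 3 ordered treatments, with a simulated permutation null.
# The masses below are made up for illustration; rows are litters,
# columns are (deprived, normal, enriched).
import numpy as np

rng = np.random.default_rng(0)

masses = np.array([
    [640, 660, 650],
    [615, 625, 640],
    [655, 650, 670],
])

def pairs_in_order(triple):
    """Count pairs (lower stimulation, higher stimulation) whose responses are in order: 0 to 3."""
    d, n, e = triple
    return int(n >= d) + int(e >= d) + int(e >= n)

def statistic(m):
    return sum(pairs_in_order(row) for row in m)

observed = statistic(masses)

# Strong null: within each litter, the 3 masses are equally likely to land on the
# 3 treatments in any of the 3! = 6 orders, independently across litters.
sims = 100_000
hits = sum(
    statistic(np.array([rng.permutation(row) for row in masses])) >= observed
    for _ in range(sims)
)
print(f"observed statistic = {observed}, simulated P-value ~ {hits / sims:.3f}")
```

The second statistic (summing the in-order differences rather than counting them) can be handled the same way by swapping out `pairs_in_order`.
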
---

## Association is not causation

Even if it's really strong association

---

## Freedman's Rabbit-Hat Theorem

.framed.blue.center[To pull a rabbit out of a hat, at least one rabbit must first be placed in the hat.]

---

## Incomplete bestiary

+ regression

+ path models

+ simultaneous equation models

+ DAGs; e.g., TETRAD

---

## Path model

graph LR; A-->B; B-->C; C-->A; D-->C;

---

## DAG

graph LR; A-->C; B-->C; C-->E; D-->E; E-->F;
---

## Rabbits, Association, and Causation

TETRAD (Spirtes, Glymour, Scheines, 1996) involves several rabbits:

+ re-defines "causality"

+ Markov condition

+ faithfulness assumption

+ causal Markov condition

+ assumption that approximate sample independence is exact population independence

---

## Rabbits, Association, and Causation, contd.

Regression (including logistic regression)

+ response schedules

+ stochastic assumptions

+ Example: utility customers

---

## Epistemic leaps

+ Rates are not probabilities

+ Logit/probit models: what does "probability" mean?

+ Description `\(\ne\)` prediction `\(\ne\)` response to intervention

---

## Legal setting

+ General causation

+ Specific causation

---

## Standard for specific causation: relative risk `\(>2\)`

+ Compare rate of illness in "exposed" to rate in "unexposed."

+ If `\(\mbox{RR} >2\)`, conclude that "more likely than not" exposure caused given subject's illness.

+ Even given general causation, heterogeneity `\(\longrightarrow\)` confounding:
$$ \mbox{Ave }Pr(\mbox{illness caused by exposure} | \mbox{illness and exposure}) $$
can be as small as the _difference_ in risk between exposed & unexposed.

---

## Snow (1855) on how cholera is communicated

Following cholera outbreak in 1854, mapped residences of victims. Concentrated near Broad Street public water pump in Soho.

A few buildings in the area were relatively unaffected by cholera; their water suppliers were different (a brewery and a poorhouse, both of which had their own pumps).

Snow showed that most cholera victims in other parts of London had drunk from Broad Street pump.

---

At the time, several water companies in London. Companies drew their water from different parts of the Thames, and treated the water differently. Cholera more prevalent in buildings served by water companies who drew their water from dirty parts of the river, with the exception of one company, which purified its water effectively.

Lambeth company started drawing its water further upstream in 1852. Snow compared rates of cholera in the 1853–1854 epidemics with earlier epidemics, when Lambeth drew its water further downstream, along with one of its competitors, the Southwark and Vauxhall company.

Which buildings were served by which water company was largely accidental: not much difference that could account for the differences in the rate of cholera.

---

"Although the above facts shown in the table above afford very strong evidence of the powerful influence which the drinking of water containing the sewage of a town exerts over the spread of cholera, when that disease is present, yet the question does not end here; for the intermixing of the water supply of the Southwark and Vauxhall Company with that of the Lambeth Company, over an extensive part of London, admitted of the subject being sifted in such a way as to yield the most incontrovertible proof on one side or the other. In the subdistricts enumerated in the above table as being supplied by both Companies, the mixing of the supply is of the most intimate kind. The pipes of each company go down all the streets, and into nearly all the courts and alleys. A few houses are supplied by one Company and a few by the other, according to the decision of the owner or occupier at that time when the Water Companies were in active competition. In many cases a single house has a supply different from that on either side.
Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies. Now it must be evident that, if the diminution of cholera, in the districts partly supplied with improved water, depended on this supply, the houses receiving it would be the houses enjoying the whole benefit of the diminutions of the malady, whilst the houses supplied by the water from Battersea Fields would suffer the same mortality as they would if the improved supply did not exist at all."

---

"As there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded, it is obvious that no experiment could have been devised which would more thoroughly test the effect of water supply on the progress of cholera than this, which circumstances placed ready made before the observer. The experiment, too, was on the grandest scale. No fewer than three hundred thousand people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into groups without their choice, and in most cases, without their knowledge; one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients; the other group having water quite free from such impurity. To turn this grand experiment to account, all that was required was to learn the supply of water to each individual house where a fatal attack of cholera might occur."

---

Cholera deaths, London epidemic of 1853–1854 (Snow, Table IX):

| water supplier         |  houses | deaths from cholera | deaths per 10,000 houses |
| :--------------------- | ------: | ------------------: | -----------------------: |
| Southwark and Vauxhall |  40,046 |               1,263 |                      315 |
| Lambeth                |  26,107 |                  98 |                       37 |
| rest of London         | 256,423 |               1,422 |                       59 |

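---

## Sketch: rates from Snow's Table IX

A small sketch (not from the talk) that recomputes the death rates per 10,000 houses from the counts in Snow's Table IX above, and their ratio for the two companies serving the mixed districts.

```python
# Death rates per 10,000 houses from Snow's Table IX, and the ratio of the two companies' rates.
deaths = {"Southwark and Vauxhall": 1_263, "Lambeth": 98}
houses = {"Southwark and Vauxhall": 40_046, "Lambeth": 26_107}

rates = {k: 10_000 * deaths[k] / houses[k] for k in deaths}
for company, rate in rates.items():
    print(f"{company}: {rate:.1f} deaths per 10,000 houses")

# Southwark and Vauxhall customers died of cholera at several times the Lambeth rate.
print(f"rate ratio: {rates['Southwark and Vauxhall'] / rates['Lambeth']:.1f}")
```
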
---

## Conclusions

+ In observational studies, confounding is the rule, not the exception

+ Reliable causal inference requires good experimental design
  - Randomized experiments are the gold standard
  - Experiments should be analyzed with methods that honor the actual design (not regression)
  - Sophisticated analyses cannot compensate for bad experimental design

+ Better to think of algorithmic results as conjectures to be verified
  - Rarely, a "natural experiment" is as good as a randomized experiment

+ To pull a (causal) rabbit from a (data) hat, the rabbit must first be placed in the hat. Look for how the rabbit enters the hat.

+ Relative risk is not the same as probability of specific causation

.center.blue.framed.large[Guess, but verify.]

---

## Reading list

D.A. Freedman, 2008. On types of scientific enquiry: Nine success stories in medical research, _The Oxford Handbook of Political Methodology_, 300–318. Janet M. Box-Steffensmeier, Henry E. Brady and David Collier, editors.

D.A. Freedman, 2006. Statistical models for causation: What inferential leverage do they provide?, _Evaluation Review_, _30_, 691–713.

D.A. Freedman and P. Humphreys, 1999. Are there algorithms that discover causal structure?, _Synthese_, _121_, 29–54.

D.A. Freedman, 1999. From association to causation: Some remarks on the history of statistics, _Statistical Science_, _14_, 243–258.

D.A. Freedman and P.B. Stark, 2001. The Swine Flu Vaccine and Guillain-Barré Syndrome: A Case Study in Relative Risk and Specific Causation, _Law and Contemporary Problems_, _64_, 49–64. http://scholarship.law.duke.edu/lcp/vol64/iss4/3

Snow, J., 1855. _On the Mode of Communication of Cholera_. Churchill, London (reprinted 1965 by Hafner, N.Y.)