class: blueBack

## Causal Inference from Data

### Philip B. Stark

#### Department of Statistics, University of California, Berkeley

#### http://www.stat.berkeley.edu/~stark @philipbstark

### .white[Emerging Science for Environmental Health Decisions]

### Workshop on Advances in Causal Understanding of Human Health Risk-Based Decision Making

### National Academies of Sciences, Engineering, and Medicine

#### 6–7 March 2017, Washington, DC

---

## Abstract

What do we mean by "causality?" When are we justified in drawing causal conclusions from data? How much does experimental design matter? Are there algorithms that can draw reliable causal inferences from observational data?

---

## Types of Causation

+ Necessary

--

+ Sufficient

--

+ Contributory
  - probabilistic

---

## Causation and Hypothetical Counterfactuals—Hume

.framed.blue.center[But for the fact that X happened, Y would not have happened.]

--

+ Compare two scenarios, with and without the putative "cause" X.

--

+ How can you tell what would have happened, but for X?

---

## Probabilistic version

.framed.blue.center[But for the fact that X happened, Y would have been less likely to happen.]

--

+ Again, compare two scenarios, but much harder; repetition/replication implicit

--

+ `\( P \{ \mbox{X causes Y} \} \)` means something quite different

---

## Quantities of interest

1. if all subjects were assigned to control, what would average response be?

--

2. if all subjects were assigned to treatment, what would average response be?

--

3. (2) - (1)

---

## Randomized controlled trials

+ Gold standard for causal inference

--

+ Can rigorously quantify chance of error

--

+ Random `\(\ne\)` haphazard

--

+ With randomization, confounders tend to balance (approximately); reliable statistical inferences possible

---

## Neyman model for causal inference, binary treatment

Group of subjects, `\(j\)`th represented by a "ticket" with two numbers:

--

+ response if assigned to control: `\(c_j\)`

--

+ response if assigned to treatment: `\(t_j\)`

--

Assignment reveals exactly one of those responses.

--

+ Numbers on tickets fixed before assignment.

--

+ No assumption about distribution of the responses for any treatment.

--

+ No assumption about the nature of treatment effect (e.g., additive).

---

## Implicit: non-interference assumption

My response depends only on which treatment I get, and not on which treatment you get.

--

Can be unrealistic (e.g., vaccines for communicable diseases)

---

## Unbiased estimates

`\(\mathcal{C}\)`: indices of subjects assigned to control

`\(\mathcal{T}\)`: indices of subjects assigned to treatment

--

+ `\(\bar{C} \equiv \frac{1}{|\mathcal{C}|} \sum_{j \in \mathcal{C}} c_j\)` is unbiased estimate of mean response if all subjects assigned to control.

--

+ `\(\bar{T} \equiv \frac{1}{|\mathcal{T}|} \sum_{j \in \mathcal{T}} t_j\)` is unbiased estimate of mean response if all subjects assigned to treatment.

--

+ `\(\bar{T} - \bar{C}\)` is unbiased estimate of the difference (_ITT estimator_)

---

## Regression estimates

$$ Y_j = c + I_j \times \Delta + \epsilon_j, \;\;\;j=1, \ldots, N$$

with `\(I_j = 1\)` if subject `\(j\)` is treated and `\(I_j=0\)` if not

--

`\(c\)` is average response to control

--

`\(\Delta\)` is average increment to response from treatment

--

"Random errors" `\(\{\epsilon_j\}\)` assumed to be iid, zero-mean random variables, independent of `\(\{I_j\}\)`

--

Violates the actual design!

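---

## Sketch: the Neyman ticket model in code

A minimal numerical sketch (not from the talk; the potential outcomes below are invented) of the Neyman ticket model and the ITT estimator: each subject's pair `\((c_j, t_j)\)` is fixed before assignment, randomization reveals one entry per subject, and `\(\bar{T} - \bar{C}\)` estimates the average effect.

```python
# Sketch of the Neyman "ticket" model with invented potential outcomes.
# Each subject j has fixed (c_j, t_j); random assignment reveals exactly one of them.
import numpy as np

rng = np.random.default_rng(1)

N = 1000
c = rng.normal(50, 10, size=N)      # c_j: response if assigned to control (fixed "ticket" entries)
t = c + rng.normal(3, 5, size=N)    # t_j: response if assigned to treatment (effect need not be additive or constant)
true_avg_effect = (t - c).mean()    # quantity (3) on the "Quantities of interest" slide

# Randomize half the subjects to treatment, half to control.
treated = np.zeros(N, dtype=bool)
treated[rng.choice(N, size=N // 2, replace=False)] = True

T_bar = t[treated].mean()           # unbiased for mean response if all subjects were treated
C_bar = c[~treated].mean()          # unbiased for mean response if all subjects were controls

print(f"true average effect = {true_avg_effect:.2f}, ITT estimate = {T_bar - C_bar:.2f}")
```
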
---

## Inference

"Strong" null hypothesis:

+ Subject by subject, treatment makes no difference whatsoever

--

"Weak" null hypothesis:

+ On average (over the treated? over the population?), treatment makes no difference

---

## Example: Effect of treatment in a randomized controlled experiment

+ 11 pairs of rats, each pair from the same litter

+ Randomly—e.g., by coin tosses—put one of each pair into "enriched" environment; other sib gets "normal" environment.

+ After 65 days, measure cortical mass (mg)

| condition  | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
| ---------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| treatment  | 689 | 656 | 668 | 660 | 679 | 663 | 664 | 647 | 694 | 633 | 653 |
| control    | 657 | 623 | 652 | 654 | 658 | 646 | 600 | 640 | 605 | 635 | 642 |
| difference | 32  | 33  | 16  | 6   | 21  | 17  | 64  | 7   | 89  | -2  | 11  |

+ How should we analyze the data?

Cartoon of Rosenzweig et al. 1972, Bennett et al. 1969.

---

## Informal Hypotheses

+ Null hypothesis: treatment has "no effect"

+ Alternative hypothesis: enrichment increases cortical mass

+ Suggests 1-sided test for an increase

---

## Test contenders

+ 2-sample Student `\(t\)`-test:
$$ \frac{\mbox{mean(treatment) - mean(control)}} {\mbox{pooled estimate of SD of difference of means}} $$

--

+ 1-sample Student `\(t\)`-test on the differences:
$$ \frac{\mbox{mean(differences)}}{\mbox{SD(differences)}/\sqrt{10}} $$
Better, since littermates are presumably more homogeneous.

--

+ Permutation test using `\(t\)`-statistic of differences: same statistic, different way to calculate `\(P\)`-value

Even better?

---

## Strong null hypothesis

+ Treatment has no effect whatsoever—as if cortical mass were assigned to each rat before the randomization

Then equally likely that the rat with the heavier cortex will be assigned to treatment or to control, independently across littermate pairs

Gives `\(2^{11} = 2,048\)` equally likely possibilities:

| difference | `\(\pm\)`32 | `\(\pm\)`33 | `\(\pm\)`16 | `\(\pm\)`6 | `\(\pm\)`21 | `\(\pm\)`17 | `\(\pm\)`64 | `\(\pm\)`7 | `\(\pm\)`89 | `\(\pm\)`2 | `\(\pm\)`11 |
|---|---|---|---|---|---|---|---|---|---|---|---|

E.g., just as likely to observe original differences as

| difference | -32 | -33 | -16 | -6 | -21 | -17 | -64 | -7 | -89 | -2 | -11 |
|---|---|---|---|---|---|---|---|---|---|---|---|

---

## Weak null hypothesis

+ On average across pairs, treatment makes no difference

---

## Alternatives

+ Individual's response depends only on that individual's assignment

+ Special cases: shift, scale, etc.

+ Interactions/Interference: my response could depend on whether you are assigned to treatment or control

---

## Assumptions of the tests

+ 2-sample `\(t\)`-test:
  - masses are iid sample from normal distribution, same unknown variance, same unknown mean.
  - tests weak null hypothesis (plus normality, independence, non-interference, etc.)

--

+ 1-sample `\(t\)`-test on the differences:
  - mass differences are iid sample from normal distribution, unknown variance, zero mean.
  - tests weak null hypothesis (plus normality, independence, non-interference, etc.)

--

Both are .blue["cargo-cult" statistics]: mechanically plugging numbers into formulae that aren't connected to the experiment; `\(P\)`-values meaningless

--

+ Permutation test:
  - Randomization fair, independent across pairs
  - tests strong null hypothesis
  - assumptions true by fiat

---

## Student `\(t\)`-test calculations

`\(P\)`-value for 1-sided `\(t\)`-test: 0.0044

+ Why do cortical weights have normal distribution?

--

+ Why is variance of the difference between treatment and control the same for different litters?

--

+ Treatment and control are _dependent_ because assigning a rat to treatment excludes it from the control group, and vice versa.

--

+ Does the `\(P\)`-value depend on assuming differences are iid sample from a normal distribution? If we reject the null, is that because there is a treatment effect, or because the other assumptions are wrong?

---

## Permutation `\(t\)`-test calculations

+ Could enumerate all `\( 2^{11} = 2,048\)` equally likely possibilities, calculate `\(t\)`-statistic for each.

+ `\(P = \frac{\mbox{number of possibilities with } t \ge 3.093}{2,048}\)`

+ For more pairs, impractical to enumerate, but can simulate: `\(P \approx 0.0011 < 0.0044\)`

(The calculation is sketched in code a few slides below.)

---

## Still more tests, for other alternatives

Tests so far are sensitive to _shifts_—the alternative hypothesis is that treatment increases response (cortical mass).

There are also nonparametric tests that are sensitive to other treatment effects, e.g., treatment increases variability of the response.

And there are tests for whether treatment has any effect at all on the distribution of the responses.

Can design test statistic to be sensitive to any change that interests you, then use permutation distribution to get `\(P\)`-value (and simulation to approximate that `\(P\)`-value).

---

## Back to Rosenzweig et al.

+ Actually had 3 treatments: enriched, standard, deprived.

+ Randomized 3 rats per litter into the 3 treatments, independently across `\(n\)` litters.

---

## Test contenders

`\(n\)` litters, `\(s\)` treatments (sibs/litter)

+ ANOVA—the `\(F\)`-test:
$$ F = \frac{\mbox{BSS}/(s-1)}{\mbox{WSS}/(n-s)} $$

+ Regression with dummy variable for treatment

+ Permutation `\(F\)`-test

---

## Strong null hypothesis

+ Treatment has no effect whatsoever—as if cortical mass were assigned to each rat before the randomization

+ Then equally likely that each littermate is assigned to each treatment, independently across litters

+ `\(3! = 6\)` assignments of each triple to treatments. `\(6^n\)` equally likely assignments across all litters.

For 11 litters, 362,797,056 possibilities

---

## Weak null hypothesis

+ Average cortical weights for all three treatment groups are equal. On average across triples, treatment makes no difference.

---

## Assumptions of the tests

+ `\(F\)`-test:
  - masses are iid sample from normal distribution, same unknown variance, same unknown mean for all litters and treatments.
  - tests weak null hypothesis.

--

+ Regression with dummy variables:
  - effect is additive: same 'delta' for all
  - regression model is response schedule
  - errors iid, zero-mean

--

+ Permutation `\(F\)`-test:
  - randomization was as advertised: fair, independent across triples
  - tests strong null hypothesis.

---

## `\(F\)`-test assumptions—reasonable?

+ Why do cortical weights have normal distribution for each litter and for each treatment?

--

+ Why is the variance of cortical weights the same for different litters?

--

+ Why is the variance of cortical weights the same for different treatments?

---

## Is `\(F\)` a good statistic for this alternative?

+ `\(F\)` sensitive to differences among the mean responses for each treatment group, no matter what pattern the differences have.

But the treatments and the responses can be ordered: we hypothesize that more stimulation produces greater cortical mass.

--

deprived `\(\Longrightarrow\)` normal `\(\Longrightarrow\)` enriched

low mass `\(\Longrightarrow\)` medium mass `\(\Longrightarrow\)` high mass

--

+ Can we use that to make a more sensitive test?

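---

## Sketch: the permutation calculation in code

A minimal sketch (not the original analysis code) of the exact permutation calculation from the "Permutation `\(t\)`-test calculations" slide: under the strong null, each littermate difference is equally likely to have either sign, independently across pairs, so all `\(2^{11} = 2,048\)` sign assignments can be enumerated. The exact `\(P\)`-value should come out near the simulated `\(\approx 0.0011\)` above. (The SD convention in the `\(t\)`-statistic does not affect the permutation `\(P\)`-value: under sign flips, `\(t\)` is a monotone function of the mean difference.)

```python
# Exact permutation test for the paired rat data under the strong null hypothesis.
from itertools import product

import numpy as np

diffs = np.array([32, 33, 16, 6, 21, 17, 64, 7, 89, -2, 11])  # treatment - control, by litter

def t_stat(d):
    """One-sample t statistic for the differences."""
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

observed = t_stat(diffs)

# Enumerate all 2^11 = 2,048 equally likely sign assignments and count how often
# the statistic is at least as large as the observed one.
hits = sum(
    t_stat(np.array(signs) * diffs) >= observed
    for signs in product([1, -1], repeat=len(diffs))
)
print(f"exact permutation P-value = {hits / 2**len(diffs):.4f}")
```
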
---

## A test against an ordered alternative

+ Within each litter triple, count pairs of responses that are "in order." Sum across litters.

E.g., if one triple had cortical masses

| condition | mass |
| :-------- | ---: |
| deprived  | 640  |
| normal    | 660  |
| enriched  | 650  |

that would contribute 2 to the sum: `\(660 \ge 640\)`, `\(650 \ge 640\)`, but `\(650 < 660\)`.

--

Each litter triple contributes between 0 and 3 to the overall sum.

Null distribution for the test based on the permutation distribution: 6 equally likely assignments per litter, independent across litters.

---

## A different test against an ordered alternative

Within each litter triple, add differences that are "in order." Sum over triples.

E.g., if one triple had cortical masses

.center[

| condition | mass |
| :-------- | ---: |
| deprived  | 640  |
| normal    | 660  |
| enriched  | 650  |

]

that would contribute 30 to the sum: `\(660 - 640 = 20\)` and `\(650 - 640 = 10\)`, but `\(650 < 660\)`, so that pair contributes zero.

--

Each litter triple contributes between 0 and `\(2 \times \mbox{range}\)` to the sum.

Null distribution for the test based on the permutation distribution: 6 equally likely assignments per litter, independent across litters.

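---

## Sketch: the ordered-alternative statistic in code

A minimal sketch (illustrative only—the masses below are invented, not the Rosenzweig/Bennett data) of the "pairs in order" statistic and its permutation null: 6 equally likely orderings per litter, independent across litters.

```python
# "Pairs in order" statistic for 3 ordered treatments, with a simulated permutation null.
# The masses below are made up for illustration; rows are litters,
# columns are (deprived, normal, enriched).
import numpy as np

rng = np.random.default_rng(0)

masses = np.array([
    [640, 660, 650],
    [615, 625, 640],
    [655, 650, 670],
])

def pairs_in_order(triple):
    """Count pairs (lower stimulation, higher stimulation) whose responses are in order: 0 to 3."""
    d, n, e = triple
    return int(n >= d) + int(e >= d) + int(e >= n)

def statistic(m):
    return sum(pairs_in_order(row) for row in m)

observed = statistic(masses)

# Strong null: within each litter, the 3 masses are equally likely to land on the
# 3 treatments in any of the 3! = 6 orders, independently across litters.
sims = 100_000
hits = sum(
    statistic(np.array([rng.permutation(row) for row in masses])) >= observed
    for _ in range(sims)
)
print(f"observed statistic = {observed}, simulated P-value ~ {hits / sims:.3f}")
```

The second statistic (summing the in-order differences rather than counting them) can be handled the same way by swapping out `pairs_in_order`.
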
---

## Association is not causation

Even if it's really strong association

---

## Freedman's Rabbit-Hat Theorem

.framed.blue.center[To pull a rabbit out of a hat, at least one rabbit must first be placed in the hat.]

---

## Incomplete bestiary

+ regression

+ path models

+ simultaneous equation models

+ DAGs; e.g., TETRAD

---

## Path model

graph LR; A-->B; B-->C; C-->A; D-->C;

---

## DAG

graph LR; A-->C; B-->C; C-->E; D-->E; E-->F;
---

## Rabbits, Association, and Causation

TETRAD (Spirtes, Glymour, Scheines, 1996) involves several rabbits:

+ re-defines "causality"

+ Markov condition

+ faithfulness assumption

+ causal Markov condition

+ assumption that approximate sample independence is exact population independence

---

## Rabbits, Association, and Causation, contd.

Regression (including logistic regression)

+ response schedules

+ stochastic assumptions

+ Example: utility customers

---

## Epistemic leaps

+ Rates are not probabilities

+ Logit/probit models: what does "probability" mean?

+ Description `\(\ne\)` prediction `\(\ne\)` response to intervention

---

## Legal setting

+ General causation

+ Specific causation

---

## Standard for specific causation: relative risk `\(>2\)`

+ Compare rate of illness in "exposed" to rate in "unexposed."

+ If `\(\mbox{RR} >2\)`, conclude that "more likely than not" exposure caused given subject's illness.

+ Even given general causation, heterogeneity `\(\longrightarrow\)` confounding:
$$ \mbox{Ave }Pr(\mbox{illness caused by exposure} | \mbox{illness and exposure}) $$
can be as small as the _difference_ in risk between exposed & unexposed.

---

## Snow (1855) on how cholera is communicated

Following cholera outbreak in 1854, mapped residences of victims. Concentrated near Broad Street public water pump in Soho.

A few buildings in the area were relatively unaffected by cholera; their water suppliers were different (a brewery and a poorhouse, both of which had their own pumps).

Snow showed that most cholera victims in other parts of London had drunk from Broad Street pump.

---

At the time, several water companies in London. Companies drew their water from different parts of the Thames, and treated the water differently. Cholera more prevalent in buildings served by water companies who drew their water from dirty parts of the river, with the exception of one company, which purified its water effectively.

Lambeth company started drawing its water further upstream in 1852. Snow compared rates of cholera in the 1853–1854 epidemics with earlier epidemics, when Lambeth drew its water further downstream, along with one of its competitors, the Southwark and Vauxhall company.

Which buildings were served by which water company was largely accidental: not much difference that could account for the differences in the rate of cholera.

---

"Although the above facts shown in the table above afford very strong evidence of the powerful influence which the drinking of water containing the sewage of a town exerts over the spread of cholera, when that disease is present, yet the question does not end here; for the intermixing of the water supply of the Southwark and Vauxhall Company with that of the Lambeth Company, over an extensive part of London, admitted of the subject being sifted in such a way as to yield the most incontrovertible proof on one side or the other. In the subdistricts enumerated in the above table as being supplied by both Companies, the mixing of the supply is of the most intimate kind. The pipes of each company go down all the streets, and into nearly all the courts and alleys. A few houses are supplied by one Company and a few by the other, according to the decision of the owner or occupier at that time when the Water Companies were in active competition. In many cases a single house has a supply different from that on either side.
Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies. Now it must be evident that, if the diminution of cholera, in the districts partly supplied with improved water, depended on this supply, the houses receiving it would be the houses enjoying the whole benefit of the diminutions of the malady, whilst the houses supplied by the water from Battersea Fields would suffer the same mortality as they would if the improved supply did not exist at all."

---

"As there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded, it is obvious that no experiment could have been devised which would more thoroughly test the effect of water supply on the progress of cholera than this, which circumstances placed ready made before the observer. The experiment, too, was on the grandest scale. No fewer than three hundred thousand people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into groups without their choice, and in most cases, without their knowledge; one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients; the other group having water quite free from such impurity. To turn this grand experiment to account, all that was required was to learn the supply of water to each individual house where a fatal attack of cholera might occur."

---

Cholera deaths, London epidemic of 1853–1854 (Snow, Table IX):

| water supplier         |  houses | deaths from cholera | deaths per 10,000 houses |
| :--------------------- | ------: | ------------------: | -----------------------: |
| Southwark and Vauxhall |  40,046 |               1,263 |                      315 |
| Lambeth                |  26,107 |                  98 |                       37 |
| rest of London         | 256,423 |               1,422 |                       59 |

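---

## Sketch: rates from Snow's Table IX

A small sketch (not from the talk) that recomputes the death rates per 10,000 houses from the counts in Snow's Table IX above, and their ratio for the two companies serving the mixed districts.

```python
# Death rates per 10,000 houses from Snow's Table IX, and the ratio of the two companies' rates.
deaths = {"Southwark and Vauxhall": 1_263, "Lambeth": 98}
houses = {"Southwark and Vauxhall": 40_046, "Lambeth": 26_107}

rates = {k: 10_000 * deaths[k] / houses[k] for k in deaths}
for company, rate in rates.items():
    print(f"{company}: {rate:.1f} deaths per 10,000 houses")

# Southwark and Vauxhall customers died of cholera at several times the Lambeth rate.
print(f"rate ratio: {rates['Southwark and Vauxhall'] / rates['Lambeth']:.1f}")
```
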
---

## Conclusions

+ In observational studies, confounding is the rule, not the exception

+ Reliable causal inference requires good experimental design
  - Randomized experiments are the gold standard
  - Experiments should be analyzed with methods that honor the actual design (not regression)
  - Sophisticated analyses cannot compensate for bad experimental design

+ Better to think of algorithmic results as conjectures to be verified
  - Rarely, a "natural experiment" is as good as a randomized experiment

+ To pull a (causal) rabbit from a (data) hat, the rabbit must first be placed in the hat. Look for how the rabbit enters the hat.

+ Relative risk is not the same as probability of specific causation

.center.blue.framed.large[Guess, but verify.]

---

## Reading list

D.A. Freedman, 2008. On types of scientific enquiry: Nine success stories in medical research, _The Oxford Handbook of Political Methodology_, 300–318. Janet M. Box-Steffensmeier, Henry E. Brady and David Collier, editors.

D.A. Freedman, 2006. Statistical models for causation: What inferential leverage do they provide?, _Evaluation Review_, _30_, 691–713.

D.A. Freedman and P. Humphreys, 1999. Are there algorithms that discover causal structure?, _Synthese_, _121_, 29–54.

D.A. Freedman, 1999. From association to causation: Some remarks on the history of statistics, _Statistical Science_, _14_, 243–258.

D.A. Freedman and P.B. Stark, 2001. The Swine Flu Vaccine and Guillain-Barré Syndrome: A Case Study in Relative Risk and Specific Causation, _Law and Contemporary Problems_, _64_, 49–64. http://scholarship.law.duke.edu/lcp/vol64/iss4/3

Snow, J., 1855. _On the Mode of Communication of Cholera_. Churchill, London (reprinted 1965 by Hafner, N.Y.)