class:textShadow
#### Philip B. Stark, www.stat.berkeley.edu/~stark
Department of Statistics, University of California, Berkeley

#### Significant Digits: Responsible Use of Quantitative Information
European Commission Joint Research Centre, Brussels, 9–10 June 2015

---

### Pay No Attention to the Model Behind the Curtain

### Abstract

Watch me pull a probability out of my model ... Presto!

Typical attempts to quantify risk for policy makers involve inventing a stochastic model for a phenomenon; fitting some parameters in that model to data; then declaring that features of the fitted model, called "probabilities" within the model, magically apply to the real world. Pulling this probability rabbit from the analyst's hat generally involves several statistical and philosophical sleights of hand: confusing the map (the model) with the territory (the phenomenon), confusing rates with probabilities, and distracting attention from the moment that probability entered the hat (i.e., the moment the stochastic model was assumed to have generated the data). Bedazzling the onlookers with a sparkly array of Greek symbols, heroic high-performance computing, and superficial attempts to quantify the uncertainty renders the show all the more dramatic.

---

### Abstract 2: Preproducibility

"Reproducibility" and "replicability" have orthogonal or even contradictory meanings across disciplines. For instance, sometimes they mean that an experimental or computational result is repeatable, and sometimes only that there are enough "breadcrumbs" to attempt to repeat the undertaking. This amounts to confusing whether a process can be audited with whether the process passed an audit. The neologism "preproducibility" might help distinguish these. Preproducibility is a prerequisite for attempting to reproduce a result: it involves providing an adequate description of an experiment or analysis for the work to be re-undertaken. It requires documentation, openness, and communication. "Reproducibility" then can be reserved for whether re-doing the experiment or analysis as described yields results that are comparable to the original results--within a level of variability appropriate to the nature of the undertaking (computational results are expected to be essentially identical; physical and biological experiments are expected to have greater variability from measurement error and intrinsic heterogeneity). Some failures to reproduce are failures of preproducibility (the description of the undertaking was inadequate or incorrect); some are methodological failures (e.g., statistical methods were misused or abused); and some represent generalization failures (e.g., the system is intrinsically variable, or the effect is not large or reliable). I will sketch the connection between preproducibility, evidence, and trust. I will discuss my attempts to be more preproducible in research, collaboration, teaching, and publication, including software tools (e.g., IPython, git, issue trackers), practices (e.g., scripting analyses, revision control, documenting code), and publication choices (open access, publishing code, etc.). Science could be improved by adopting tools and practices that are common in software engineering.

---

### .blue[Quantifauxcation]

.framed[
Assign a meaningless number, then pretend that since it's quantitative, it's meaningful.
]

--

Claim: _most_ "probabilities" in policy and most cost-benefit analyses are quantifauxcation.

Usually involves some combination of data, pure invention, ad hoc models, inappropriate statistics, and logical lacunae.

---

### .blue[Cost-Benefit analyses]

Widely touted as the only rational basis for decisions: must quantify costs/risks/benefits.
But if there's no rational basis for quantitative inputs, can it be rational to insist on the analysis?

Not all "costs" can be put on a common scale. Some are incommensurable.

Multidimensional scales cannot always be well ordered.

The cost of most policy cost-benefit analyses is high: lost rationality.

---

### .blue[Risk = probability × consequences?]

Interesting slogan, but:

* What if "probability" doesn't apply to the phenomenon? (more below)
* What if consequences cannot be quantified on a one-dimensional scale?

Insisting on quantifying risk and on quantitative cost-benefit analyses requires putting a price on human life, on biodiversity, on relics, …

How do you incorporate uncertainty in probability (if it applies at all) and uncertainty in consequences?

---

### .blue[What is Probability?]

#### Axiomatic aspect and philosophical aspect.

* Kolmogorov's axioms:
    - "just math"
    - triple `\( (S, \Omega, P)\)`
        + `\(S\)` a set
        + `\(\Omega\)` a sigma-algebra on `\(S\)`
        + `\(P\)` a non-negative countably additive measure with total mass 1

--

* Philosophical theory that ties the math to the world
    - What does probability _mean_?
    - Standard theories
        + Equally likely outcomes
        + Frequency theory
        + Subjective theory
    - Probability models as empirical commitments
    - Probability as metaphor

---

### .blue[How does probability enter a scientific problem?]

* underlying phenomenon is random (radioactive decay?)
* deliberate randomization (randomized experiments, random sampling)
* subjective probability
    - Constraints versus priors
    - No posterior distributions without prior distributions
    - Prior generally matters
    - elicitation issues
    - arguments from consistency, "Dutch book," ...
    - why should I care about your subjective probability?
* invented model that's supposed to describe the phenomenon
    - in what sense?
    - to what level of accuracy?
    - description v. prediction v. predicting effect of intervention
    - testable to desired level of accuracy?
* metaphor: phenomenon behaves "as if random"

---

### .blue[Two very different situations:]

.framed[
+ .green[Scientist creates randomness by taking a random sample, assigning subjects at random to treatment or control, etc.]
+ .red[Scientist invents (assumes) a probability model for data the world gives.]
]

(1) allows sound inferences. Inferences drawn in (2) are only as good as the assumptions.

.blue[Gotta check the assumptions against the world:] Empirical support? Plausible? Iffy? Absurd?

---

### .blue[Making sense of probabilities in applied problems is hard]

* Probability often applied without thinking
* Reflexive way to try to represent uncertainty
* Not all uncertainty can be represented by a probability
* "Aleatory" versus "Epistemic"

---

- Aleatory
    + Canonical examples: coin toss, die roll, lotto, roulette
    + under some circumstances, behave "as if" random (but not perfectly)
- Epistemic: stuff we don't know
- The standard way of combining aleatory variability and epistemic uncertainty puts beliefs on a par with an unbiased physical measurement w/ known uncertainty.
    + Claim: by introspection, one can estimate without bias, with known accuracy, just as if one's brain were an unbiased instrument with known accuracy
    + Bacon's triumph over Aristotle should put this to rest, but empirically:
        - people are bad at making even rough quantitative estimates
        - quantitative estimates are usually biased
        - bias can be manipulated by anchoring, priming, etc.
        - people are bad at judging weights _in their hands_: biased by shape & density
        - people are bad at judging when something is random
        - people are overconfident in their estimates and predictions
        - confidence unconnected to actual accuracy.
        - anchoring affects entire disciplines (e.g., Millikan, c, Fe in spinach)
        - what if I don't trust your internal scale, or your assessment of its accuracy?
        - same observations that are factored in as "data" are also used to form beliefs: the "measurements" made by introspection are not independent of the data
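--

An illustrative aside (a minimal sketch, not one of the examples above): how often does a genuinely random sequence contain the long runs that people tend to read as "non-random"?

```python
import random

def longest_run(seq):
    """Length of the longest run of identical consecutive values."""
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best

random.seed(12345)  # fixed seed so the sketch is repeatable
n_tosses, n_reps = 100, 10_000
hits = sum(
    longest_run([random.randint(0, 1) for _ in range(n_tosses)]) >= 6
    for _ in range(n_reps)
)
print(f"Fraction of {n_tosses}-toss fair-coin sequences with a run of 6+: {hits / n_reps:.2f}")
# Roughly 0.8: long runs are typical of genuine randomness, though most people
# judge sequences that contain them to be "non-random".
```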
---

### .blue[Rates versus probabilities]

* In a series of trials, if each trial has the same probability p of success, and if the trials are independent, then the rate of successes converges (in probability) to p. (Law of Large Numbers)
* If a finite series of trials has an empirical rate p of success, that says nothing about whether the trials are random.
* If the trials are random _and_ have the same chance of success, the empirical rate is an estimate of the chance of success.
* If the trials are random _and_ have the same chance of success _and_ the dependence of the trials is known (e.g., the trials are independent), can quantify the uncertainty of the estimate.

---

### .blue[Thought experiments]

.framed[You are one of a group of 100 people. You learn that one will die in the next year. What's the chance it is you?
]

--

.framed[You are one of a group of 100 people. You learn that one is named "Philip." What's the chance it is you?
]

--

Why does the first invite an answer, and the second not?

Ignorance ≠ Randomness

---

### .blue[Cargo Cult Confidence Intervals]

.framed[
+ Have a collection of numbers, e.g., MME climate model predictions of warming
+ Take mean and standard deviation.
+ Report mean as the estimate; construct a confidence interval or "probability" statement from the results, generally using Gaussian critical values
+ IPCC does this; "7.9%" extinction paper Jeroen cited yesterday does this
]

--

#### .red[What's wrong with it?]

+ No random sample; no stochastic errors.
+ Even if there were a random sample, what justifies using normal theory?
+ Even if random and normal, misinterprets confidence as probability. Garbled; something like Fisher's fiducial inference
+ Ultimately, quantifauxcation.

---

### .blue[Random versus haphazard/unpredictable]

* Consider taking a sample of soup to tell whether it is too salty.
    - Stirring the soup, then taking a tablespoon, gives a random sample
    - Sticking in a tablespoon without looking gives a haphazard sample
* Tendency to treat haphazard as random
    - random requires deliberate, precise action
    - haphazard is sloppy

--

* Notions like probability, p-value, confidence intervals, etc., _apply only if the sample is random_ (or for some kinds of measurement errors)
    - Do not apply to samples of convenience, haphazard samples, etc.
    - Do not apply to populations.

---

### .blue[Some brief examples]

* Avian / wind-turbine interactions
* Earthquake probabilities
* Climate models and climate change probabilities

---

### .blue[Wind power: "avian / wind-turbine interactions"]

Wind turbines kill birds, notably raptors.

+ how many, and of what species?
+ how concerned should we be?
+ what design and siting features matter?
+ how do you build/site less lethal turbines?

---

### .blue[Measurements]

Periodic on-the-ground surveys, subject to:

+ censoring
+ shrinkage/scavenging
+ background mortality
+ is this pieces of two birds, or two pieces of one bird?
+ how far from the point of injury does a bird land?
+ attribution ...

--

Is it possible to ...

+ make an unbiased estimate of mortality?
+ reliably relate the mortality to individual turbines in wind farms?
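--

A toy simulation (all numbers invented for illustration) of why raw carcass counts are biased low when scavenging removes some carcasses and searchers miss some of the rest:

```python
import random

random.seed(2015)
true_deaths = 1000   # hypothetical kills in one monitoring interval (invented)
p_persist = 0.6      # chance a carcass survives scavenging until the survey (invented)
p_detect = 0.7       # chance a searcher finds a carcass that is still there (invented)

found = sum(
    random.random() < p_persist and random.random() < p_detect
    for _ in range(true_deaths)
)
print(f"true deaths: {true_deaths}; carcasses found: {found}")
# The raw count understates mortality by roughly the factor p_persist * p_detect
# (about 0.42 here), and that factor itself varies with season, groundcover,
# species, scavenger activity, observer, ...
```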
---

### .blue[Stochastic model]

Common: Mixture of a point mass at zero and some distribution on the positive axis.
E.g., "Zero-inflated Poisson"

Countless alternatives, e.g.:

+ observe `\(\max\{0, \mbox{Poisson}(\lambda_j)-b_j\}\)`, `\(b_j > 0\)`
+ observe `\(b_j\times \mbox{Poisson}(\lambda_j)\)`, `\(b_j \in (0, 1)\)`.
+ observe true count in area `\(j\)` with error `\(\epsilon_j\)`, where `\(\{\epsilon_j\}\)` are dependent, not identically distributed, nonzero mean

---

### .blue[Consultant]

* bird collisions random, Poisson distributed
* same for all birds
* independent across birds
* rates follow hierarchical Bayesian model that depends on covariates: properties of site and turbine design

--

#### What does this mean?

* when a bird approaches a turbine, it tosses a coin to decide whether to throw itself on the blades
* chance coin lands heads depends on site and turbine design
* all birds use the same coin for each site/design
* birds toss their coins independently

---

### .blue[Where do the models come from?]

+ Why random?
+ Why Poisson?
+ Why independent from site to site? From period to period? From bird to bird? From encounter to encounter?
+ Why doesn't chance of detection depend on size, coloration, groundcover, …?
+ Why do different observers miss carcasses at the same rate?
+ What about background mortality?

---

### .blue[Complications at Altamont]

+ Why is randomness a good model? Random is not the same as haphazard or unpredictable.
+ Why is Poisson in particular reasonable? Do birds in effect toss coins, independently, with same chance of heads, every encounter with a turbine? Is `\(\#\mbox{encounters} \times P(\mbox{heads})\)` constant?
+ Why estimate the parameter of a contrived model rather than actual mortality?
+ Do we want to know how many birds die, or the value of `\(\lambda\)` in an implausible stochastic model?
+ Background mortality—varies by time, species, etc.
+ Are all birds equally likely to be missed? Smaller more likely than larger? Does coloration matter?
+ Nonstationarity (seasonal effects—migration, nesting, etc.; weather; variations in bird populations)
+ Spatial and seasonal variation in shrinkage due to groundcover, coloration, illumination, etc.
+ Interactions and dependence.
+ Variations in scavenging. (Dependence on kill rates? Satiation? Food preferences? Groundcover?)
+ Birds killed earlier in the monitoring interval have longer time on trial for scavengers.
+ Differences or absolute numbers? (Often easier to estimate differences accurately.)
+ Same-site comparisons across time, or comparisons across sites?

---

.framed[
#### .blue[The Rabbit Axioms]

1. For the number of rabbits in a closed system to increase, the system must contain at least two rabbits.
2. No negative rabbits.
]

--

.framed[
#### .blue[Freedman's Rabbit-Hat Theorem]

You cannot pull a rabbit from a hat unless at least one rabbit has previously been placed in the hat.
]

--

.framed[
#### .blue[Corollary]

You cannot "borrow" a rabbit from an empty hat, even with a binding promise to return the rabbit later.
]

---

### .blue[Applications of the Rabbit-Hat Theorem]

* Can't turn a rate into a probability without assuming the phenomenon is random in the first place.
* Can't conclude that a process is random without making assumptions that amount to assuming that the process is random. (Something has to put the randomness rabbit into the hat.)
* Testing whether the process appears to be random using the _assumption_ that it is random cannot prove that it is random. (You can't borrow a rabbit from an empty hat.)
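--

To make that concrete, a minimal sketch (every parameter value is invented) of simulating the kind of zero-inflated Poisson carcass model above: the simulation produces "probabilities", but each one was placed in the hat by the assumed model.

```python
import math
import random

random.seed(42)
LAM, P_ZERO = 2.5, 0.3   # assumed Poisson mean and zero-inflation weight (pure invention)

def poisson_draw(lam):
    """Draw from Poisson(lam) by inverting its CDF."""
    u, k = random.random(), 0
    term = cum = math.exp(-lam)
    while u > cum:
        k += 1
        term *= lam / k
        cum += term
    return k

def carcass_count():
    """One draw from the assumed zero-inflated Poisson carcass-count model."""
    return 0 if random.random() < P_ZERO else poisson_draw(LAM)

draws = [carcass_count() for _ in range(100_000)]
print("P(count >= 5) under the assumed model:", sum(d >= 5 for d in draws) / len(draws))
# The output "probability" follows from LAM and P_ZERO, i.e., from assumptions,
# not from anything about real birds or real turbines: rabbit in, rabbit out.
```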
---

### .blue[Earthquake probabilities]

* Probabilistic seismic hazard analysis (PSHA) is the basis for seismic building codes in many countries; basis for siting nuclear power plants
* Models earthquakes as random in space, time, magnitude; independent magnitudes
* Models ground motion as random, given the occurrence of an event. Distribution in a particular place depends on the location and magnitude of the event.
* Claim to estimate "exceedance probabilities": chance acceleration exceeds some threshold in some number of years
* In U.S.A., codes generally require design to withstand accelerations with probability ≥ 2% in 50 y.
* PSHA arose from probabilistic risk assessment (PRA) in aerospace and nuclear power. Those are engineered systems whose inner workings are known but for some system parameters and inputs.
* Inner workings of earthquakes are almost entirely unknown: PSHA is based on metaphors and heuristics, not physics.
* Some assumptions are at best weakly supported by evidence; some are contradicted.
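--

For scale, a back-of-the-envelope conversion (it relies on the Poisson-occurrence assumption that the rest of this section questions): the 2%-in-50-years design probability corresponds to a mean recurrence time `\(T\)` with

`\(1 - e^{-50/T} = 0.02 \;\Rightarrow\; T = -50/\ln(0.98) \approx 2475\)` years.

The arithmetic is trivial; the substance is in the assumed model.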
---

### .blue[The PSHA equation]

Model earthquake occurrence as a marked stochastic process with known parameters. Model ground motion in a given place as a stochastic process, given the quake location and magnitude. Then,

> probability of a given level of ground movement in a given place is the integral (over space and magnitude) of the conditional probability of that level of movement, given that there's an event of a particular magnitude in a particular place, times the probability that there's an event of a particular magnitude in that place

* That earthquakes occur at random is an _assumption_ not based in theory or observation.
* involves taking rates as probabilities
    - Standard argument:
        - M = 8 events happen about once a century.
        - Therefore, the chance is about 1% per year.

---

### .blue[Earthquake casinos]

* Models amount to saying there's an "earthquake deck"
* Turn over one card per period. If the card has a number, that's the size of the quake you get.
* Journals and journals full of arguments about how many "8"s are in the deck, whether the deck is fully shuffled, whether cards are replaced and re-shuffled after dealing, etc.

--

* .red[but this is just a metaphor!]

---

### .blue[Earthquake terrorism]

* Why not say earthquakes are like terrorist bombings?
    - don't know where or when
    - know they will be large enough to kill
    - know some places are "likely targets"
    - but no probabilities
* What advantage is there to the casino metaphor?

---

### .blue[Rabbits and Earthquake Casinos]

#### What would make the casino metaphor apt?

1. The physics of earthquakes might be stochastic. But it isn't.
2. Stochastic models might provide a compact, accurate description of earthquake phenomenology. But they don't.
3. Stochastic models might be useful for predicting future seismicity. But they aren't (Poisson, Gamma renewal, ETAS).

Three of the most destructive recent earthquakes were in regions seismic hazard maps showed to be relatively safe (2008 Wenchuan M7.9, 2010 Haiti M7.1, & 2011 Tohoku M9) [Stein, Geller, & Liu, 2012](http://web.missouri.edu/~lium/pdfs/Papers/seth2012-tecto-hazardmap.pdf)

#### What good are the numbers?

---

## .blue[Climate models]

---

### .blue[IPCC Cross-Working Group Meeting on Consistent Treatment of Uncertainties, 2010]

https://www.ipcc.ch/pdf/supporting-material/uncertainty-guidance-note.pdf (at p.2)

> … quantified measures of uncertainty in a finding expressed probabilistically (based on statistical analysis of observations or model results, or expert judgment).

> … Depending on the nature of the evidence evaluated, teams have the option to quantify the uncertainty in the finding probabilistically. In most cases, author teams will present either a quantified measure of uncertainty or an assigned level of confidence.

> … Because risk is a function of probability and consequence, information on the tails of the distribution of outcomes can be especially important.

> … Author teams are therefore encouraged to provide information on the tails of distributions of key variables …

---

### .blue[Cargo-cult confidence common in IPCC work:]

As mentioned above,

+ have a list of numbers, not a sample from anything and certainly not a random sample
+ take mean and SD
+ treat as if random sample from Normal distribution
+ confuse confidence with probability
+ garble interpretation, using something like Fisher's fiducial inference

Result is gibberish.
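--

The recipe is easy to write down, which is part of its appeal. A minimal sketch (the numbers are made-up stand-ins for an ensemble of model outputs, not IPCC data):

```python
import statistics
from math import sqrt

# Made-up stand-ins for an ensemble of model outputs: NOT data, NOT a random sample.
ensemble = [2.1, 2.9, 3.4, 1.8, 2.6, 3.1, 2.2, 2.8]

m = statistics.mean(ensemble)
sd = statistics.stdev(ensemble)            # sample standard deviation
half = 1.96 * sd / sqrt(len(ensemble))     # Gaussian "95%" half-width

print(f'"95% interval": {m - half:.2f} to {m + half:.2f}')
# Every step is mechanical, but the result is not a probability statement about the
# world: the ensemble is not a random sample from anything, nothing justifies normal
# theory, and a confidence level is not the chance that the truth is in the interval.
```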
---

### .blue[Do Monte Carlo simulations estimate real-world probabilities?]

Monte Carlo is a way to substitute computing for calculation.

It does not reveal anything that was not already an assumption in the calculation.

The distribution of the output results from the assumptions in the input.

The randomness in the formulation is an assumption, not a conclusion; the distribution of that randomness is an assumption, not a conclusion.

---

### .blue[Does "expert judgment" reveal probability?]

Recall from above:

+ Claim: by introspection, one can estimate without bias, with known accuracy, just as if one's brain were an unbiased instrument with known accuracy.
+ But empirically,
    - people are bad at making even rough quantitative estimates
    - quantitative estimates are usually biased
    - bias can be manipulated by anchoring, priming, etc.
    - people are bad at judging weights _in their hands_: biased by shape & density
    - people are bad at judging when something is random.
    - people are overconfident in their estimates and predictions
    - confidence unconnected to actual accuracy.
    - anchoring affects entire disciplines (e.g., Millikan, c, Fe in spinach)
+ what if I don't trust your internal scale, or your assessment of its accuracy?
+ same observations that are factored in as "data" are also used to form beliefs: the "measurements" made by introspection are not independent of the data.

---

### .blue[[Rhodium Group Climate Prospectus](http://rhg.com/reports/climate-prospectus)]

"Risky Business" project co-chaired by Michael R. Bloomberg, Henry Paulson, Tom Steyer. Funded by Bloomberg Philanthropies, the Paulson Institute, and TomKat Charitable Trust. Also Skoll Global Threats Fund, Rockefeller Family Fund, McKnight Foundation, Joyce Foundation.

"While our understanding of climate change has improved dramatically in recent years, predicting the severity & timing of future impacts remains a challenge. Uncertainty surrounding the level of GHG emissions going forward & the sensitivity of the climate system to those emissions makes it difficult to know exactly how much warming will occur, & when. Tipping points, beyond which abrupt & irreversible changes to the climate occur, could exist. Due to the complexity of the Earth’s climate system, we do not know exactly how changes in global average temperatures will manifest at a regional level. There is considerable uncertainty about how a given change in temperature, precipitation, or sea level will impact different sectors of the economy, & how these impacts will interact."

---

### & yet, …

"In this American Climate Prospectus, we aim to provide decision-makers in business & in government with the facts about the economic risks & opportunities climate change poses in the United States."

They estimate the effect of changes in temperature and humidity on mortality, a variety of crops, energy use, labor force, and crime, **at the county level** through 2099.

"evidence-based approach."

---
### Sanity check

Even the notion that if you _knew exactly_ what the hourly temperature and humidity will be in every square meter of the globe for the next hundred years, you could accurately predict the effect on violent crime—on any timescale or any spatial scale—is patently absurd.

Adding the uncertainty in temperature and humidity obviously doesn't make the problem easier.

--

Ditto for labor, crop yields, mortality, etc.

--

And that's even if _ceteris_ were _paribus_—which they will not be.

--

And that's even for next year, much less for 80 years from now.

--

And that's even for predicting a global average, not county-level predictions.

This is a hallucination, not Science.

---

## .blue[Talk 2: Reproducibility and "Preproducibility"]

---

.left-column[
### Why do we need a new term?
]

---

.left-column[
### Why do we need a new term?
### Many concepts, many labels, used inconsistently
]
.right-column[
+ replicable
+ reproducible
+ repeatable
+ confirmable
+ stable
+ generalizable
+ reviewable
+ auditable
+ verifiable
+ validatable
]

---

.left-column[
### Always some _ceteris_ assumed _paribus_ … approximately.
(cf. Andrea's famous line)
]
.right-column[
+ Similar result if experiment is repeated in same lab?
+ Similar result if procedure repeated elsewhere, by others?
+ Similar result under similar circumstances?
+ Same numbers/graphs if data analysis is repeated by others?
]

--

.full-width[
### With respect to what changes is the result stable?
### Changes of what size?
### How stable?
]

---

### What _ceteris_ need not be _paribus_?

### .blue[Science may be described as the art of systematic over-simplification—the art of discerning what we may with advantage omit.] —Karl Popper

---

### The desired level of abstraction/generalization _defines_ the scientific discipline**
--

+ If you want to generalize to all time and all universes, you are doing math

--

+ If you want to generalize to our universe, you are doing physics

--

+ If you want to generalize to all life on Earth, you are doing molecular and cell biology

--

+ If you want to generalize to all mice, you are doing murine biology

--

+ If you want to generalize to C57BL/6 mice, I'm not sure what kind of science you are doing

--

+ If you only care about one mouse in one lab in one experiment on one day, I'm not sure you're doing science

--

The tolerable variation in experimental conditions depends on the desired inference.

--

.blue[If variations in conditions that are irrelevant to the discipline cause the results to vary, there's a replicability problem: the _outcome_ doesn't have the right level of abstraction.]

--

** Cf., "All science is either physics or stamp collecting." —Lord Rutherford

---

#### JBS Haldane, 1926. "[On Being the Right Size](http://irl.cs.ucla.edu/papers/right-size.pdf)," *Harper's Magazine*

> You can drop a mouse down a thousand-yard mine shaft; and, on arriving at the bottom, it gets a slight shock and walks away, provided that the ground is fairly soft. A rat is killed, a man is broken, a horse splashes. For the resistance presented to movement by the air is proportional to the surface of the moving object. …

Is this physics, biology, or what?

---

### Abstraction and Replicability

+ If something only happens under *exactly* the same circumstances, unlikely to be useful.

--

+ What factors may we omit from consideration?

--

+ If an attempt to replicate/reproduce fails, _why_ did it fail?
    + The effect is intrinsically variable or intermittent
    + The result is a statistical fluke or "false discovery"
    + Something that mattered was different: need to qualify the claim

--

.red[If the necessary qualification is too restrictive, the result might change disciplines.]

---

### How can you tell whether you are performing substantially the same experiment?

--

### .blue[Need _preproducibility_: a description that includes those things that we may *not* with advantage omit.]

--

### Why care?

--

### **Without preproducibility, there's just a story, not scientific evidence.** Can't verify claims.

--

### .red[Science should not require trusting authority.]

--

### .blue[Science is "show me," not "trust me."]

---

.left-column[
### Questions
]
.right-column[
+ materials (organisms), instruments, procedures, & conditions specified adequately to allow repeating data collection?
+ data analysis described adequately to check/repeat?
+ code & data available to re-generate figures and tables?
+ code readable and checkable?
+ software build environment specified adequately?
+ what is the evidence that the result is correct?
+ how generally do the results hold? how stable are the results to perturbations of the experiment?
]

---

.left-column[
### Questions, questions
]
.right-column[
+ What's the underlying experiment?
+ What are the raw data?
  How were they collected/selected?
+ How were raw data processed to get "data"?
+ How were processed data analyzed?
+ Was that the right analysis?
+ Was it done correctly?
+ Were the results reported correctly?
+ Were there ad hoc aspects?
  What if different choices had been made?
+ What other analyses were tried?
  How was multiplicity treated?
+ Can someone else use the procedures and tools?
]

---

.left-column[
### Variation: wanted and unwanted
]
.right-column[
+ Focus in this conference on variation with genotype, biology, lab, procedures, handlers, …
+ Desirable that results are stable wrt *some* kinds of variability
+ OTOH, variability itself can be scientifically interesting
+ .blue[As a statistician, I worry also about variation with analysis/methodology & *implementation* of tools]
+ Undesirable for the analysis to be unstable, but algorithms matter, numerics matter, PRNGs matter, …
+ Relying on packaged tools can be a problem. Much commercial software has no warranty of fitness for use. Many bugs in Excel's statistical routines; some in MATLAB too.
]

---

## Computational preproducibility

### "Rampant software errors undermine scientific results"

David A.W. Soergel, 2014 http://f1000research.com/articles/3-303/v1

Abstract: Errors in scientific results due to software bugs are not limited to a few high-profile cases that lead to retractions and are widely reported. Here I estimate that in fact most scientific results are probably wrong if data have passed through a computer, and that these errors may remain largely undetected. The opportunities for both subtle and profound errors in software and data management are boundless, yet they remain surprisingly underappreciated.

---

## How can we do better?

Adopt tools from software development world:

+ Revision control systems (Dropbox and Google Drive are not revision control)
+ Documentation, documentation, documentation
+ Coding standards/conventions
+ Pair programming
+ Issue trackers
+ Code reviews (and in teaching, grade students' *code*, not just their *output*)
+ Unit testing (a minimal example below)
+ Code coverage testing
+ Regression testing
+ Scripted analyses: no point-and-click tools, _especially_ spreadsheet calculations
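--

A minimal illustration of the last few items (the function and its numbers are hypothetical): a tiny scripted-analysis step with unit tests that a test runner such as pytest can execute automatically.

```python
# test_analysis.py -- a toy scripted-analysis function plus unit tests; run with `pytest`
import pytest

def rate_per_turbine(carcass_counts, n_turbines):
    """Crude summary: total carcasses found per turbine surveyed."""
    if n_turbines <= 0:
        raise ValueError("need at least one turbine")
    return sum(carcass_counts) / n_turbines

def test_rate_simple():
    assert rate_per_turbine([2, 0, 1], n_turbines=3) == 1.0

def test_rejects_bad_input():
    with pytest.raises(ValueError):
        rate_per_turbine([1, 2], n_turbines=0)
```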
---

### Spreadsheets are OK for data entry.

But not for calculations.

+ Conflates input, code, output, presentation
+ UI invites errors, then obscures them
+ Debugging extremely hard
+ Unit testing hard/impossible
+ Replication hard/impossible
+ Code review hard
+ [European Spreadsheet Risk Interest Group](www.eusprig.org) horror stories:
    + Reinhart & Rogoff: justification for S. European austerity measures
    + JP Morgan Basel II VaR: risk understated
    + IOC: 10,000 tickets oversold
    + Knox County, TN; W. Baraboo Village, WI; … : errors costing $millions
+ According to KPMG and PWC, [over 90% of corporate spreadsheets have errors](http://www.theregister.co.uk/2005/04/22/managing_spreadsheet_fraud/)

--

#### Bug in the PRNG for many generations of Excel, allegedly fixed in Excel 2010.

#### Other long-standing bugs in Excel; PRNG still won't accept a seed; etc.

--

### .red["Stress tests" of international banking system use Excel simulations: Be Afraid. Be Very Afraid.]

---

### .blue[Relying on spreadsheets for important calculations is like driving drunk:]

--

### .red[No matter how carefully you do it, a wreck is likely.]

---

### .blue[Openness (data, code, open-access publication) & preproducibility: enable evidence of correctness]

--

.left-column[
### .red[Obstacles & Excuses]
]
.right-column[
+ time & effort
+ no direct academic credit
+ importance not appreciated
+ requires changing habits, tools, etc.
+ fear of scoops
+ fear of exposure of flaws
+ IP/privacy issues, data moratoria, etc.
+ lack of tools, training, infrastructure
+ lack of support from journals, length limits, etc.
+ lack of standards? lack of shared lexicon?
]

---

### Stodden (2010) Survey of NIPS:
.left[
**Code**
77%
52%
44%
40%
34%
N/A
30%
30%
20%
]
.middle[
.center[
**Complaint/Excuse**
Time to document and clean up
Dealing with questions from users
Not receiving attribution
Possibility of patents
Legal barriers (i.e., copyright)
Time to verify release with admin
Potential loss of future publications
Competitors may get an advantage
Web/disk space limitations
]
]
.right[
**Data**
54%
34%
42%
N/A
41%
38%
35%
33%
29%
]

--

.full-width[
### .red[Fear, greed, ignorance, & sloth.]
]

---

.left-column[
### Benefits of Openness and Computational Preproducibility
]
.right-column[
+ encourages careful documentation & coding
+ enables checking, evidence of correctness
+ share/collaborate w/ others & your future self
+ efficiency from tool re-use and extension
+ greater impact
+ greater scientific throughput overall (Claerbout model)
]

--

.full-width[
#### .blue[If I say "just trust me" and I'm wrong, I'm untrustworthy.]

#### .blue[If I say "here's my work" and it's wrong, I'm honest and human.]
]

---

## Using revision-control systems for teaching, research, collaboration

+ Dropbox / Box / Google Drive are _not_ revision-control systems
+ Teaching use cases:
    + submit homework by pull request (can see commits)
    + collaborate on term projects using Git (_blame_ and _diffs_ help); teaching students how to collaborate effectively is extremely valuable
    + create project wikis
    + use for timed exams: push at a coordinated time, pull requests
    + supports automated testing of code
+ Research use cases
    + 1st step of new project: create a repo
    + commits leave breadcrumbs
    + notes, code, manuscripts, etc. (not ideal for large datasets)
    + know last version that worked
    + no "file.1," "file.1~," "file.2-final," "file.2-really-really-final," etc.
+ Collaboration use cases
    + parallel development through branches
    + feature implementation
    + "what if?" branches
    + can find last working version of code

---

## Script analyses and use "lab notebook"-style tools

+ IPython/Jupyter notebook (Sweave and knitR are great for papers; less good for workflow), ...
+ ≠ literate programming
+ leave breadcrumbs
+ readable
+ easy to re-run analysis
+ easy to substitute alternatives
+ easy to build on previous analyses

---

## Preproducibility is collaboration w/ people you don't know,

--

including yourself next week.

--

.left-column[
### Preproducibility & collaboration
]
.right-column[
+ same habits, attitudes, principles, and tools facilitate both
+ develop better work habits, *computational hygiene*
+ analogue of good lab technique in wet labs
]

--

.full-width[.blue[Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do. —Knuth, 1984]]

---

.left-column[
### Towards preproducibility
]
.right-column[
#### Carrot & stick, social engineering

+ eliminate the moral hazard
+ criterion for promotions: show your work!
  (not just the ad for your work)
+ altmetrics
+ (enforce) funding agency requirements
+ (enforce) journal requirements
+ [Thermo ML](http://trc.nist.gov/ThermoML.html): ≈ 20% of papers that otherwise would have been accepted had serious errors.
]

---
layout: true

# Personal failure stories

+ Multitaper spectrum estimation for time series with gaps: lost $C$ source for MEX files; old MEX files not compatible with some systems.

> Unfortunately I was not able to find my code for multitapering. I am
> pretty sure I saved them after I finished my thesis, along with all
> the documentation, but it seems like I lost them through one of the
> many computer moves and backups since. I located my floppy (!) disks
> with my thesis text and figures but not the actual code.

+ Poisson tests of declustered catalogs: current version of code does not run.

---
layout: true

# Why work reproducibly?

> There is only one argument for doing something; the rest are arguments for doing nothing.
> The argument for doing something is that it is the right thing to do.
> .right[--Cornford, 1908. *Microcosmographia Academica*]

---
layout: false

# When and how?

+ Built-in or bolt-on?
+ Tools
+ Training
+ Developing good habits
+ Changing academic criteria for promotions: How nice that you advertised your work in *Science*, *Nature*, *NEJM*, etc.! Where's the actual work? Where's the evidence that it's right? That it's useful to others?

---
layout: false

.left-column[
### Tools and Practices
]
.right-column[
+ **use revision control**
+ avoid point-and-click tools, especially spreadsheets:
  error prone, not debuggable, not reproducible, hard to do unit testing
+ script your analyses
+ use standard, trackable build environment
+ use issue trackers
+ use open source tools when possible
+ document your work
+ use "lab notebook" style tools, e.g., [Jupyter/IPython](http://jupyter.org/) (supports Python, R, Julia)
]

---

.left-column[
### [Berkeley Common Environment (BCE)](http://collaboratool.berkeley.edu)

easy-install reproducible recipe for software environment to work reproducibly
]
.right-column[
+ OS matters, versions matter, build environments matter, …
+ in computational courses, can take two weeks to get everyone "on the same page" w/ software, VMs, etc.
+ work done by one PhD student is rarely usable by the advisor or the next PhD student—much less by anyone else
+ BCE use cases: teaching (and exams), research labs, multi-PI & multi-institute collaborations
]

---

.left-column[
### [Berkeley Common Environment (BCE)](http://collaboratool.berkeley.edu)
]
.right-column[
Ingredients

+ ubuntu
+ ansible
+ docker
+ vagrant
+ lxc
+ git, a git gui, gitlabhq, gitannex assistant
+ R with various libraries for statistics and machine learning
+ Python
    + IPython
    + Numpy
    + Scipy
    + Matplotlib
    + Pandas
    + Cython
    + other libraries
+ mySQL, SQLite
+ LaTeX, BibTeX
    + AMS, Beamer, & other styles
+ test suites for all the software
]
.full-width[
+ Easy to install on any platform: ≈2 clicks.
+ Easy to "spin up" as many instances as needed in the cloud.
]

---

### It's hard to teach an old dog new tricks.

--

### .blue[Solution: Work with puppies.]

--

### Statistics 159/259: Reproducible & Collaborative Statistical Data Science

+ Project (2013): improving earthquake forecasts for Southern CA
+ [2013 Syllabus](http://www.stat.berkeley.edu/~stark/Teach/ReproData/syllabus.pdf) includes intro to virtual machines, git, issue trackers, GitHub, IPython, & the scientific problem
(tried to) reproduce previous work
focus on doing science well, collaboratively, using good process—not on tools
+ [2014 Syllabus](https://github.com/ucb-stat-157/fall-2014-public) includes intro to VMs, git, Python and IPython, code review, unit testing, data visualization, code efficiency, unstructured data, MapReduce and Hadoop, AWS, SQL, databases