class:textShadow ### Preproducibility, Reproducibility, Replicability:
First Things First #### Philip B. Stark #### Department of Statistics, University of California, Berkeley
http://www.stat.berkeley.edu/~stark | @philipbstark #### Geodynamics and Big Data
Conference in honor of Dave Yuen
9–11 June 2018 --- .center.vcenter.large[An experiment or analysis is _preproducible_
if it has been described in adequate detail for others to undertake it.] --- ### Preproducibility in a nutshell: -- .blue.center.large[Show your work.] -- Provide _evidence_ that you are right and a way to check, not just a claim. --- ### What is the purpose of scientific publishing? -- + Establish priority / get credit? -- + Communicate claims? -- + Provide evidence that claims are correct? -- + Provide enough information that others can re-undertake and verify? -- + Provide methods to others, to contribute to science as a societal undertaking? --- .center[
] --- .vcenter[An article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.
—Buckheit and Donoho, 1995] -- By working preproducibly, you … -- + allow others "without mistake, and with as little trouble as possible, to be able to repeat such unusual experiments" -- + make "multiplication" and "virtual witnessing" possible -- + provide evidence that your claim is a fact --- If you do not work preproducibly, you are … -- + merely advertising the result -- + asking others to take the result on faith -- + withholding crucial evidence needed to check or repeat your work -- + making actual replication/reproduction even less likely --- .center.vcenter.large.blue[Science should be _show me_,
not _trust me_.] -- .center.vcenter.large.red[_Nullius in verba_] --- .left-column[ ### Many concepts, many labels, used inconsistently ] .right-column[ + replicable + reproducible + repeatable + confirmable + stable + generalizable + reviewable + auditable + verifiable + validatable ] -- .full-width[Generally about whether something happens again.
.blue[No term for "not enough information to try."]] --- ### Preproducibility versus Reproducibility and Replicability + A failure of _preproducibility_ is often a failure of scientific _communication_. -- + A failure of _reproducibility_ or _replicability_ could be a mark of a false discovery, a failure of practice, or a sign of something scientifically interesting --- .left-column[ ### Repeatability, replicability, reproducibility, ...,
some _ceteris_ assumed _paribus_ … approximately. ] .right-column[ + Similar result if experiment is repeated in same lab? + Similar result if procedure repeated elsewhere, by others? + Similar result under similar circumstances? + Same numbers/graphs if data analysis is repeated by others? ] -- .full-width[ + With respect to what changes is the result stable? + Changes of what size? + How stable? ] --- ### What _ceteris_ need not be _paribus_? --
### .blue.center[Science may be described as the art of systematic over-simplification—the art of discerning what we may with advantage omit. —Karl Popper] -- ### .center[_Preproducibility_ means identifying and specifying those things that we may *not* with advantage omit.] --- ### The desired level of abstraction/generalization _defines_ scientific disciplines
-- + If you want to generalize to all time and all universes: math -- + If you want to generalize to our universe: physics -- + If you want to generalize to all life on Earth: molecular and cell biology -- + If you want to generalize to all mice: murine biology -- + If you want to generalize to C57BL/6 mice: I don't know -- + This mouse in this lab in this experiment today: maybe not science? -- The tolerable variation in experimental conditions depends on the desired inference. -- .blue[If variations in conditions that are irrelevant to the discipline cause the results to vary, there's a replicability problem: the _outcome_ doesn't have the right level of abstraction.] -- ** Cf, "All science is either physics or stamp collecting." —Lord Rutherford --- ### Abstraction and Replicability + If something only happens under *exactly* the same circumstances, unlikely to be useful. -- + What factors may we omit from consideration? -- + If an attempt to replicate/reproduce fails, _why_ did it fail? (cf Newton) + The effect is intrinsically variable or intermittent + The result is a statistical fluke or "false discovery" + Something that mattered was different -- .red[If the necessary qualification is too restrictive, the result might change disciplines.] --- .left-column[ ### Questions ] .right-column[ + materials (organisms), instruments, procedures, & conditions specified adequately to allow repeating data collection? + data analysis described adequately to check/repeat? + code & data available to re-generate figures and tables? + code readable and checkable? + software versions and build environment specified adequately? + what is the evidence that the result is correct? + how generally do the results hold? how stable are the results to perturbations of the experiment? ] --- .left-column[ ### Questions, questions ] .right-column[ + What's the underlying experiment? + What are the raw data?
How were they collected/selected? + How were raw data processed to get "data"? + How were processed data analyzed? + Was that the right analysis? + Was it done correctly? + Were the results reported correctly? + Were there ad hoc aspects?
What if different choices had been made? + What other analyses were tried?
How was multiplicity treated? + Can someone else use the procedures and tools? ] --- .left-column[ ### Variation: wanted and unwanted ] .right-column[ + Variation with genotype, biology, lab, procedures, handlers, reagents, … + Desirable that results are stable wrt *some* kinds of variability + OTOH, variability itself can be scientifically interesting + .blue[I worry also about variation with analysis/methodology & *implementation* of tools] + Undesirable for the analysis to be unstable, but algorithms matter, numerics matter, PRNGs matter, … + Relying on packaged/commercial tools can be a problem. ] --- ### Spreadsheets might be OK for data entry.
But not for calculations.

+ Conflate input, code, output, presentation
+ UI invites errors, then obscures them
+ Debugging hard
+ Unit testing hard/impossible
+ Replication hard/impossible
+ Code review hard
+ According to KPMG and PWC, [over 90% of corporate spreadsheets have errors](http://www.theregister.co.uk/2005/04/22/managing_spreadsheet_fraud/)

--

+ Bug in the PRNG for many generations of Excel, allegedly fixed in Excel 2010.

--

+ Other bugs in Excel `+`, `*`, statistical routines; PRNG still won't accept a seed; etc.

---
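### A seeded PRNG in a script

The seed matters because a simulation or resampling result can be re-generated exactly only if the generator and its seed are recorded. Below is a minimal sketch in Python, using only the standard library. It is not from the original talk: the seed value, the data, and the function name `bootstrap_se_of_mean` are all hypothetical.

```python
import random
import statistics

SEED = 20180609  # hypothetical value; what matters is that it is recorded and reported
DATA = [2.1, 3.4, 1.9, 5.0, 4.2]  # made-up measurements, for illustration only


def bootstrap_se_of_mean(data, reps=1000, seed=SEED):
    """Bootstrap estimate of the standard error of the mean."""
    # A private, explicitly seeded generator: the result does not depend on
    # whatever else has drawn from the global random state.
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(reps)
    ]
    return statistics.stdev(means)


first = bootstrap_se_of_mean(DATA)
second = bootstrap_se_of_mean(DATA)
print(first == second)  # same seed, same code, same data: same number
```

Someone else who has the script, the data, and the seed can re-run it and compare the result to the reported value.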
--- ### .blue[Relying on spreadsheets for important calculations is like driving drunk:] -- ### .red[No matter how carefully you do it, a wreck is likely.] --- ### [2014 Coverity study](http://go.coverity.com/rs/157-LQW-289/images/2014-Coverity-Scan-Report.pdf) + 0.61 errors per 1,000 lines of source code in open-source projects + 0.76 errors per 1,000 lines of source code in commercial software --
Scientists generally don't use good software engineering practices, so expect worse in practice. ---
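### What those rates might imply

A rough back-of-envelope sketch, assuming (crudely) that errors scale linearly with code length and that the Coverity densities carry over to research code. The 5,000-line analysis code and the function name `expected_defects` are hypothetical.

```python
# Error densities quoted above from the 2014 Coverity report (per 1,000 lines).
OPEN_SOURCE_PER_KLOC = 0.61
COMMERCIAL_PER_KLOC = 0.76


def expected_defects(lines_of_code, defects_per_kloc):
    """Expected number of errors under a simple linear-scaling assumption."""
    return lines_of_code * defects_per_kloc / 1000


LINES = 5000  # hypothetical size of an analysis code base

for label, rate in [("open-source rate", OPEN_SOURCE_PER_KLOC),
                    ("commercial rate", COMMERCIAL_PER_KLOC)]:
    print(f"{label}: about {expected_defects(LINES, rate):.1f} errors in {LINES} lines")
```

Even at professional error rates, a modest analysis code would be expected to contain a few errors.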
--- ### [Thermo ML](http://trc.nist.gov/ThermoML.html): ≈ 20% of papers that otherwise would have been accepted had serious errors. --- ### Stodden (2010) Survey of NIPS re code & data:
| Code | Complaint/Excuse | Data |
|:---:|:---|:---:|
| 77% | Time to document and clean up | 54% |
| 52% | Dealing with questions from users | 34% |
| 44% | Not receiving attribution | 42% |
| 40% | Possibility of patents | N/A |
| 34% | Legal barriers (i.e., copyright) | 41% |
| N/A | Time to verify release with admin | 38% |
| 30% | Potential loss of future publications | 35% |
| 30% | Competitors may get an advantage | 33% |
| 20% | Web/disk space limitations | 29% |

--

.full-width[
### .red[Fear, greed, ignorance, & sloth.]
]

---

### Hacking the limbic system

.red[If I say _just trust me_ and I'm wrong, I'm untrustworthy.]

--

.blue[If I say _here's my work_ and it's wrong, I'm honest, human, and serving scientific progress.]

--

### .blue.center.vcenter[Science should be "help me if you can,"
not "catch me if you can."]

---

### Revision-control systems for teaching, research, collaboration

+ Teaching use cases:
    + submit homework by pull request (can see commits)
    + collaborate on term projects
    + create project wikis
    + use for timed exams: push at a coordinated time, pull requests
    + supports automated testing of code
+ Research use cases:
    + 1st step of new project: create a repo
    + commits leave breadcrumbs
    + notes, code, manuscripts, etc. (not ideal for large datasets)
    + know last version that worked
+ Collaboration use cases:
    + parallel development & feature implementation through branches
    + "what if?" branches
    + can find last working version of code
    + _blame_

---

### Scripts & notebook-style tools

+ IPython/Jupyter notebook (Sweave and knitr are great for papers; less good for workflow), ...
+ leave breadcrumbs
+ readable
+ easy to re-run and modify analysis
+ easy to build on previous analyses

---

### Preproducibility is collaboration w/ people you don't know,

--

including yourself next week.

--

.left-column[
### Preproducibility & collaboration
]
.right-column[
+ same habits, attitudes, principles, and tools facilitate both
+ develop better work habits, *computational hygiene*
+ analogue of good lab technique in wet labs
]

---

### Why work p/reproducibly?

.vcenter.blue[There is only one argument for doing something; the rest are arguments for doing nothing.
The argument for doing something is that it is the right thing to do.
—Cornford, 1908. *Microcosmographia Academica*] --
My top reasons:

1. I feel good about it.
1. Others can check my work and correct it if it's wrong.
1. Others can re-use and extend my work more easily.

--

.blue[_Others_ includes me, next week.]

---

### How can we do better?

+ Scripted analyses: no point-and-click tools, _especially_ spreadsheets
+ Revision control systems
+ Documentation, documentation, documentation
+ Coding standards/conventions
+ Pair programming
+ Issue trackers
+ Code reviews (and in teaching, grade students' *code*, not just their *output*)
+ Code tests: unit, integration, coverage, regression

---

### Checklist

1. Don't use spreadsheets for calculations.
1. Script your analyses, including data cleaning and munging.
1. Document your code.
1. Record and report software versions, including library dependencies.
1. Use unit tests, integration tests, coverage tests, regression tests.
1. Avoid proprietary software that doesn't have an open-source equivalent.
1. Report all analyses tried (transformations, tests, selections of variables, models, etc.) before arriving at the one emphasized.
1. Make code and code tests available.
1. Make data available in an open format; provide data dictionary.
1. Publish in open journals.

---

### Why open publication?

+ Research funded by agencies
+ Conducted at universities by faculty et al.
+ Refereed/edited for journal by faculty at no cost to journal
+ Page charges paid by agencies
+ Exclusionary & morally questionable for readers to have to pay to view

---

#### [ARL 1986-2016](http://arl.nonprofitsoapbox.com/storage/documents/expenditure-trends.pdf)

Also [CFUCBL rept](http://evcp.berkeley.edu/sites/default/files/FINAL_CFUCBL_report_10.16.13.pdf)
---

### It's hard to teach an old dog new tricks.

--

### .blue[So work with puppies!]

---

### Teach tools by doing science preproducibly, not by focusing on tools

--

### .blue.center[Eyes on the code, not just the output!]

---
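### Recording software versions

The checklist item "Record and report software versions, including library dependencies" can be partly automated. This is a minimal sketch, not a complete provenance tool. The function name `environment_report` and the package names passed to it are placeholders; substitute whatever the analysis actually imports.

```python
import json
import platform
from importlib import metadata


def environment_report(packages):
    """Collect interpreter, OS, and installed package versions as a dict."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["packages"][name] = None  # not installed in this environment
    return report


# Placeholder package list; save the output alongside the results it describes.
print(json.dumps(environment_report(["numpy", "scipy", "pip"]), indent=2))
```

Committing the resulting JSON next to the figures and tables leaves a record of the environment that produced them.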
---

### Pledge

> A. _I will not referee any article that does not contain enough information to tell whether it is correct._

> B. _Nor will I submit any such article for publication._

> C. _Nor will I cite any such article published after [date]._

See also _Open science peer review oath_ http://f1000research.com/articles/3-271/v2

---

#### Recap

+ What does it mean to work p/reproducibly?
+ Show your work!

---

#### Some resources

+ [Data Carpentry](http://www.datacarpentry.org/), [Software Carpentry](http://software-carpentry.org/)
+ [RunMyCode](http://www.runmycode.org/), [Research Compendia](http://researchcompendia.org/), [FigShare](https://figshare.com/)
+ [Jupyter](http://jupyter.org/) (>40 languages!), [Sweave](https://www.statistik.lmu.de/~leisch/Sweave/), [RStudio](https://www.rstudio.com/), [knitr](http://yihui.name/knitr/)
+ [BCE](http://bce.berkeley.edu/), [Docker](https://www.docker.com/), [Virtual Box](https://www.virtualbox.org/), [AWS](https://aws.amazon.com/), ..., [GitHub](https://github.com/), [TravisCI](https://travis-ci.org/), [noWorkFlow](https://pypi.python.org/pypi/noworkflow), ...
+ Reproducibility initiative http://validation.scienceexchange.com/#/reproducibility-initiative
+ Best practices for scientific software dev http://arxiv.org/pdf/1210.0530v4.pdf
+ [Federation of American Societies for Experimental Biology](https://www.faseb.org/Portals/2/PDFs/opa/2016/FASEB_Enhancing%20Research%20Reproducibility.pdf)
+ Was ist open science? http://openscienceasap.org/open-science/