class: center, middle, inverse, title-slide # Building R Packages ## ‘Good Enough’ Practices for Applied Statistics ###
Nima Hejazi
### UC Berkeley ### 2020-02-12 (updated: 2020-02-12) --- class: inverse, middle, center # Science x Software x Statistics --- # Robust Statistical Software for Science "The origins of the scientific method...amount to insistence on direct evidence. This is reflected in the motto of The Royal Society, founded in 1660: _Nullius in verba_, which roughly means 'take nobody's word for it'." -Philip B. Stark, in _The Practice of Reproducible Research_ - The scientific paradigm is _show me, not trust me_: we need robust, well-documented, rigorously tested code to ensure validity, reproducibility, reusability. - The software we distribute allows others to engage with our analysis of a data set or engage in testing a (newly) proposed methodology. - How are conclusions from data analysis to be believed if the code that underlies them cannot be closely scrutinized? - Methodological developments are slow in their uptake: easy-to-use software aids this process. --- # The code I wrote that one time... "An article...in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures." -David Donoho, _WaveLab_ ## We want software for reproducible applied statistics - Clear, easily accessible documentation, including examples. - Unit testing to rigorously assess functions, classes, and other constructs. - Continuous integration, ensuring accessibility across operating systems and constant-time monitoring of software quality. - Open source development, which embodies an ongoing, continuous, public peer review of the (academic) software product. ## Why write an R package? - To distribute R code and documentation - To keep track of the misc. R functions you write and reuse - To distribute data and software accompanying a paper --- # Documentation "_A Year is a Long Time in this Business._ Once, about a year after one of us had done some work and written an article (and basically forgot the details of the work he had done), he had the occasion to apply the methods of the article on a newly-arrived dataset. When he went back to the old software library to try and do it, he couldn't remember how the software worked -- invocation sequences, data structures, etc. In the end, he abandoned the project, saying he just didn't have time to get in to it anymore." -excerpted from _WaveLab_ - Old scripts (if they can even be found) can be difficult to adapt to new projects. - Most often, old code is hard to find or lost completely. - If found, time and effort is then wasted in adapting old scripts to a new purpose. - More commonly, one remembers the procedure but then has to write customized variants of the old code from scratch. - By abstracting expected inputs and procedures, packages allow for code to more easily be reused or adapted. - Of course, the package must be documented well in order to ensure that its use is straightforward. - ...and rigorously tested; otherwise, incorrect code might end up being used across many analyses. --- # "Good enough" practices for R packages 1. Version control (e.g., via `git`) - Use _formal version control_ to keep track of package development. - Version numbers of R packages alone are not enough to track changes. - Use git/GitHub to facilitate collaboration and engage in open source work. 2. Minimal package anatomy and documentation - R packages have have a particular directory structure and metadata format. - Documentation can be included in plain text files via [`Roxygen`](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html) 3. Unit testing (e.g., via the `testthat` suite) - To ensure that functions/classes work as expected, use static unit tests. - Tests should focus be _atomic_, focusing on a few lines of code. 4. Automated checking via continuous integration - Packages must be easily accessible across common operating systems. - Ensure that unit tests are re-run _every time_ that code is changed. --- class: inverse, middle, center # Git: the bare necessities --- # Why Git ## Why version control? - Keep a history of changes that can be easily parsed. - Be able to go back and forth in the project history. - No need to worry about breaking things that are working. - Collaboration: merge changes from multiple people. -- ## A brief history - Formal version control system pioneered by Linus Torvalds in 2005. - Designed to manage the source code for the Linux kernel. - Tracks any content, but best used with plain text files. -- ## Git - It's free, open source, and fast. - Distributed: You don't need access to a server. - Greatly simplifies the merging of simultaneous changes. --- # Basic usage ## Essential commands - Initialize a repository: `git init` - Create and/or change some files - See what you’ve changed: `git status`, `git diff`, `git log` - Indicate what changes to save: `git add` - Commit to those changes: `git commit` - Push the changes to GitHub: `git push` - Pull changes from collaborators: `git pull` - Note: `.gitignore` file allows extraneous files to be ignored. -- ## Learning more - [Version Control with Git (Software Carpentry)](http://swcarpentry.github.io/git-novice/) - [_Happy Git with R_ (Jenny Bryan)](https://happygitwithr.com/) - [Tools for Reproducible Research (Karl Broman)](https://kbroman.org/Tools4RR/) - [Introduction to Git (Berkeley's STAT 159/259)](http://www.jarrodmillman.com/rcsds/standard/git-intro.html) --- class: inverse, middle, center # Anatomy of an R package --- # Anatomy ```bash ph292rtools/ DESCRIPTION NAMESPACE .Rbuildignore R/trimmed_mean.R R/plots.R R/utils.R man/trimmed_mean.Rd man/plotting.Rd man/utils.Rd tests/testthat.R tests/testthat/test-trimmed_mean.R vignettes/intro_ph292rtools.Rmd ``` - Meta-data stored in the `DESCRIPTION` file. - Documentation files (`NAMESPACE`, `*.Rd`) auto-generated by `devtools::document()` - `.Rbuildignore` specifies files not relevant to the package (e.g., `.git`). --- # R files - Use the `R` directory to store `.R` source files. -- - Organize functions across multiple files. - Not a single file containing all functions. - But not necessary to have all functions in single files. - Give meaningful names to each file, e.g., `plotting.R` not `utils.R` -- - Follow code styling practices based on some style guide. - Google's: https://google.github.io/styleguide/Rguide.html - Tidyverse: https://style.tidyverse.org/ - Use the [`lintr` package](https://CRAN.R-project.org/package=lintr) to warn about style issues. - Use the [`styler`](https://CRAN.R-project.org/package=styler) package to enforce a particular style. -- - Use spaces, specify arguments by name, fully write out logicals: - `median(x = x, na.rm = FALSE)`, not `median(x)` or `median(x, na.rm=F)` - Respect the 80-column rule (80 characters per line). - Also, use spaces, not tabs...really, please don't. --- # Metadata - The `DESCRIPTION` file: ```bash Package: ph292rtools Title: The R Toolbox of Public Health 292 at UC Berkeley Version: 0.0.1 Authors@R: person("Nima", "Hejazi", email = "nh@nimahejazi.org", role = c("aut", "cre", "cph"), comment = c(ORCID = "0000-0002-7127-2789")) Description: Maintainer: Nima Hejazi <nh@nimahejazi.org> Depends: Imports: Suggests: License: MIT + file LICENSE URL: https://github.com/nhejazi/ph292rtools BugReports: https://github.com/nhejazi/ph292rtools/issues Encoding: UTF-8 LazyData: true RoxygenNote: 7.0.2 ``` -- - Auxiliary files: `.Rbuildignore`, `cran-comments.md`, `NEWS.md`, `LICENSE` - The `LICENSE` file can be included but must be `.Rbuildignore`'d. --- # Documentation with Roxygen - The [`roxygen2` package](https://CRAN.R-project.org/package=roxygen2) makes writing documentation much easier. ```r #' @title Trimmed mean #' @description Compute a truncated/trimmed mean that ignores some subset of #' observations based on where they fall in the empirical distribution. #' #' @param x A \code{numeric} vector of observations over which the trimmed mean #' is to be computed. #' @param prop A \code{numeric} scalar indicating proportion of observations #' to be ignored in the trimming process. #' @param direction A \code{character} indicating the direction in which the #' trimming operation is to be done. #' #' @export trimmed_mean <- function(x, prop, direction) { # we'll fill this in later } ``` -- - `@title` and `@description` give information about the function. - `@param` specifies descriptions of the arguments of the function. - `@export` says that the function is not meant to be internal (more later). --- # Testing and test files ## Types of tests - Checking of inputs - Stop if inputs aren't as expected; facilitates user-safety. - May use conditionals, `stopifnot`, or the `asserthat` package. - Unit tests - Does a small function give the right results in a specific case? - Written for each function, covering _atomic_ level functionality. - Integration tests - To check that larger multi-function tasks are working properly. - Regression tests - Compare output to saved results for long-term consistency. -- ## Organizing tests - The [`testthat` package](https://testthat.r-lib.org/) is a popular framework for unit testing. - In this case, a file `tests/testthat.R` runs all the unit tests. - Individual unit tests are kept in `tests/testthat/example_test.R` ??? - If you don't test your code, how do you know it works? - If you test your code, save and automate those tests. - Check the input to each function. - Write unit tests for each function. - Write some larger regression tests. - Turn bugs into tests. --- # Vignettes and static documentation - Vignettes are stored under the `vignettes` subdirectory. - These are most easily written in `.Rmd` files that combine code examples with explanations of package functionality. - These are a great resource for providing long-form documentation of the functionality of your package. -- - The `pkgdown` package provides utilities to generate reference websites. - This process directly uses the `roxygen2`-style documentation. - It's simple to do, requiring only a call to `pkgdown::build_site()`. -- - Check out: https://pkgdown.r-lib.org/ for its documetation. -- - The websites can easily be customized via `_pkgdown.yml`. --- # My friend Travis - Continuous integration services are designed to ensure that software can be built and tested with each incremental change. -- - Travis-CI is one of several popular continuous integration services. - Others include AppVeyor and CircleCI. - Travis allows free builds on Linux systems. - AppVeyor allows free builds on Windows systems. -- - For R packages, a standard practice is to set up builds on both Travis-CI and Appveyor, allowing Linux and Windows builds to be continuously monitored. - Travis-CI builds are customized through `.travis.yml`. - AppVeyor builds are customized through `appveyor.yml`. -- - The checking process runs `R CMD Check`, which includes unit testing, on a freshly provisioned system. -- - On both services, caching is helpful to avoid very long build/check times. --- # Let's roll our own - Use the [`usethis` package](https://usethis.r-lib.org/) for easy package setup. ```r library(here) library(usethis) # create package and change path create_package(here("ph292rtools")) proj_activate(here("ph292rtools")) # add license, readme, news, tests use_mit_license("My Name") use_readme_md() use_test("my-test") ``` -- - Let's browse https://usethis.r-lib.org/reference/index.html - For more, consider Hadley's book _R packages_: http://r-pkgs.had.co.nz/ - Bioconductor guidelines: https://www.bioconductor.org/developers/package-guidelines/ --- # Summary - R packages really aren't that hard. - R packages are really useful. - Distributing software and data - Organizing code for a paper - Organizing your misc. R functions - Look at others' packages, and learn from them. - Adopt the tools in the devtools and usethis packages --- class: center, middle # Me [nimahejazi.org](https://nimahejazi.org) _email:_ nhejazi -AT- berkeley -DOT- edu _twitter:_ [@nshejazi](https://twitter.com/nshejazi) _GitHub:_ [nhejazi](https://github.com/nhejazi)