Nima Hejazi

Assistant Professor of Biostatistics

Harvard T.H. Chan School of Public Health

Biography

I am an assistant professor in the Department of Biostatistics at the Harvard T.H. Chan School of Public Health. Prior to this, I was an NSF Mathematical Sciences Postdoctoral Research Fellow, working on causal inference and machine learning applied to problems with complex study designs, especially vaccine efficacy trials. I obtained my PhD in biostatistics from UC Berkeley, where I worked on non-/semi-parametric estimation and causal inference for continuous-valued exposures and on mediation analysis. During that time, I was a founding member of the core development team of the tlverse project, an open-source software ecosystem for targeted learning, and I was lucky to enjoy diverse scientific collaborations with the Fred Hutchinson Cancer Center, the Bill and Melinda Gates Foundation, and Netflix Research.

My research interests lie primarily in unifying statistical methodology for causal inference and machine learning under the central aim of developing efficient, robust, and assumption-lean inferential techniques tailored for the applied sciences. Broadly speaking, I am often motivated by methodological topics from non- and semi-parametric inference (that is, from an assumption-lean or model-agnostic perspective), high-dimensional inference, applications of (targeted or minimum) loss-based estimation, corrections for the use of biased sampling procedures, and the design of adaptive experiments. While my applied science interests are diverse, I have recently been captivated by problems that commonly arise in the study of infectious diseases and their epidemiology, including in related clinical trials. I am also deeply interested in high-performance statistical computing and open-source software development to promote reproducibility, transparency, and data analysis “hygiene” in applied statistics and statistical data science.

Interests

causal machine learning and model-free causal inference
non/semi-parametric inference and assumption-lean methods
high-dimensional inference and bias-correction techniques
nonparametric estimation and statistical machine learning
statistical computing and reproducible data science

Education

PhD in Biostatistics (designated emphasis in Computational & Genomic Biology), 2021
University of California, Berkeley

MA in Biostatistics, 2017
University of California, Berkeley

BA with a triple major in Molecular & Cell Biology (emphasis in Neurobiology), Psychology, and Public Health, 2015
University of California, Berkeley

Featured Publications

Nima Hejazi, Kara Rudolph, Mark van der Laan, Iván Díaz

February 2022 In Biostatistics

Nonparametric causal mediation analysis for stochastic interventional (in)direct effects

Causal mediation analysis has historically been limited in two important regards: (i) a focus has traditionally been placed on binary treatments and static interventions, and (ii) direct and indirect effect decompositions have been pursued that are only identifiable in the absence of intermediate confounders affected by treatment. We present a theoretical study of an (in)direct effect decomposition of the population intervention effect, defined by stochastic interventions jointly applied to the treatment and mediators. In contrast to existing proposals, our causal effects can be evaluated regardless of whether a treatment is categorical or continuous and remain well-defined even in the presence of intermediate confounders affected by treatment. Our (in)direct effects are identifiable without a restrictive assumption on cross-world counterfactual independencies, allowing for substantive conclusions drawn from them to be validated in randomized controlled trials. Beyond the novel effects introduced, we provide a careful study of nonparametric efficiency theory relevant for the construction of flexible, multiply robust estimators of our (in)direct effects, all the while avoiding undue restrictions induced by assuming parametric models of nuisance parameter functionals. To complement our nonparametric estimation strategy, we introduce inferential techniques for constructing confidence intervals and hypothesis tests, and discuss open source software implementing the proposed methodology.

Nima Hejazi, Mark van der Laan, Holly Janes, Peter Gilbert, David Benkeser

September 2020 In Biometrics

Efficient nonparametric inference on the effects of stochastic interventions under two-phase sampling, with applications to vaccine efficacy trials

The advent and subsequent widespread availability of preventive vaccines has altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high-burden diseases, including HIV, have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant infection or disease endpoints. However, measuring immune response is often costly, which has motivated the usage of two-phase sampling for immune response sampling in clinical trials of preventive vaccines. In such trials, measurement of immunological markers is performed on a subset of trial participants, where enrollment in this second phase is potentially contingent on the observed study outcome and other participant-level information. We propose nonparametric methodology for efficiently estimating a counterfactual parameter that quantifies the impact of a given immune response marker on the subsequent probability of infection. Along the way, we fill in a theoretical gap pertaining to the asymptotic behavior of nonparametric efficient estimators in the context of two-phase sampling, including a multiple robustness property enjoyed by our estimators. Techniques for constructing confidence intervals and hypothesis tests are presented, and an open source software implementation of the methodology, the txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial.

Iván Díaz, Nima Hejazi

February 2020 In Journal of the Royal Statistical Society, Series B (Statistical Methodology)

Causal mediation analysis for stochastic interventions

Mediation analysis in causal inference has traditionally focused on binary treatment regimes and deterministic interventions, as well as a decomposition of the average treatment effect in terms of direct and indirect effects. In this paper we present an analogous decomposition of the population intervention effect, defined through stochastic interventions. Population intervention effects provide a generalized framework in which a variety of interesting causal contrasts can be defined, including effects for continuous and categorical exposures. We show that identification of direct and indirect effects for the population intervention effect requires weaker assumptions than its average treatment effect counterpart. In particular, identification of direct effects is guaranteed in experiments that randomize the treatment and the mediator. We discuss various estimators of the direct and indirect effects, including substitution, re-weighted, and efficient estimators based on flexible regression techniques. Our efficient estimator is asymptotically linear under a condition requiring $n^{\frac{1}{4}}$-consistency of certain regression functions. We perform a simulation study in which we assess the finite-sample properties of our proposed estimators. We present the results of an illustrative study where we assess the effect of participation in a sports team on BMI among children, using mediators such as exercise habits, daily consumption of snacks, and overweight status.

Recent Publications

(see CV for a full list)

Nima Hejazi, Philippe Boileau, Mark van der Laan, Alan Hubbard (2023). A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology. In Statistical Methods in Medical Research.

David Benkeser, Nima Hejazi (2023). Doubly-robust inference in R using drtmle. In Observational Studies.

Nima Hejazi, Mark van der Laan (2023). Revisiting the propensity score's central role: Towards bridging balance and efficiency in the era of causal machine learning. In Observational Studies.

Nima Hejazi, Mark van der Laan, David Benkeser (2022). haldensify: Highly adaptive lasso conditional density estimation in R. In Journal of Open Source Software.

Philippe Boileau, Nima Hejazi, Mark van der Laan, Sandrine Dudoit (2022). Cross-validated loss-based covariance matrix estimator selection in high dimensions. In Journal of Computational and Graphical Statistics.

Recent & Upcoming Talks

Combining Causal Inference and Machine Learning for Model-agnostic Discovery in High-dimensional Biology

Mon, Mar 27, 2023 1:00 PM Boston, Massachusetts, USA

Evaluating Treatment Efficacy in Clinical Trials with Two-phase Designs Using Stochastic-interventional Causal Effects

Mon, Dec 19, 2022 4:20 PM London, England, UK

Using Stochastic-interventional Causal Effects to Evaluate Treatment Efficacy in Clinical Trials

Thu, Dec 1, 2022 12:00 PM Boston, MA, USA

Combining Causal Inference and Machine Learning for Model-agnostic Discovery in High-dimensional Biology

Fri, Nov 4, 2022 10:00 AM Edinburgh, Scotland, UK

Evaluating Treatment Efficacy in Vaccine Clinical Trials with Two-phase Designs Using Stochastic-interventional Causal Effects

Wed, Oct 5, 2022 12:00 PM Cambridge, MA, USA

Teaching

current courses

I won’t be teaching during the 2022-2023 academic year. I’ll resume in 2023-2024.

past courses

Public Health 290: Biomedical Big Data Capstone Seminar (graduate student instructor to Prof. Mark van der Laan); Spring 2021 at the University of California, Berkeley.
Public Health 240B (also Statistics 245B): Survival Analysis and Causality (graduate student instructor to Prof. Mark van der Laan); Fall 2020 at the University of California, Berkeley.
Public Health 290: Biomedical Big Data Capstone Seminar (graduate student instructor to Prof. Alan Hubbard); Spring 2020 at the University of California, Berkeley.
Public Health 242C (also Statistics 247C): Longitudinal Data Analysis (graduate student instructor to Prof. Alan Hubbard); Fall 2019 at the University of California, Berkeley.
Public Health 290: Targeted Learning in Biomedical Big Data (graduate student instructor to Prof. Mark van der Laan); Spring 2018 at the University of California, Berkeley.

upcoming workshops

Causal Mediation Analysis at the Society for Epidemiologic Research meeting; 2023 June; co-taught with Iván Díaz and Kara Rudolph.
Targeted Learning in the tlverse: Techniques and Tools for Causal Machine Learning at the Joint Statistical Meetings (JSM); 2023 August; co-taught with Mark van der Laan, Alan Hubbard, Ivana Malenica, and Rachael Phillips.

recent workshops

Targeted Learning: Advanced Methods for Causal Machine Learning at the Eastern North American Region of the International Biometric Society (ENAR) Spring Meeting; 2023 March; co-taught with Mark van der Laan, Alan Hubbard, Ivana Malenica, and Rachael Phillips.
Targeted Learning I: Causal Inference Meets Ensemble Machine Learning at the Society for Epidemiologic Research meeting; 2022 June; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
Targeted Learning II: Advanced Applications of Causal Inference at the Society for Epidemiologic Research meeting; 2022 June; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
Modern Causal Mediation Analysis at the Society for Epidemiologic Research meeting; 2022 May (online) and 2022 June (in-person); co-taught with Iván Díaz and Kara Rudolph.
Targeted Machine Learning of the Causal Effects of Dynamic and Shift Interventions with the tlverse R packages at the American Causal Inference Conference; 2022 May; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
Targeted Learning in the tlverse: Causal Inference Meets Ensemble Machine Learning at the Society for Epidemiologic Research Meeting; 2021 June; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
Modern Causal Mediation Analysis at the Society for Epidemiologic Research Meeting; 2021 May; co-taught with Iván Díaz and Kara Rudolph.
Targeted Learning in the tlverse: Causal Inference Meets Ensemble Machine Learning at the Eastern North American Region of the International Biometric Society (ENAR) Spring Meeting; 2021 March; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
The tlverse Software Ecosystem for Targeted Learning at the Conference on Statistical Practice; 2020 February; co-taught with Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
The tlverse Software Ecosystem for Causal Inference at the Atlantic Causal Inference Conference; 2019 May; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.

Carpentries workshops

I am a member of Software Carpentry and Data Carpentry, through which I work on curriculum development, maintenance of lesson materials, and workshop delivery.

Software Carpentry: Shell, Git, and R at the Berkeley Institute for Data Science; 2019 January; co-taught with Scott Peterson and Nelle Varoquaux.
Software Carpentry: Shell, Git, and Python at the Berkeley Institute for Data Science; 2018 July; co-taught with Kunal Marwaha.
Data Carpentry: Genomics at Lawrence Berkeley National Laboratory; 2018 May; co-taught with Adam Orr.

Software

Collected collateral damage from doing statistics research, hopefully useful to others.

Targeted Learning in the tlverse

The tlverse is an ecosystem of R packages for Targeted Learning, of which I am a co-founder and core developer. A few of the tlverse packages to which I’ve made significant contributions include:

sl3: An R package providing a modern implementation of the Super Learner ensemble modeling algorithm that simultaneously exposes a flexible grammar for composing arbitrary pipelines for machine learning tasks. Joint work with Jeremy Coyle, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin.
[Docs] | [GitHub]
origami: An R package exposing a generalized framework for applying a great variety of cross-validation schemes to arbitrary estimation functions. Joint work with Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
[Docs] | [GitHub] | [CRAN] | [Paper]
hal9001: An R package providing an efficient implementation of the Highly Adaptive Lasso (HAL), a nonparametric regression estimator achieving near-parametric convergence rates under relatively mild assumptions. Joint work with Jeremy Coyle, Rachael Phillips, Lars van der Laan, and Mark van der Laan.
[Docs] | [GitHub] | [CRAN] | [Paper]
tmle3shift: An R package for targeted maximum likelihood estimation of the causal effects of modified treatment policies on continuous-valued exposures, incorporating working marginal structural models for summarization of effect estimates. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub]

Causal Machine Learning

A significant focus of my research program centers on the intersection of causal inference and statistical machine learning. I’ve (co-)developed R packages for a range of problems: causal mediation analysis, evaluating the effects of stochastic interventions under two-phase sampling, conditional density estimation, causal segment discovery and offline policy evaluation, and survival analysis.

sherlock: An R package for employing causal machine learning and non/semi-parametric estimation to discover population segments (or subgroups) based on treatment effect heterogeneity. Flexible techniques for defining segment-specific treatment rules and efficient estimators of the causal effects of these dynamic treatment regimes are implemented. Joint work with Wenjing Zheng as part of an internship at Netflix Research.
[Docs] | [GitHub]
medshift: An R package for estimating the population intervention (in)direct effects based on stochastic interventions. Classical and efficient estimators are supported for the effects of incremental propensity score interventions and modified treatment policies. Joint work with Iván Díaz.
[Docs] | [GitHub]
medoutcon: An R package for efficient estimation of interventional (in)direct effects subject to intermediate confounding, including one-step and targeted minimum loss estimators. Joint work with Iván Díaz and Kara Rudolph.
[Docs] | [GitHub] | [Paper]
txshift: An R package for efficient estimation of and inference on causal effects of stochastic interventions on continuous-valued exposures. Robust estimation and efficient inference under two-phase sampling are supported. Joint work with David Benkeser.
[Docs] | [GitHub] | [CRAN] | [Paper]
haldensify: An R package for nonparametric conditional density estimation based on the highly adaptive lasso, designed for estimating the generalized propensity score. Joint work with David Benkeser and Mark van der Laan.
[Docs] | [GitHub] | [CRAN]
survtmle: An R package for the construction of targeted maximum likelihood estimates of marginal cumulative incidence in right-censored survival settings with and without competing risks, including estimation procedures that respect bounds. Joint work with David Benkeser.
[Docs] | [GitHub] | [CRAN]

High-Dimensional Biology

A parallel thread of my research concerns the development of novel statistical methodology for application in high-dimensional and computational biology. I have (co-)developed several R packages extending the Bioconductor Project.

biotmle: An R package for the model-free discovery of biomarkers from biological expression data, introducing a generalization of moderated statistics for variance stabilization of semiparametric estimators. Joint work with Alan Hubbard and Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor] | [Paper]
scPCA: An R package for sparse contrastive principal component analysis, facilitating the recovery of stable and low-dimensional patterns from high-dimensional biological data while removing technical artifacts by making use of control samples. Joint work with Philippe Boileau and Sandrine Dudoit.
[GitHub] | [Bioconductor] | [Paper]
methyvim: An R package for genome-wide analysis of differential methylation across CpG sites, applying causal variable importance. Joint work with Mark van der Laan. Deprecated after Bioconductor v3.12.
[Docs] | [GitHub] | ~~[Bioconductor]~~
adaptest: An R package for multiple hypothesis testing with data adaptive target parameters in high-dimensional settings using Targeted Learning. Joint work with Weixin Cai and Alan Hubbard. Deprecated after Bioconductor v3.12.
[GitHub] | ~~[Bioconductor]~~ | [Paper]

Other Assorted Adventures

cvCovEst: An R package for asymptotically optimal, cross-validated, loss-based selection of covariance matrix estimators, tailored for use in high-dimensional settings. Joint work with Philippe Boileau, Brian Collica, Mark van der Laan, and Sandrine Dudoit.
[Docs] | [GitHub] | [CRAN] | [Paper]
nima: An R package housing my personal R toolbox, written to support statistical computing for research.
[Docs] | [GitHub] | [CRAN]