Recent Publications

(see CV for a full list)

Current statistical inference problems in areas like astronomy, genomics, and marketing routinely involve the simultaneous testing of thousands of null hypotheses. For high-dimensional multivariate distributions, these hypotheses may concern a wide range of parameters, with complex and unknown dependence structures among variables. In analyzing such hypothesis testing procedures, gains in efficiency and power can be achieved by performing variable reduction on the set of hypotheses prior to testing. We present an approach using data-adaptive multiple testing that applies data mining techniques to screen the full set of covariates on equally sized partitions of the sample via cross-validation. This generalized screening procedure is used to create average ranks for covariates, which are then used to generate a reduced (sub)set of hypotheses.
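The screening step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The screening statistic (absolute Pearson correlation with the outcome), the choice to rank covariates on the training portion of each fold, and the names `average_rank_screen`, `n_folds`, and `n_keep` are all assumptions made for illustration; the abstract itself only specifies "data mining techniques" applied across equally sized partitions.

```python
import numpy as np

def average_rank_screen(X, y, n_folds=5, n_keep=10, seed=0):
    """Rank covariates within each cross-validation split, then average the ranks.

    Hypothetical helper: the screening statistic used here (absolute
    correlation with y) stands in for whatever data mining method is used.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Split a random permutation of the sample into (nearly) equal partitions.
    folds = np.array_split(rng.permutation(n), n_folds)
    ranks = np.empty((n_folds, p))
    for v, held_out in enumerate(folds):
        train = np.setdiff1d(np.arange(n), held_out)
        Xc = X[train] - X[train].mean(axis=0)
        yc = y[train] - y[train].mean()
        # Absolute Pearson correlation of each covariate with the outcome.
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        score = np.abs(Xc.T @ yc) / denom
        # Rank 1 goes to the covariate with the strongest association.
        order = np.argsort(-score)
        ranks[v, order] = np.arange(1, p + 1)
    mean_rank = ranks.mean(axis=0)
    # The reduced set of hypotheses: covariates with the best average rank.
    keep = np.argsort(mean_rank)[:n_keep]
    return keep, mean_rank

# Simulated data: only covariates 0 and 1 are associated with the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = 3 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=200)
keep, mean_rank = average_rank_screen(X, y, n_folds=5, n_keep=2)
print(sorted(keep.tolist()))
```

Hypothesis tests would then be run only on the covariates in `keep`, so the multiple testing correction applies to a much smaller set.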

We focus on variable importance analysis in high-dimensional biological data sets with modest sample sizes, using semiparametric statistical models. We present a method that is robust in small samples, yet does not rely on arbitrary parametric assumptions, in the context of studies of gene expression and environmental exposures. Such analyses face not only the issue of multiple testing, but also the problem of teasing out the associations of biological expression measures with exposure in the presence of confounders such as age, race, and smoking. Specifically, we propose the use of targeted minimum loss-based estimation, along with a generalization of the moderated empirical Bayes statistics, to obtain estimates of variable importance measures. The result is a data-adaptive approach that can estimate individual associations in high-dimensional data, even when sample sizes are relatively small.

biotmle is an R package that facilitates biomarker discovery by generalizing the moderated statistics for use with asymptotically linear target parameters.

origami is an R package that provides a general framework for the application of cross-validation schemes to particular functions. By allowing arbitrary lists of results, origami accommodates a range of cross-validation applications.
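The idea of applying a cross-validation scheme to an arbitrary function, with each fold returning a list of results, can be sketched as follows. This is a Python analogue for illustration only and does not reproduce origami's R interface; the names `make_folds`, `cross_validate`, and `mean_predictor` as written here are hypothetical helpers defined in the block.

```python
import numpy as np

def make_folds(n, n_folds, seed=0):
    """Return a list of (training indices, validation indices) pairs."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n), n_folds)
    all_idx = np.arange(n)
    return [(np.setdiff1d(all_idx, valid), valid) for valid in parts]

def cross_validate(cv_fun, folds, **kwargs):
    """Apply cv_fun to every fold and combine the per-fold results by name.

    cv_fun may return a dict with any keys, so the same driver serves
    many different cross-validation applications.
    """
    per_fold = [cv_fun(train, valid, **kwargs) for train, valid in folds]
    return {key: [res[key] for res in per_fold] for key in per_fold[0]}

def mean_predictor(train, valid, y):
    """Example fold function: predict the training mean, score on validation."""
    pred = y[train].mean()
    mse = float(np.mean((y[valid] - pred) ** 2))
    return {"mse": mse, "n_valid": len(valid)}

y = np.arange(20, dtype=float)
results = cross_validate(mean_predictor, make_folds(len(y), 4), y=y)
print(results["n_valid"], np.mean(results["mse"]))
```

Because the driver only assumes that each fold returns named results, swapping `mean_predictor` for a function that fits a model, returns predictions, or computes several losses requires no change to `cross_validate`.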

Recent Talks

(see CV for a full list)

Biostatistics Seminar Series, Division of Biostatistics, University of California, Berkeley

Graduate student admit day, Division of Biostatistics, University of California, Berkeley

The Hacker Within, Berkeley Institute for Data Science

Statistical Causal Inference and Applications to Genetics, Centre de Recherches Mathématiques