Finite-Sample Inference and Moderated Statistics for Asymptotically Linear Parameters

Biostatistics Seminar Series, Division of Biostatistics, University of California, Berkeley
Berkeley, California, United States


Two important questions have received less attention than they deserve in the analysis of high-dimensional biological data: (1) how can individual estimates of independent associations be derived in the context of many competing causes while avoiding model mis-specification, and (2) how can accurate small-sample inference be obtained when data-adaptive techniques are employed in such contexts? We focus on variable importance analysis in high-dimensional biological data sets with modest sample sizes, using semi-parametric statistical models. We present a method that is robust in small samples yet does not rely on arbitrary parametric assumptions, and we apply it to studies of gene expression and environmental exposures. Such analyses face not only the problem of multiple testing but also the challenge of separating the association between biological expression measures and exposure from the effects of confounders such as age, race, and smoking. Specifically, we propose combining targeted minimum loss-based estimation with a generalization of the moderated empirical Bayes statistics of Smyth, which relies on the influence curve representation of a statistical target parameter to obtain estimates of variable importance measures. The result is a data-adaptive approach that can estimate individual associations in high-dimensional data, even when sample sizes are relatively small.
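
As a rough illustration of the moderation step only, the sketch below shrinks each per-gene influence curve variance toward a common prior variance and forms a moderated t-statistic, following the standard form of Smyth's empirical Bayes variance estimator. It is a minimal sketch under stated assumptions, not the procedure presented in the talk: the function name moderated_t, the array layout (one column of estimated influence curve values per gene), and the choice to pass the prior degrees of freedom d0 and prior variance s0_sq in as known inputs are all hypothetical. In practice those hyperparameters would be estimated from the full set of per-gene variances, and the influence curve values would come from a fitted targeted minimum loss-based estimator.

```python
import numpy as np


def moderated_t(psi_hat, ic, d0, s0_sq):
    """Moderated t-statistics from influence curve (IC) estimates.

    psi_hat : length-G array of parameter estimates, one per gene.
    ic      : (n, G) array; column g holds the estimated IC values
              of the n observations for gene g.
    d0      : prior degrees of freedom (assumed known here).
    s0_sq   : prior variance (assumed known here).

    Returns the moderated t-statistics and their degrees of freedom.
    """
    psi_hat = np.asarray(psi_hat, dtype=float)
    ic = np.asarray(ic, dtype=float)
    n = ic.shape[0]
    d = n - 1  # residual degrees of freedom of the sample variance

    # Per-gene sample variance of the influence curve.
    s_sq = ic.var(axis=0, ddof=1)

    # Shrink each variance toward the prior: a weighted average of the
    # prior variance and the observed variance, weighted by their
    # respective degrees of freedom.
    s_tilde_sq = (d0 * s0_sq + d * s_sq) / (d0 + d)

    # For an asymptotically linear estimator, Var(psi_hat) is
    # approximately Var(IC) / n.
    se = np.sqrt(s_tilde_sq / n)

    return psi_hat / se, d0 + d


# Toy usage with two genes and two observations.
t_stats, df = moderated_t(
    psi_hat=[1.0, 1.0],
    ic=[[1.0, 2.0], [-1.0, -2.0]],
    d0=1.0,
    s0_sq=2.0,
)
```

Setting d0 to zero recovers the ordinary t-statistic based on the influence curve variance alone, while larger values of d0 pull the noisy per-gene variances more strongly toward the shared prior, which is what stabilizes inference when n is small.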