Generalized application of empirical Bayes statistics to asymptotically linear parameters

Abstract

The exploratory analysis of high-dimensional biological sequencing data has received much attention because it allows the simultaneous screening of numerous biological characteristics at resolutions unimaginable just two decades ago. While the dimensionality of such data sets has grown in studies of environmental exposure and biomarkers, two important questions have received less attention than they deserve: (1) how can individual estimates of independent associations be derived in the context of many competing causes while avoiding model misspecification, and (2) how can accurate small-sample inference be obtained when data-adaptive techniques are employed in such contexts? This paper focuses on variable importance analysis in high-dimensional biological data sets with modest sample sizes, using semiparametric statistical models. We present a method that is robust in small samples yet does not rely on arbitrary parametric assumptions, in the context of studies of gene expression and environmental exposures. Such analyses face not only issues of multiple testing but also the problem of separating the associations of biological expression measures with exposure from the effects of confounders such as age, race, and smoking. Specifically, we propose the use of targeted minimum loss-based estimation, along with a generalization of the moderated empirical Bayes statistics of Smyth, relying on the influence curve representation of a statistical target parameter to obtain estimates of variable importance measures. The result is a data-adaptive approach that can estimate individual associations in high-dimensional data, even with relatively small sample sizes.
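The core idea of moderating influence-curve-based variances can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that each biomarker already has a parameter estimate and estimated influence curve values for every observation, and that the prior degrees of freedom `d0` and prior variance `s0_sq` are supplied rather than estimated from the data as in Smyth's procedure. The function name and all argument names are hypothetical.

```python
import numpy as np


def moderated_t_from_influence_curves(psi, ic, d0, s0_sq):
    """Moderated t-statistics from influence-curve values (illustrative sketch).

    psi   : array of shape (G,), one parameter estimate per biomarker
    ic    : array of shape (G, n), estimated influence-curve values per observation
    d0    : prior degrees of freedom (assumed given here, not estimated)
    s0_sq : prior variance (assumed given here, not estimated)
    """
    psi = np.asarray(psi, dtype=float)
    ic = np.asarray(ic, dtype=float)
    n = ic.shape[1]
    d = n - 1  # residual degrees of freedom for each biomarker

    # Per-biomarker sample variance of the influence-curve values.
    s_sq = ic.var(axis=1, ddof=1)

    # Shrink each variance toward the prior value (posterior variance form).
    s_tilde_sq = (d0 * s0_sq + d * s_sq) / (d0 + d)

    # Standard error of an asymptotically linear estimator: sqrt(variance / n).
    se = np.sqrt(s_tilde_sq / n)

    t_mod = psi / se
    return t_mod, s_tilde_sq, d0 + d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ic_values = rng.normal(size=(5, 12))  # 5 biomarkers, 12 observations (simulated)
    estimates = rng.normal(size=5)
    t_mod, s_tilde_sq, df = moderated_t_from_influence_curves(
        estimates, ic_values, d0=4.0, s0_sq=1.0
    )
    print(t_mod, df)
```

With few observations, the shrinkage step pulls unusually small or large variance estimates toward the common prior value, which is what stabilizes the test statistics across many biomarkers.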

Publication
Master’s thesis, Graduate Division, University of California, Berkeley
Date