Research Projects


The Effect of Batch Size in Single Neuron Autoencoders

Nikhil Ghosh, Spencer Frei, Wooseok Ha

Dictionary learning is an important data science technique used in many scientific areas such as genomics. One assumes that the data are sparse combinations of elements of an unknown dictionary, and the goal is to recover this dictionary. Can we solve this problem with algorithms that do not require specifying the data generative model? For instance, can we simply train a neural network using stochastic gradient descent (SGD), a strategy that has yielded immense practical successes across many domains? We make progress on this question by studying it in a simplified setting. We show that when the data are 1-sparse with respect to an orthogonal dictionary, training a single-neuron ReLU autoencoder recovers an element of this dictionary if and only if the batch size is smaller than the size of the dictionary. Even in this simplified setting, the analysis is highly nontrivial due to the stochastic nature of SGD and the non-convex objective function. We introduce tools from non-homogeneous random walk theory to tackle the problem, and we believe these tools may be useful for analyzing other machine learning algorithms.
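
The setting can be illustrated numerically. Below is a minimal sketch under our own simplifying assumptions (the dictionary is taken to be the identity, and the dimension, step size, and step count are arbitrary illustrative choices): a weight-tied single-neuron ReLU autoencoder is trained with minibatch SGD on 1-sparse data, and we measure how well the learned weight aligns with a dictionary element.

```python
# Sketch: single-neuron ReLU autoencoder trained by SGD on 1-sparse data.
# Dictionary = identity for simplicity; hyperparameters are illustrative only.
import torch

torch.manual_seed(0)
d = 32                                   # dictionary size = ambient dimension

def sample_batch(batch_size):
    # each sample is a standard basis vector, i.e. 1-sparse w.r.t. the identity dictionary
    idx = torch.randint(0, d, (batch_size,))
    return torch.eye(d)[idx]

def train(batch_size, steps=5000, lr=0.05):
    w = torch.randn(d, requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        x = sample_batch(batch_size)
        x_hat = torch.relu(x @ w).unsqueeze(1) * w    # weight-tied single ReLU neuron
        loss = ((x - x_hat) ** 2).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    w_unit = w.detach() / w.detach().norm()
    return w_unit.abs().max().item()      # 1.0 means the weight equals a dictionary element

for b in (1, d // 2, d):
    print(f"batch size {b:3d}: max alignment with a dictionary element = {train(b):.3f}")
```

Printing the alignment for small and large batch sizes gives a quick empirical probe of the batch-size effect described above.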


Heart disease genetics

Tiffany Tang, Omer Ronen, Abhineet Agarwal, Merle Behr, Karl Kumbier

Approximately 1 in 500 people suffer from hypertrophic cardiomyopathy (HCM), a common genetic heart disease in which heart muscle cells are larger than normal, causing the heart to work harder than it should. As a result, those with HCM are at higher risk for heart failure and other heart-related complications later in life. To better understand HCM, our group is building a stability-driven pipeline to identify and understand the genes and gene interactions that affect the size of heart muscle cells. This work builds on ideas from our previous work, including iterative Random Forests and epiTree. In collaboration with the Ashley lab at Stanford, we are also validating our scientific recommendations through wet-lab experiments.


Interpreting neural networks

Chandan Singh, Wooseok Ha, Robbie Netzorg, Jamie Murdoch, Laura Rieger

This project is part of a broad theme running through our group on interpretable machine learning. Deep neural networks (DNNs) have achieved impressive predictive performance due to their ability to learn complex, non-linear relationships between variables. However, the inability to effectively visualize these relationships has led to DNNs being characterized as black boxes and has consequently limited their applications. To ameliorate these problems, we have introduced multiple new algorithms for interpreting individual decisions made by neural networks.

This line of work began with scoring interactions in LSTMs via contextual decomposition (CD, ICLR 2018) and was later extended to generate hierarchical interpretations for a large class of neural networks, including CNNs (ACD, ICLR 2019). We then used these techniques to improve the generalization of neural networks during training (CDEP, ICML 2020) and to investigate the importance of different transformations in predicting cosmological parameters (TRIM, ICLR Workshop 2020).
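
To give a flavor of the explanation-penalization idea behind CDEP, here is a minimal sketch in which an attribution penalty is added to the training loss to discourage the model from relying on features assumed to be irrelevant. For simplicity, the attribution here is a plain input gradient rather than the contextual decomposition scores used in the actual CDEP method, and the network, data, and penalty weight are toy placeholders.

```python
# Sketch of penalizing explanations during training (CDEP-style idea).
# Stand-in attribution: input gradients, NOT the contextual decomposition scores of CDEP.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1.0                                  # strength of the explanation penalty

# toy batch: 10 features, of which the last 5 are assumed to be known nuisances
x = torch.randn(64, 10, requires_grad=True)
y = (x[:, 0] > 0).long()
nuisance_mask = torch.zeros(10)
nuisance_mask[5:] = 1.0

for _ in range(200):
    logits = model(x)
    task_loss = nn.functional.cross_entropy(logits, y)
    # crude attribution: gradient of the summed logits w.r.t. the input
    grads, = torch.autograd.grad(logits.sum(), x, create_graph=True)
    expl_penalty = (grads.abs() * nuisance_mask).mean()   # penalize reliance on nuisances
    loss = task_loss + lam * expl_penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
```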


Information Extraction for Pathology Reports

Briton Park, Aliyah Hsu, Nicholas Altieri

Decades of pathology reports exist that could be used to analyze the efficacy of treatments or to augment disease diagnoses. However, they are currently locked away in free text, which prevents the application of statistical techniques. Our group is developing methods to convert these pathology reports, across heterogeneous cancers and institutions, into a structured database so that statistical methods can be brought to bear on them.


Adaptive wavelets

Wooseok Ha, Chandan Singh

This project aims to leverage the power of wavelets to achieve high predictive performance while maintaining interpretability and computational efficiency. Recent deep-learning models have achieved impressive prediction performance, but often sacrifice both of these. This line of work begins with adaptive wavelet distillation (AWD), a method which aims to distill information from a trained neural network into a wavelet transform. Specifically, AWD penalizes feature attributions of a neural network in the wavelet domain to learn an effective multi-resolution wavelet transform. We show how AWD works in two real-world settings: cosmological parameter inference, in close collaboration with cosmologist Francois Lanusse, and molecular-partner prediction, with Gokul Upadhyayula, head of the Advanced Bioimaging Center at UC Berkeley. In both cases, AWD yields a scientifically interpretable and concise model whose predictive performance is better than that of state-of-the-art neural networks.
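
The core idea can be caricatured as follows: learn a wavelet-style filter so that a trained model's attributions become sparse in the corresponding transform domain. The sketch below is a heavily simplified illustration, not the AWD formulation itself; the frozen stand-in model, the input-times-gradient attribution, the one-level transform, and the soft filter constraints are all placeholder assumptions.

```python
# Highly simplified illustration of wavelet distillation (NOT the AWD algorithm):
# optimize a wavelet-style filter so that a frozen model's saliency maps are
# sparse after the corresponding one-level transform.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
L = 64                                                   # 1-D signal length
model = torch.nn.Sequential(torch.nn.Linear(L, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))      # stand-in "trained" model

def attribution(x):
    # input-times-gradient saliency of the frozen model, detached from the graph
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(model(x).sum(), x)
    return (grad * x).detach()

h = torch.nn.Parameter(torch.randn(8))                   # learnable low-pass filter
signs = torch.tensor([1.0, -1.0]).repeat(4)              # for the alternating flip
opt = torch.optim.Adam([h], lr=1e-2)

for _ in range(500):
    x = torch.randn(32, L)
    attr = attribution(x)                                # (batch, L)
    g = h.flip(0) * signs                                # high-pass companion filter
    w = torch.stack([h, g]).unsqueeze(1)                 # (2, 1, 8) conv filters
    coeffs = F.conv1d(attr.unsqueeze(1), w, stride=2)    # one-level wavelet-style transform
    sparsity = coeffs.abs().mean()                       # encourage sparse attributions
    # soft constraints nudging h toward a valid wavelet low-pass filter
    constraint = (h.sum() - 2 ** 0.5) ** 2 + (h.pow(2).sum() - 1.0) ** 2
    loss = sparsity + constraint
    opt.zero_grad()
    loss.backward()
    opt.step()
```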

Causal inference: heterogeneous treatment effect estimation

Sören Künzel, Simon Walter

Most studies investigate phenomena that have different manifestations in different circumstances, and these differences cannot be captured by statistics that estimate population-average or sample-average effects. Identifying this heterogeneity has assumed increasing importance over the last quarter century in several domains. For example, technology companies now conduct experiments on tens or hundreds of millions of subjects, which provides the power to detect fine-grained heterogeneity; this is desirable in practice because many interventions have zero effect on the outcome of interest for all but a small fraction of subjects. Similarly, there is increasing focus in medicine on providing treatments that are tailored to the peculiarities of individual patients. There is a rich literature on heterogeneous treatment effect estimation, beginning at least as early as 1865.
Our group suggested a new procedure for heterogeneous treatment effect estimation, called the X-learner, that enjoys some optimality properties. The X-learner is a two-stage meta-algorithm: it first models the unrealized counterfactual outcome and then applies an ordinary regression model to the difference between the actual and counterfactual outcomes to produce estimators of the conditional average treatment effect (CATE); under certain conditions, this procedure achieves a minimax rate.
In separate work, we suggest refinements to an existing method, the modified outcome method, that render the procedure doubly robust: it consistently estimates the treatment effect if at least one of the two models, that for the counterfactual outcome or that for the probability of treatment assignment, is correctly specified.
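
For concreteness, here is a minimal sketch of the X-learner described above, using generic random-forest base learners from scikit-learn; the choice of estimators, propensity model, and synthetic data are illustrative assumptions, not part of the method itself.

```python
# Minimal X-learner sketch for CATE estimation (base learners are illustrative choices).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def x_learner_cate(X, y, t):
    """Estimate conditional average treatment effects with the X-learner."""
    X1, y1 = X[t == 1], y[t == 1]           # treated units
    X0, y0 = X[t == 0], y[t == 0]           # control units

    # Stage 1: outcome models for each arm
    mu1 = RandomForestRegressor(n_estimators=200).fit(X1, y1)
    mu0 = RandomForestRegressor(n_estimators=200).fit(X0, y0)

    # Stage 2: imputed individual treatment effects, then regress them on X
    d1 = y1 - mu0.predict(X1)               # treated: observed minus imputed control outcome
    d0 = mu1.predict(X0) - y0               # control: imputed treated outcome minus observed
    tau1 = RandomForestRegressor(n_estimators=200).fit(X1, d1)
    tau0 = RandomForestRegressor(n_estimators=200).fit(X0, d0)

    # Combine the two CATE estimators with a propensity-score weight
    g = RandomForestClassifier(n_estimators=200).fit(X, t).predict_proba(X)[:, 1]
    return g * tau0.predict(X) + (1 - g) * tau1.predict(X)

# toy usage on synthetic data with a heterogeneous effect of 1 + X[:, 1]
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
t = rng.binomial(1, 0.3, size=2000)
y = X[:, 0] + t * (1 + X[:, 1]) + rng.normal(scale=0.5, size=2000)
print(x_learner_cate(X, y, t)[:5])
```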

Gene expression study

Karl Kumbier, Yu Wang

A fundamental problem in systems biology is to understand how regulatory interactions drive development and function of living organisms. We are working with the Sue Celniker and Ben Brown Labs at Lawrence Berkeley National Laboratory to investigate interactions among transcription factors (TFs) in Drosophila embryos. As part of this ongoing collaboration, we developed stability-driven Nonnegative Matrix Factorization (staNMF) to decompose gene expression images into “principal patterns” (PPs) that can be used to relate TFs to pre-organ regions. In addition, we developed iterative Random Forests (iRF) to identify stable, high-order interactions in high-throughput genomic data. Our ongoing work builds on these methods to identify high-quality experimental targets for wet lab validation.
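
The stability principle behind staNMF can be sketched as follows: fit NMF with several random initializations at each candidate rank and prefer ranks whose learned dictionaries agree across runs. The similarity score and synthetic data below are simplified stand-ins, not the exact instability measure used in staNMF.

```python
# Rough sketch of stability-driven rank selection for NMF (simplified scoring,
# not the exact staNMF instability measure).
import numpy as np
from sklearn.decomposition import NMF

def dictionary_similarity(D1, D2):
    # match atoms of the two dictionaries by their best absolute correlation
    k = D1.shape[1]
    C = np.abs(np.corrcoef(D1.T, D2.T)[:k, k:])
    return C.max(axis=1).mean()

def stability_score(X, rank, n_runs=5):
    dicts = [NMF(n_components=rank, init="random", random_state=s,
                 max_iter=500).fit(X).components_.T for s in range(n_runs)]
    scores = [dictionary_similarity(dicts[i], dicts[j])
              for i in range(n_runs) for j in range(i + 1, n_runs)]
    return float(np.mean(scores))

# toy usage: nonnegative data of (approximate) rank 5, plus a little noise
rng = np.random.default_rng(0)
X = rng.random((100, 5)) @ rng.random((5, 50)) + 0.01 * rng.random((100, 50))
for rank in (3, 5, 10):
    print(f"rank {rank:2d}: stability {stability_score(X, rank):.3f}")
```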

Sampling Algorithms

Raaz Dwivedi, Yuansi Chen

Drawing samples from a known distribution is a core computational challenge common to many disciplines, with applications in statistics, probability, operations research, and other areas involving stochastic models. Recent decades have witnessed the great success of Markov chain Monte Carlo (MCMC) algorithms. These methods are based on constructing a Markov chain whose stationary distribution is equal to the target distribution, and then drawing samples by simulating the chain for a certain number of steps. From a theoretical viewpoint, a core challenge is to provide convergence guarantees for such algorithms, namely bounds on the number of steps required to produce an approximate sample from the target distribution. In our work, we provide theoretical guarantees for several sampling algorithms as a function of the problem parameters. Algorithms that we have analyzed include randomized interior-point methods for sampling from polytopes and certain Langevin algorithms for sampling from log-concave distributions. Such guarantees can then be used to provide estimates of expectations of functions, probabilities of events, and volumes of certain sets.
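
As an illustration of the Langevin family mentioned above, here is a minimal sketch of the unadjusted Langevin algorithm (ULA) on a toy log-concave target, a standard Gaussian with f(x) = ||x||^2 / 2; the target, step size, and chain length are illustrative choices, not the specific variants analyzed in our work.

```python
# Minimal unadjusted Langevin algorithm (ULA) sketch for a log-concave target
# p(x) ∝ exp(-f(x)), illustrated on a standard Gaussian.
import numpy as np

def ula(grad_f, x0, step, n_steps, rng):
    """Simulate x_{k+1} = x_k - step * grad_f(x_k) + sqrt(2 * step) * noise."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - step * grad_f(x) + np.sqrt(2 * step) * noise
        samples.append(x.copy())
    return np.array(samples)

rng = np.random.default_rng(0)
samples = ula(grad_f=lambda x: x,            # gradient of f for the standard Gaussian
              x0=np.ones(10) * 5.0,          # start far from the mode
              step=0.05, n_steps=5000, rng=rng)
# after burn-in, empirical moments should be close to those of N(0, I)
print(samples[1000:].mean(axis=0).round(2))
```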

Neuroscience: understanding the visual pathway

Reza Abbasi, Yuansi Chen

The volume and quality of data recorded from the brain are constantly increasing, giving us a better view of mental processes. We collaborate with neuroscience labs, primarily the Gallant lab, to develop methodology for analyzing such data. We focus on understanding vision by studying the representation of images and videos in early visual areas. These experiments are great examples of modern statistical work: both the stimuli (a video, or sequence of images) and the responses (continuous brain scans, or electrode recordings) are high-dimensional structured objects. We develop principled methods to relate the stimuli and responses for both prediction and interpretation purposes. These include, among others, methods for feature extraction (both learned and engineered), such as ones building on the scattering transform. Additionally, we focus on building methods for interpretation, including DeepTune (a method for generating stable maximally activating stimuli from deep learning models) and compression. Our ongoing work focuses on developing interpretation methods for population-level analysis.
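
A generic illustration of the stimulus-response modeling workflow is a ridge-regression encoding model: extract features from the stimuli and fit a separate linear map to each recorded response channel, evaluating held-out predictive correlation per channel. The features, model, and synthetic data below are placeholders and do not represent the group's specific methods such as DeepTune.

```python
# Generic encoding-model baseline (illustrative, not the group's specific methodology):
# ridge regression from stimulus features to each response channel.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_time, n_features, n_channels = 1000, 200, 50
features = rng.normal(size=(n_time, n_features))        # e.g. scattering-type coefficients
true_w = rng.normal(size=(n_features, n_channels)) * (rng.random((n_features, n_channels)) < 0.05)
responses = features @ true_w + rng.normal(scale=1.0, size=(n_time, n_channels))

X_tr, X_te, Y_tr, Y_te = train_test_split(features, responses, test_size=0.25, random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, Y_tr)   # multi-output ridge fit
pred = model.predict(X_te)

# prediction accuracy per channel: correlation between held-out responses and predictions
corrs = [np.corrcoef(Y_te[:, c], pred[:, c])[0, 1] for c in range(n_channels)]
print(f"median held-out correlation: {np.median(corrs):.2f}")
```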