one Industry Alliance speaker @ 30 minutes, two students @ 15 minutes each

Statistical Experiences Outside Academia

Affiliation: Walter and Eliza Hall Institute of Medical Research and University of California, Berkeley

Abstract: Apart from a 4½-year spell in a government scientific and industrial research organization, I've spent my career in academia: a university or research institute. This does not mean I have no statistical interest in what goes on in the outside world. On the contrary, some of my most exciting, challenging and rewarding statistical experiences have come, and continue to come, from my interactions with government, business and industry. In this talk I'll give some partial answers to the questions: how, why, when, where and what? A past example I'll touch upon spanned several years and three Berkeley PhD theses, and involved modeling and managing a California salmon population. One analysis centred around a non-linear state-space model, and in the course of the research I failed to discover the particle filter. More recent challenges involve practical aspects of personalized medicine, such as maintaining assay quality and identifying the subset of a population that benefits from a treatment. It will become clear during my talk why I am a big supporter of the department's IAP and BSTARS, and why I hope that both the program and the symposium prosper.

Large Scale Statistical Learning Problems

Affiliation: University of California, Berkeley

Abstract: In many large-scale, high-dimensional prediction problems, computational resources have a critical impact on performance. This talk describes recent work on problems of this kind in three settings. First, we consider model selection problems: given limited computational resources, is it better to gather more data and estimate a simpler model, or gather less data and estimate a more complex model? We introduce methods whose performance is near-optimal for a given amount of computation, in the sense that devoting all of our computational resources to the best model would not have led to a significant performance improvement. Second, we consider stochastic optimization procedures for statistical learning problems so large that they require distributed computational resources. We consider in particular non-smooth convex optimization problems, such as those that arise in pattern classification methods like support vector machines, and we introduce a method for distributed stochastic optimization with optimal convergence rates. Finally, we discuss methods for learning to control Markov decision processes, which are natural models for sequential decision problems. We introduce a strategy that performs well compared to any policy in a given comparison class. This approach is promising for large-scale sequential decision problems, where computing a near-optimal policy is intractable, since its performance depends on the complexity of the comparison class rather than the size of the state space. Joint work with Yasin Abbasi-Yadkori, Alekh Agarwal, John Duchi and Martin Wainwright.
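For the non-smooth convex setting mentioned above, a minimal single-machine sketch (not the distributed method from the talk; function names and parameters are illustrative) is stochastic subgradient descent on the hinge loss of a linear SVM:

```python
# Single-machine sketch: stochastic subgradient descent (Pegasos-style step
# sizes) on the non-smooth hinge loss of a linear SVM. Names are illustrative.

def svm_sgd(data, lam=0.1, epochs=200):
    """data: list of (x, y) pairs, x a list of floats, y in {-1, +1}."""
    w = [0.0] * len(data[0][0])
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(len(w)):
                # subgradient of lam/2 * |w|^2 + hinge(1 - margin)
                g = lam * w[i] - (y * x[i] if margin < 1 else 0.0)
                w[i] -= eta * g
    return w
```

The hinge loss is non-differentiable at margin 1, so each step uses a subgradient; the distributed question is how to parallelize updates of this kind without losing the optimal rate.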

Data Enriched Linear Regression

Affiliation: Google, Statistician

Abstract: Even in the age of big data, small data sets remain relevant. One problem faced by Internet companies is combining small data sets of high-quality but high-cost observations (e.g. a panel recruited by a probability sample) with larger data sources such as log files. More generally, one source may have lower bias while the other has lower variance. Regressions in the large data set may be similar, though not identical, to those in the smaller one. We address this problem via Stein shrinkage between the two data sets, where the goal is to predict in the small data set. The method generalizes small area estimation from survey sampling, and is also an example of transfer learning. For linear regression we give conditions under which data enriched linear regression uniformly outperforms the estimate that simply uses the small data set, no matter how large the bias between the two populations is. The improvement sets in at dimension 5, rather than the dimension 3 that arises in Stein shrinkage of means. We also look at L1 alternatives to Stein shrinkage. If time allows, we may also discuss applications to reach estimation for online video ads. This is joint work with colleagues Minghui Shi, Art Owen, Jim Koehler and Nicolas Remy.
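As a toy illustration of the idea (my own sketch, not the authors' estimator; the plug-in weight below is an assumption), consider shrinking a small-sample regression slope toward a large-sample, possibly biased one:

```python
# Toy sketch (not the authors' estimator): shrink the small-sample OLS slope
# toward the large-sample, possibly biased slope. The plug-in weight is an
# illustrative assumption.

def ols_slope(xs, ys):
    """Least-squares slope for a no-intercept model y = b*x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def enriched_slope(xs_small, ys_small, xs_big, ys_big, var_small, bias_sq):
    """Convex combination trading the small set's variance against the
    big set's squared bias, in the spirit of Stein shrinkage."""
    b_small = ols_slope(xs_small, ys_small)
    b_big = ols_slope(xs_big, ys_big)
    w = var_small / (var_small + bias_sq)  # weight on the biased estimate
    return (1 - w) * b_small + w * b_big
```

The resulting estimate always lies between the two slopes: the noisier the small set (or the smaller the bias), the further it moves toward the large-set estimate.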

Statistics at Genentech - An Overview

Affiliation: Genentech, Global Head and Director, Nonclinical Biostatistics

Abstract: Coming soon

Biology Is Noisy, Sparse and Ambiguous: Developing Advanced Molecular Diagnostics

Affiliation: Veracyte, Inc. Chief Scientific Officer and Senior VP, Research and Development

Abstract: Clinical samples, especially small biopsies taken in a doctor's office, present unique challenges for training classifiers. Not only are these specimens heterogeneous and the data derived from them noisy, but collection methods also introduce variability, and training class labels can be ambiguous. We have encountered all of these problems, and more, in the development of a molecular diagnostic that classifies thyroid nodules pre-operatively as benign or malignant. Our molecular test, which measures the RNA expression of 167 genes, classifies thyroid fine-needle aspirates using a support vector machine, and in a large independent validation of more than 5000 patients it has been shown to predict class with very high sensitivity while maintaining clinically useful specificity. Many of the lessons we learned in the development of this classifier are directly applicable to solving new problems in clinical diagnostics.

High-Dimensional Linear Regression: How to Choose the Objective Function

Abstract: Given n observations (X_{i}, Y_{i}) following the standard linear model Y_{i} = X_{i}'β + ε_{i} with i.i.d. errors ε_{i}, the class of regression M-estimates is defined by

argmin_{β} Σ_{i=1}^{n} ρ(Y_{i} - X_{i}'β)

and has well-known properties when p, the number of predictors, is negligible relative to n. In particular, the best estimate is obtained when ρ is equal to the negative logarithm of the error density, in accordance with the maximum likelihood principle. However, in the case where p is not negligible relative to n, the maximum likelihood principle fails. In this talk we review new results on the asymptotic distribution of such M-estimates when p/n is roughly constant. We show how these results can be used to find new objective functions ρ that outperform all other M-estimators, in particular the usual maximum likelihood estimator. An interesting feature of these new functions is that they depend on the ratio p/n.
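For contrast with the talk's new objectives, here is a classical choice of ρ: the Huber loss, fit by iteratively reweighted least squares for a single no-intercept slope (helper names are illustrative; k = 1.345 is the usual tuning constant):

```python
# Classical robust choice of rho: the Huber loss, fit by iteratively
# reweighted least squares (IRLS). This is standard methodology, not the
# new objective functions from the talk.

def ols_slope(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def huber_slope(xs, ys, k=1.345, iters=100):
    b = ols_slope(xs, ys)  # initial estimate
    for _ in range(iters):
        # Huber weights: 1 for small residuals, k/|r| for large ones
        w = [1.0 if abs(y - b * x) <= k else k / abs(y - b * x)
             for x, y in zip(xs, ys)]
        b = (sum(wi * x * y for wi, x, y in zip(w, xs, ys)) /
             sum(wi * x * x for wi, x in zip(w, xs)))
    return b
```

On data with a gross outlier, the Huber fit stays near the bulk of the data while plain least squares is dragged away, illustrating why the choice of ρ matters.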

Graph Learning With Corruptions

Abstract: Graphical models are used in many application domains, running the gamut from computer vision and civil engineering to political science and epidemiology. In many applications, estimating the edge structure of an underlying graphical model is of significant interest. For instance, a graphical model may be used to represent friendships between people in a social network or links between organisms with the propensity to spread an infectious disease.
However, data in real-world applications are often observed imperfectly: observations may be systematically corrupted by mechanisms such as additive noise or missing data. Running standard machine learning algorithms on such corrupted data often leads to systematically biased solutions, which remain inconsistent even in the limit of infinite data. We present new methods for edge recovery in graphical models with systematically corrupted observations. We show how to modify existing machine learning algorithms, such as regularized linear regression and the graphical Lasso, to accommodate systematic corruptions, and demonstrate the theoretical and practical consequences of our corrected algorithms for learning in Gaussian and discrete-valued graphs.
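As a small illustration of one such correction (a standard bias adjustment, not necessarily the paper's exact construction): when each coordinate is observed with independent additive noise of known variance, only the diagonal of the sample Gram matrix is biased, and it can be debiased directly:

```python
# Standard bias adjustment (illustrative): with independent additive noise of
# known variance on each coordinate, subtract that variance from the diagonal
# of the sample Gram matrix; off-diagonal entries are already unbiased.

def corrected_gram(rows, noise_var):
    """rows: observed vectors (signal + noise). Returns an unbiased estimate
    of the signal's second-moment matrix."""
    n, p = len(rows), len(rows[0])
    G = [[sum(r[i] * r[j] for r in rows) / n for j in range(p)]
         for i in range(p)]
    for i in range(p):
        G[i][i] -= noise_var
    return G
```

A corrected matrix of this kind can then be handed to downstream procedures such as the graphical Lasso, though it may first need to be projected back to the positive semidefinite cone.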

How Quickly Do Ensembles Converge?

Abstract: Some of the most widely used methods in machine learning are based on the principle that a collection of weak learners can be aggregated to form a single strong learner. This set of approaches is commonly referred to as "ensemble methods", and includes random forests, boosting, and bagging as well-known examples. A fundamental issue that determines both the statistical performance and computational cost of ensemble methods is the choice of the number of weak learners. Despite intense study over the last 10 to 15 years, there has been little theoretical understanding of how performance improves as a function of the number of weak learners. In particular, one would like to know how quickly the error rate err_n of an ensemble of n classifiers converges to its limiting value e*. In this talk, I will show that for methods such as random forests and bagging, the rate of convergence is given by err_n - e* = c/n + o(1/n), where the constant c has an exact formula and can be estimated from data. As a consequence, this result offers a principled and data-driven way to choose the number of classifiers.
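The c/n behaviour can be checked numerically in a toy model (my construction, not from the talk): let each of n classifiers (n odd) vote correctly at a given test point with probability p, with p drawn from the density 2p on (0,1), so that e* = P(p < 1/2) = 1/4:

```python
# Toy check of the c/n rate (illustrative construction, not the talk's):
# majority vote of n classifiers whose per-point accuracy p has density 2p.

from math import comb

def majority_error(n, grid=2000):
    """err_n: probability a majority vote is wrong, integrating the binomial
    CDF over the distribution of p by the trapezoid rule."""
    m = (n - 1) // 2  # wrong iff at most m of the n votes are correct
    total = 0.0
    for i in range(grid + 1):
        p = i / grid
        cdf = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1))
        edge = 0.5 if i in (0, grid) else 1.0
        total += edge * 2 * p * cdf  # 2p is the density of p
    return total / grid

# n * (err_n - e*) settles near a constant c as n grows
for n in (11, 51):
    print(n, n * (majority_error(n) - 0.25))
```

In this model the scaled excess error n·(err_n − e*) approaches a constant near 1/4, consistent with an err_n − e* = c/n + o(1/n) expansion.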

Classification of Sparse, Irregularly Sampled Time Series and Feature Measurement Error

Abstract: In high dimensional classification problems, a common practice is to extract features from each observation and then train a classifier on the features and associated observation classes. Here, I consider classification of periodic variable stars, which are essentially sparse, irregularly sampled periodic time series. The sampling of the time series introduces measurement error into derived features. Classifiers which do not account for the measurement error often have high misclassification rates. In this talk I introduce a time series resampling method, noisification, for addressing the measurement error, present results from empirical studies on astronomy data sets, and highlight the relationship between measurement error and regularization of classifiers.
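A minimal sketch of the noisification idea (the function and its parameters are illustrative assumptions): degrade a well-sampled training light curve so that it resembles a sparse, noisy test curve before features are extracted:

```python
# Illustrative sketch of noisification (details are assumptions): keep only a
# random subset of epochs and add measurement noise to a training light curve.

import random

def noisify(times, values, n_keep, noise_sd, rng):
    """Keep n_keep randomly chosen epochs and add Gaussian measurement noise."""
    idx = sorted(rng.sample(range(len(times)), n_keep))
    return ([times[i] for i in idx],
            [values[i] + rng.gauss(0.0, noise_sd) for i in idx])
```

Features (e.g. period, amplitude) are then extracted from the degraded curves, so the classifier is trained under measurement error comparable to what it will face at test time.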

Large-scale Spectral Disambiguation

Abstract: Disambiguation means clustering items by the entities they reference. Examples include figuring out who wrote what, whom a pronoun refers to, and what RNA molecule a short read came from. We focus on author disambiguation. There are n items (mentions of author names) and O(n) clusters (actual authors) to recover. Each mention has a small number of observed attributes, but O(n) attributes are observed in total. The induced similarity matrix is dense. In this setting, standard clustering algorithms either cannot effectively consider essential attributes, or are intractable due to at least quadratic runtime. We propose and evaluate a supervised, O(n log n) disambiguation procedure based on recursive spectral bipartitioning and an efficient representation of the similarity matrix.
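The core of a single spectral bipartition step can be sketched as follows (a toy, unsupervised version; the proposed procedure is supervised, recursive, O(n log n), and avoids materializing the dense similarity matrix):

```python
# Toy version of one spectral bipartition step. W is a small dense symmetric
# similarity matrix given as a list of lists; the real method never forms it.

def fiedler_split(W, iters=200):
    """Split nodes by the sign of the graph Laplacian's second eigenvector,
    found by power iteration on c*I - L with the all-ones direction removed."""
    n = len(W)
    deg = [sum(row) for row in W]
    c = 2 * max(deg) + 1.0  # makes every eigenvalue of c*I - L positive
    v = [(-1.0) ** i * (i + 1) for i in range(n)]  # arbitrary start vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # deflate the all-ones eigenvector of L
        Lv = [deg[i] * v[i] - sum(W[i][j] * v[j] for j in range(n))
              for i in range(n)]
        v = [c * v[i] - Lv[i] for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return [0 if x >= 0 else 1 for x in v]
```

Recursing on each side of the split until clusters are pure (or a supervised stopping rule fires) yields the hierarchical disambiguation.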

Reconstructing Visual Experiences From Brain Activity

Abstract: I will discuss our work in developing a brain decoder that can
reconstruct videos from recorded brain activity.
In this work, we recorded brain activity using functional MRI from the visual cortex
of subjects, as they watched a sequence of short video clips.
The objective of the decoder was to generate a video clip similar to the one shown, based
only on measurements from the brain (after training on separate data).
Our decoder combined three sources of information:
(a) regression models relating the video to the evoked brain activity,
(b) the multivariate distribution of the prediction errors of the regressions,
and (c) a prior of natural videos sampled from public video repositories.
By combining these sources, the decoder provided remarkable reconstructions of the displayed videos,
demonstrating that dynamic brain activity can be decoded using current technology.
This work is in collaboration with Shinji Nishimoto, An Vu, Thomas Naselaris, Bin Yu and Jack Gallant.
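A minimal caricature of the final decoding step (names invented; the actual decoder models the multivariate error distribution rather than assuming isotropic noise): choose the candidate clip from the prior whose predicted brain response best explains the measurement:

```python
# Caricature of the final decoding step (names invented): pick the prior clip
# whose regression-predicted brain response best explains the measured one,
# under an isotropic Gaussian error model.

def decode(measured, predicted_responses):
    """Index of the candidate maximizing Gaussian likelihood, i.e. minimizing
    squared prediction error."""
    def sse(pred):
        return sum((m - p) ** 2 for m, p in zip(measured, pred))
    return min(range(len(predicted_responses)),
               key=lambda i: sse(predicted_responses[i]))
```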

Beyond K-means and beyond clustering: BP-means and related ideas

Abstract: K-means is fast and conceptually straightforward, but it is designed to find a known number of equally-sized spherical clusters: mutually exclusive and exhaustive groups of data points that reflect latent structure. Bayesian methods have proven effective at discovering many other types of useful latent structure in data, but they are slower and require more background knowledge than K-means. We have designed a method that approximates Bayesian solutions to these problems in a form close to the K-means objective function, which yields faster and simpler learning algorithms. We show how our method produces an objective function for learning groups of data points, called features, that need not be exclusive or exhaustive, and we demonstrate the speed and novel capabilities of the resulting algorithms in experiments.
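For reference, plain K-means (Lloyd's algorithm) in one dimension; BP-means modifies an objective of this form so that a point may carry several features at once (this sketch is standard K-means, not BP-means):

```python
# Plain K-means (Lloyd's algorithm) on 1-D data, for reference only; BP-means
# replaces the exclusive-assignment objective with a feature-allocation one.

def kmeans_1d(xs, centers, iters=20):
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        groups = [[] for _ in centers]
        for x in xs:
            j = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            groups[j].append(x)
        # update step: each center moves to its group's mean
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers
```

Both steps monotonically decrease the K-means objective, which is what makes this family of algorithms fast and simple to analyze.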