one Industry Alliance speaker @ 30 minutes, two students @ 15 minutes each

How to make predictions when you're short on information

Affiliation: Department of Statistics, Department of Electrical Engineering and Computer Science, University of California, Berkeley

With the advent of massive social networks, exascale computing, and high-throughput biology, researchers in every scientific department now face profound challenges in analyzing, manipulating and identifying behavior from a deluge of noisy, incomplete data. In this talk, I will present a unifying optimization framework to make such data analysis tasks less sensitive to corrupted and missing data by exploiting domain-specific knowledge and prior information about structure. Specifically, I will show that when a signal or system of interest can be represented by a combination of a few simple building blocks, called atoms, it can be identified with dramatically fewer sensors and accelerated acquisition times. For example, RADAR signals can be decomposed into a sum of elementary propagating waves, metabolic dynamics can be analyzed as sums of multi-index data arrays, and aggregate rankings of sports teams can be written as sums of a few permutations. In each application, the challenge lies not only in defining the appropriate set of atoms, but also in estimating the most parsimonious combination of atoms that agrees with a small set of measurements. I will present a methodology to tackle both of these challenges and demonstrate how to scale the resulting algorithms to the massive data sets we now commonly acquire.

Digging trenches, building ramparts and making sense of metagenomic samples

Affiliation: University of California, Berkeley

Abstract: Metagenomics attempts to sample and study all the genetic material present in a community of microorganisms in environments that range from the human gut to the open ocean. This enterprise is made possible by high-throughput pyrosequencing technologies that produce a "soup" of DNA fragments which are not a priori associated with particular organisms or with particular locations on the genome. Statistical methods can be used to assign these fragments to locations on a reference phylogenetic tree using pre-existing information about the genomes of previously identified species. Each metagenomic sample thus results in a cloud of points on the reference tree. In seeking to answer questions such as what distinguishes the vaginal microbiomes of women with bacterial vaginosis from those of women without it, one is led to consider statistical methods for distinguishing between two or more such clouds. I will discuss joint work on this problem with Erick Matsen from the Fred Hutchinson Cancer Research Center in which we use ideas that go back to Gaspard Monge's 1781 treatise on the efficient construction of earthen fortifications.
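
As a rough illustration of the earth-moving idea attributed to Monge above, the sketch below computes a transport distance between two clouds of mass placed on a small rooted tree. It relies on the standard fact that, on a tree, the optimal transport cost equals the sum over edges of the edge length times the absolute difference in mass lying below that edge. The toy tree, the edge lengths, the two samples, and the function name tree_earth_movers_distance are hypothetical and are not taken from the talk.

```python
def tree_earth_movers_distance(parent, length, mass_p, mass_q):
    """Earth mover's distance between two mass distributions on a rooted tree.

    parent[v] : parent of node v, or None for the root
    length[v] : length of the edge joining v to its parent
    mass_p, mass_q : dict node -> mass; both should have the same total
    """
    # Net mass difference sitting at each node.
    diff = {v: mass_p.get(v, 0.0) - mass_q.get(v, 0.0) for v in parent}

    def depth(v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    total = 0.0
    # Visit deepest nodes first so every child is handled before its parent.
    for v in sorted(parent, key=depth, reverse=True):
        if parent[v] is None:
            continue
        # diff[v] is now the net mass in the subtree hanging below this edge,
        # all of which has to cross the edge.
        total += length[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total


# Hypothetical reference tree: root -> a -> {b, c}, and root -> d.
parent = {"root": None, "a": "root", "b": "a", "c": "a", "d": "root"}
length = {"a": 1.0, "b": 2.0, "c": 2.0, "d": 3.0}
sample_1 = {"b": 0.5, "c": 0.5}
sample_2 = {"b": 0.5, "d": 0.5}
distance = tree_earth_movers_distance(parent, length, sample_1, sample_2)
```

In this toy case half a unit of mass has to travel from c to d along a path of length 6, so the distance is 3.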

Thin Factor Bias in Equity Return Covariance Estimation

Affiliation: Citadel, Risk Analyst

Abstract: It is generally recognized that time-series of equity returns contain spurious in-sample correlations that can be misleading in the context of portfolio optimization. Factor models mitigate this problem by positing that equity returns are decomposed into systematic (factor) and diversifiable (stock specific) components. This modeling assumption filters structural correlations from spurious ones due to random coincidences among stocks. In light of the widespread use of factor models in finance, it is perhaps surprising that many commercially available models are materially biased as estimators of the stock-stock covariance matrix. In particular, these models produce biased risk forecasts for standard portfolios associated with the factors themselves, and this bias increases with the “thinness” of the factors. We show how “thin factor bias” arises from sampling error inherent in the factor return estimation process and discuss an approach to correct it.
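
To make the decomposition into systematic and stock-specific components concrete, here is a minimal sketch of the stock-stock covariance implied by a generic linear factor model, B F B^T + D, where B holds factor exposures, F is the factor covariance, and D is a diagonal matrix of specific variances. This is the textbook form only; it does not reproduce the bias correction discussed in the talk, and all numbers and function names are hypothetical.

```python
def factor_model_covariance(exposures, factor_cov, specific_var):
    """Covariance implied by a linear factor model: B F B^T + D."""
    n = len(exposures)
    k = len(factor_cov)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Systematic part: exposures of stocks i and j through the factors.
            cov[i][j] = sum(
                exposures[i][a] * factor_cov[a][b] * exposures[j][b]
                for a in range(k)
                for b in range(k)
            )
        # Diversifiable part: specific variance only on the diagonal.
        cov[i][i] += specific_var[i]
    return cov


def portfolio_variance(weights, cov):
    """Risk forecast w^T Sigma w for a portfolio with the given weights."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))


# Hypothetical inputs: three stocks, two factors.
exposures = [[1.0, 0.0], [1.0, 1.0], [0.5, 0.0]]
factor_cov = [[0.04, 0.0], [0.0, 0.01]]
specific_var = [0.02, 0.03, 0.01]
cov = factor_model_covariance(exposures, factor_cov, specific_var)
equal_weight_var = portfolio_variance([1 / 3, 1 / 3, 1 / 3], cov)
```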

Combining Network Methods and Survival Analysis to Model Prescription Behavior

Affiliation: Deloitte Consulting, Director

Since the design of marketing campaigns for new drugs can have a substantial impact on a drug’s success, pharmaceutical manufacturers often spend more on marketing a drug than on its development. Traditional marketing methods have focused on historically high prescribers or other measures that score a physician’s propensity to prescribe a drug in isolation from other physicians. However, several researchers have found that “prescribing drugs is contagious,” in that physicians are likely to be influenced by their peers in deciding which drugs to prescribe. A team at Deloitte (including Greg Szwartz, Stephen Bay, Krishna Kumaraswamy, Steve Berman, Jim Guszcza and Amin Torabkhani) developed a model of adoption probability that includes both measures of a physician’s centrality in a network of physicians and historical prescription behavior in a survival model. This model was applied with considerable success in increasing prescription revenue and decreasing marketing expenditures.

Intelligent Personal Assistants: a view from the intersection of data mining and mobile services

Affiliation: NTT Docomo, Senior Research Engineer

Abstract: With the introduction of Siri, the concept of an "intelligent personal assistant" residing
on one's mobile phone has become widely understood. What is not widely known is that the
first intelligent interactive personal assistant for Android came not from Google but from
DOCOMO, in the form of the "Shabette Concier" service in Japan in 2012. With the introduction
of "Google Now" later in 2012, the world was introduced to another kind of intelligent personal assistant,
one that was predictive rather than just reactive. At DOCOMO Innovations in Palo Alto, we
have worked on designing interactive and predictive intelligent personal assistant applications
running on mobile devices. This talk will describe the challenges and design choices
involved. For "Shabette Concier", this included machine learning for task identification,
collection of training data for training the machine learning algorithm, and entity extraction
from natural language text. For the predictive assistant, called "Tap de Concier," the
challenges of security, personalization, and always-on availability led to a novel architecture
design that is an example of "massive" (in terms of numbers of users) data mining on "small"
data sets (each user's information).

A case for nonconvex optimization

Recent years have brought about a flurry of work on convex relaxations of nonconvex problems. For instance, the convex l_1 norm is used as a convex surrogate for the nonconvex l_0 penalty, which counts the number of nonzeros in a vector. Convex objectives have the attractive property that local optima are also global optima, and these optima may be found efficiently.
We present results following a line of current work that advocates the use of nonconvex regularizers. Although the resulting objective functions possess multiple local and global optima, we show that for interesting classes of functions arising from statistical estimation problems, both local and global optima are statistically consistent. Our work is the first of its kind to provide sufficient conditions under which local and global optima are clustered together, and presents a favorable case in the realm of nonconvex methods for statistical estimation.
This is joint work with Martin Wainwright.
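
The contrast between the convex l_1 surrogate and the nonconvex l_0 penalty can be seen in the simplest one-dimensional denoising problem, sketched below. For a scalar observation y, minimizing 0.5*(x - y)^2 plus the penalty gives soft thresholding under l_1 and hard thresholding under l_0. This is a standard textbook illustration, not the regularizers or results of the talk, and the function names are hypothetical.

```python
import math


def soft_threshold(y, lam):
    """Minimizer of 0.5*(x - y)**2 + lam*abs(x), the convex l_1 penalty."""
    return math.copysign(max(abs(y) - lam, 0.0), y)


def hard_threshold(y, lam):
    """A minimizer of 0.5*(x - y)**2 + lam*(x != 0), the nonconvex l_0 penalty."""
    # Keeping x = y costs lam; setting x = 0 costs 0.5*y**2.
    return y if abs(y) > math.sqrt(2.0 * lam) else 0.0


large = (soft_threshold(3.0, 1.0), hard_threshold(3.0, 1.0))
small = (soft_threshold(0.5, 1.0), hard_threshold(0.5, 1.0))
```

Both rules set small observations to zero, but the l_1 rule also shrinks large observations by lam, while the l_0 rule leaves them untouched.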

CAGe - A hybrid pipeline for efficient variant calling

We present CAGe, a statistical algorithm which exploits high sequence identity between sampled genomes and a reference assembly to streamline the variant calling process. Using a combination of changepoint detection, classification, and online variant detection, CAGe is able to call simple variants quickly and accurately on the 90-95% of a sampled genome which differs little from the reference, while correctly learning the remaining 5-10% that must be processed using more computationally expensive methods. CAGe runs on a deeply sequenced human whole genome sample in less than 20 minutes, potentially reducing the burden of variant calling by an order of magnitude after one memory-efficient pass over the data.

The Geometry of Kernelized Spectral Clustering

Clustering of data sets is a standard problem in many areas of science and engineering. The method of spectral clustering is based on embedding the data set using a kernel function, and using the top eigenvectors of the normalized Laplacian to recover the connected components. We study the performance of spectral clustering in recovering the latent labels of i.i.d. samples from a finite mixture of non-parametric distributions. The difficulty of this label recovery problem depends on the overlap between mixture components and how easily a mixture component is divided into two non-overlapping components. When the overlap is small compared to the indivisibility of the mixture components, the principal eigenspace of the population-level normalized Laplacian operator is approximately spanned by the square-root kernelized component densities. In the finite sample setting, embedded samples from different components are approximately orthogonal with high probability when the sample size is large. As a corollary we control the misclustering rate of spectral clustering under finite mixtures with nonparametric components.
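
The embedding step described above can be sketched for the two-cluster case as follows. The sketch builds a Gaussian kernel matrix K, forms the symmetrically normalized matrix M = D^{-1/2} K D^{-1/2} (the convention assumed here; the corresponding Laplacian is I - M), and splits the points by the sign of the second leading eigenvector of M, found by power iteration. The kernel choice, the bandwidth, the toy data, and the function name are assumptions for illustration and do not come from the talk.

```python
import math


def spectral_bipartition(points, bandwidth=1.0, iterations=200):
    """Split 1-D points into two groups using the normalized kernel matrix."""
    n = len(points)
    # Gaussian kernel matrix.
    K = [[math.exp(-(p - q) ** 2 / (2.0 * bandwidth ** 2)) for q in points]
         for p in points]
    deg = [sum(row) for row in K]
    # M = D^{-1/2} K D^{-1/2}.
    M = [[K[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of M is proportional to sqrt(deg), eigenvalue 1.
    norm = math.sqrt(sum(deg))
    v1 = [math.sqrt(d) / norm for d in deg]

    def deflate(x):
        # Remove the component along the leading eigenvector.
        dot = sum(a * b for a, b in zip(x, v1))
        return [a - dot * b for a, b in zip(x, v1)]

    # Deterministic start; assumed not orthogonal to the second eigenvector.
    x = deflate([float(i) for i in range(n)])
    for _ in range(iterations):
        x = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = deflate(x)
        scale = math.sqrt(sum(a * a for a in x))
        x = [a / scale for a in x]
    # The sign of the second eigenvector gives the cluster label.
    return [0 if a < 0 else 1 for a in x]


# Two well-separated hypothetical groups on the line.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = spectral_bipartition(data)
```

With more than two components one would typically keep several leading eigenvectors and cluster the embedded points, for example with k-means.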

Integrating Real World Applications in the Classroom: Experiences
from the Master's Capstone Course

Affiliation: Department of Statistics, University of California, Berkeley

This talk will discuss experiences with a project-based approach to
teaching data science, including collaboration and opportunities for
engagement with real world data providers.

Coexistence in preferential attachment networks

We introduce a new model of product adoption that focuses on word-of-mouth recommendations. Specifically, when a new node joins the network, it chooses neighbors according to preferential attachment, and then chooses its type based on the number of initial neighbors of each type. This can model, e.g., a new cell-phone user choosing a cell-phone provider. The main qualitative feature of our model is that often several competitors will coexist, which matches empirical observations in many current markets. This is joint work with Tonci Antunovic and Elchanan Mossel.
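
A small simulation can make the model above concrete. In this sketch each new node picks m neighbors with probability proportional to degree, then adopts the type held by the most of those neighbors, breaking ties at random. The abstract does not state the exact type-selection rule, so the majority rule, the seed network, sampling neighbors with replacement, and every name and parameter below are assumptions.

```python
import random


def simulate_adoption(num_nodes, m=3, num_types=2, seed=0):
    """Grow a preferential attachment network in which nodes adopt types."""
    rng = random.Random(seed)
    # Seed network: one node per type, all joined to each other.
    types = list(range(num_types))
    # Each node appears once per incident edge, so a uniform draw from this
    # list is a degree-proportional (preferential attachment) draw.
    endpoints = []
    for u in range(num_types):
        for v in range(u + 1, num_types):
            endpoints += [u, v]

    for new in range(num_types, num_nodes):
        # Choose m neighbors (with replacement, so repeats are possible).
        neighbors = [rng.choice(endpoints) for _ in range(m)]
        counts = [0] * num_types
        for v in neighbors:
            counts[types[v]] += 1
        # Assumed rule: adopt the most common neighbor type, ties at random.
        best = max(counts)
        new_type = rng.choice([t for t in range(num_types) if counts[t] == best])
        types.append(new_type)
        for v in neighbors:
            endpoints += [new, v]
    return types


types = simulate_adoption(500)
share_of_type_0 = types.count(0) / len(types)
```

Running this for several seeds and inspecting share_of_type_0 is one way to explore whether competing types persist under the assumed rule.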