PB HLTH 295, Section 004
Statistics and Genomics Seminar

Spring 2015





Thursday, January 22nd


Treatment Effect Heterogeneity
Peng Ding
Department of Statistics, Harvard University

Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation not explained by observed covariates. We propose a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, we must address the fact that the average treatment effect, generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply our method to the National Head Start Impact Study, a large-scale randomized evaluation of a Federal preschool program, finding that there is indeed significant unexplained treatment effect variation.
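
To make the nuisance-parameter issue concrete, here is a minimal sketch of a "plug-in" randomization test for a constant treatment effect. It is an illustration under stated assumptions, not the method advocated in the talk. The function names (`shifted_ks`, `plug_in_frt`), the choice of a mean-shifted Kolmogorov-Smirnov statistic, and the complete-randomization design are all assumptions made for this example. Because the unknown average treatment effect is replaced by its estimate, this version is not guaranteed to be valid in finite samples, which is exactly the difficulty the abstract describes.

```python
import numpy as np

def shifted_ks(y, z):
    """KS distance between treated and control outcomes after centering each arm."""
    treated = np.sort(y[z == 1] - y[z == 1].mean())
    control = np.sort(y[z == 0] - y[z == 0].mean())
    grid = np.concatenate([treated, control])
    # Empirical CDFs of both centered arms, evaluated at every observed point.
    cdf_t = np.searchsorted(treated, grid, side="right") / treated.size
    cdf_c = np.searchsorted(control, grid, side="right") / control.size
    return np.max(np.abs(cdf_t - cdf_c))

def plug_in_frt(y, z, n_perm=2000, seed=0):
    """Randomization p-value for the null of a constant effect (plug-in sketch)."""
    rng = np.random.default_rng(seed)
    # Nuisance parameter: the average treatment effect, estimated from the data.
    tau_hat = y[z == 1].mean() - y[z == 0].mean()
    # Under a constant effect equal to tau_hat, both potential outcomes are imputable.
    y0 = y - z * tau_hat
    y1 = y0 + tau_hat
    observed = shifted_ks(y, z)
    exceed = 0
    for _ in range(n_perm):
        z_new = rng.permutation(z)            # re-draw the assignment
        y_new = np.where(z_new == 1, y1, y0)  # outcomes that would have been observed
        if shifted_ks(y_new, z_new) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Here `y` is a NumPy array of outcomes and `z` a 0/1 array of treatment assignments of the same length. A small p-value suggests treatment effect variation beyond a constant shift, subject to the plug-in caveat above.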


Thursday, January 29th


Drugs that Reverse Disease Transcriptomic Signatures are More Effective in a Mouse Model of Dyslipidemia
Allon Wagner
Department of Electrical Engineering and Computer Sciences, UC Berkeley

High-throughput omics have proven invaluable in studying human disease, and yet day-to-day clinical practice still relies on physiological, non-omic markers. For example, the metabolic syndrome is diagnosed and monitored by blood and urine indices such as blood cholesterol levels. Nevertheless, the association between the molecular and the physiological manifestations of the disease, especially in response to treatment, has not been investigated in a systematic manner. To this end, we studied a mouse model of diet-induced dyslipidemia and atherosclerosis that was subject to various drug treatments relevant to the disease in question. Both physiological data and gene expression data (from the liver and white adipose) were analyzed and compared. We find that treatments that restore gene expression patterns to their norm are associated with the successful restoration of physiological markers to their baselines. This holds in a tissue-specific manner -- treatments that reverse the transcriptomic signatures of the disease in a particular tissue are associated with positive physiological effects in that tissue. Further, treatments that introduce large non-restorative gene expression alterations are associated with unfavorable physiological outcomes. These results provide a sound basis for in silico methods that rely on omic metrics for drug repurposing and drug discovery by searching for compounds that reverse a disease's omic signatures. Moreover, they highlight the need to develop drugs that restore the global cellular state back to its healthy norm rather than rectify particular disease phenotypes.

A. Wagner, N. Cohen, T. Kelder, U. Amit, E. Liebman, D. M. Steinberg, M. Radonjic and E. Ruppin
Forthcoming in Molecular Systems Biology


Thursday, February 5th


When Covariate-Adjusted Response-Adaptive RCT meets Data-adaptive Loss-based Estimation
Dr. Wenjing Zheng
Center for AIDS Prevention Studies, UC San Francisco

Adaptive clinical trial design methods have garnered growing attention in recent years, in large part due to their greater flexibility relative to their traditional counterparts. An adaptive trial design allows pre-specified modifications to key aspects of the on-going trial based on analysis of the accruing data, while preserving the validity and integrity of the trial. One such design is the so-called group-sequential Covariate-Adjusted Response-Adaptive (CARA) randomized controlled trial (RCT). In a CARA RCT, the treatment randomization schemes are allowed to depend on the patient's pre-treatment covariates (Covariate-Adjusted), and they can be modified during the course of the trial based on accruing information, including preceding patients' responses (Response-Adaptive), in order to meet some pre-specified trial objectives. In a group-sequential CARA RCT, such adjustments take place at interim time points given by sequential inclusion of blocks of c patients.

In this talk, we present a novel group-sequential CARA RCT design and inferential procedure that admit the use of flexible data-adaptive techniques. The proposed framework adopts a loss-based approach to construct more flexible CARA randomization schemes while exploiting data-adaptive estimators for the response model. In general, this approach allows better adaptation towards the user-supplied optimal randomization scheme through better variable adjustments and the targeted construction of an instrumental loss function. Under the proposed framework, the parameter of interest is non-parametrically defined and is estimated using the paradigm of Targeted Maximum Likelihood Estimation (TMLE) based on such an adaptive sampling scheme. We establish that under appropriate empirical process conditions, the sequence of randomization schemes converges to a fixed scheme, and the proposed TMLE estimator is consistent and asymptotically normal, thus delivering valid confidence intervals. We illustrate the proposed framework with the use of LASSO regressions to estimate the conditional response given treatment and baseline covariates.


Thursday, February 12th


Inference of Executable Biological Models
Professor Rastislav Bodik
Computer Science Division, UC Berkeley

Executable biology is a branch of systems biology in which biological models are executable programs. Executable models simulate the behavior of the cell and mechanistically explain the function of a cellular system, such as how stem cells coordinate their fate determination. My group works on algorithms that synthesize executable models automatically from experimental data and prior knowledge. In this talk, I will illustrate the use of program synthesis in systems biology on VPC LIN-12/Notch signaling in C. elegans (using mutation experiments) and on inference of EGF pathways (using phosphorylation time-course data). I will also describe algorithms for summarizing the ambiguity in the model caused by insufficient data, and algorithms for suggesting which experiments are worthwhile to conduct because they reduce the ambiguity in a model.

Joint work with Ali Sinan Koksal, Anthony Gitter, Kirsten Beck, Aaron McKenna, Saurabh Srivastava, Nir Piterman, Yewen Pu, Alejandro Wolf-Yadlin, Ernest Fraenkel, Jasmin Fisher


Thursday, February 19th


Genome-wide Profiling of Translation Initiation and Protein Synthesis
Professor Nicholas Ingolia
Department of Molecular and Cell Biology, UC Berkeley

Gene expression is directed, ultimately, at the production of functional protein products encoded by most genes. Protein synthesis is also one of the most demanding cellular biosynthetic processes. Translation of mRNA into protein is thus a key point of regulation in determining the identities and quantities of protein that a cell produces. Despite the broad biological impact of this regulation, our understanding of its effects has been limited by the difficulty of measuring in vivo translation.

We developed the ribosome profiling technique to address this challenge by providing global, quantitative measurements of in vivo translation. In ribosome profiling, nuclease digestion removes most mRNA, leaving behind only a short ~30 nt fragment physically enclosed by the ribosome and thus protected from digestion. The sequence of this fragment indicates the exact position of the ribosome on a transcript. Thus, the presence of ribosome footprints provides a direct annotation of the sequences in the cell that are being translated into protein. The density of these ribosome-protected footprints on an mRNA indicates its level of translation, providing a measure of gene expression at the level of protein synthesis, and thus a tool to detect translational control of gene expression.

Variations in the density of footprints across individual transcripts also provide insights into the mechanics of protein production in the cell. Footprint density is proportional to the dwell time of the ribosome on individual codons, providing a new opportunity to investigate the molecular basis underlying the strong preference for specific codon usage within the collection of synonymous codons that encode most amino acids. We developed a robust method to extract codon translation rates from ribosome profiling data, and identify causal features associated with elongation speed. Though the speed of translation correlates with codon usage bias in yeast, we show that neither elongation rate nor translational efficiency is greatly affected by experimental manipulation of tRNA abundance. Our results suggest that correlation between codon bias and efficiency arises as selection for codons to utilize translation machinery efficiently in highly translated genes.


Thursday, February 26th


Using Mixtures of Biological Samples as Genome-Wide Controls
Dr. Jerod Parsons
Stanford/National Institute for Standards and Technology (NIST) Advances in Biomedical Measurement Science (ABMS) Program

Genome-scale "-omics" measurements are challenging to benchmark due to the enormous variety of unique biological molecules involved. Both spike-in controls and mixtures of previously-characterized samples can be used to benchmark repeatability and reproducibility by adding known ground-truth values to the experiment. I show that RNA-sequencing transcriptome expression measures from mixtures can be readily modeled and provide useful metrics for evaluating data quality.
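
As a rough illustration of how a mixture can act as a control with known ground truth, the sketch below assumes the simplest possible model: the expression profile of a mixture is a weighted sum of the profiles of its pure components. This linear model, the function name `fit_mixture`, and the toy numbers are assumptions made for this example. They are not taken from the talk, and real data would also require handling normalization and differences in RNA content between samples.

```python
import numpy as np

def fit_mixture(pure_profiles, mixture_profile):
    """Fit mixture = pure_profiles @ weights by ordinary least squares.

    pure_profiles: array of shape (n_genes, n_components), one column per pure sample.
    mixture_profile: array of shape (n_genes,), expression measured in the mixture.
    Returns the fractions (weights rescaled to sum to one) and per-gene residuals.
    """
    weights, *_ = np.linalg.lstsq(pure_profiles, mixture_profile, rcond=None)
    residuals = mixture_profile - pure_profiles @ weights
    return weights / weights.sum(), residuals

# Toy example: four genes, two pure samples, mixed at a known 1:3 ratio.
pure = np.array([[10.0, 0.0],
                 [5.0, 5.0],
                 [0.0, 20.0],
                 [3.0, 1.0]])
known_fractions = np.array([0.25, 0.75])
mixture = pure @ known_fractions

fractions, residuals = fit_mixture(pure, mixture)
```

Comparing the estimated fractions with the designed mixing ratio, and inspecting the per-gene residuals, gives one simple kind of quality metric under this assumed model.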


Thursday, March 5th


Powerful Design and Association Methods for Next-generation Sequencing Studies
Shuang Feng
Department of Biostatistics, University of Michigan, Ann Arbor

Advances in next-generation sequencing (NGS) are enabling explorations of association between rare coding variants and complex traits. However, power to detect these rare variants is typically limited. In this talk, we discuss powerful study designs, association methods, and meta-analysis methods for NGS studies. We describe situations where family-based studies provide greater power than studies of unrelated individuals to detect rare variants associated with moderate to large changes in trait values. Furthermore, considering the high cost of sequencing, we propose a novel likelihood-based method, RAREFY, to prioritize for sequencing the individuals who are more likely to carry trait-associated rare variants. Finally, to ensure power for these rare variant association analyses, we develop family-based burden tests, variable frequency threshold tests and sequence kernel association tests (SKAT) for both single study analysis and meta-analysis.


Thursday, March 19th


A Genetic and Socio-Economic Study of Mate Choice in Latinos Reveals Novel Assortment Patterns
Dr. James Y. Zou
Microsoft Research New England and MIT

Nonrandom mating in human populations has important implications for genetics and medicine as well as for economics and sociology. In this study, we performed an integrative analysis of a large cohort of Mexican and Puerto Rican couples using detailed socio-economic attributes and genotypes. We found that in ethnically homogeneous Latino communities, partners are significantly more similar in their genomic ancestries than expected by chance. Consistent with this, we also found that partners are more closely related -- equivalent to between third and fourth cousins in Mexicans and Puerto Ricans -- than matched random male-female pairs. Our analysis showed that this genomic ancestry similarity cannot be explained by socio-economic factors alone. Strikingly, the assortment of genomic ancestry in couples was consistently stronger than even the assortment of education. We found enriched correlation of partners' genotypes at genes known to be involved in facial development, suggesting a biological mechanism driving assortative mating. We replicated our results across multiple locations. In the talk, I'll also discuss some of the new statistical methods that we developed to infer the parental genomic ancestry from the offspring's genotypes.


Thursday, April 2nd


Non Parametric DNA Copy Number Segmentation Using Kernels
Morgane Pierre-Jean
Laboratoire Statistique et Genome, UMR CNRS 8071, USC INRA

A number of change-point detection methods have been proposed to analyse DNA copy number and allele B fraction profiles. These profiles are characterized by abrupt changes in their distribution (mean, number of modes, variance, etc.). However, available approaches do not directly tackle this problem. Instead, they first pre-process and transform the data and then detect abrupt changes in the mean of the pre-processed signal. This pre-processing results in a loss of information.

The recently proposed kernel based segmentation approach offers a unified framework to detect changes in the whole distribution of a signal and is an interesting alternative to this ad-hoc pre-processing. However, kernel based segmentation is computationally inefficient and cannot be applied as is to large DNA copy number profiles. Indeed, for an arbitrary kernel, its complexity is quadratic in the size of the data, both in space and time. In order to apply this method to large DNA copy number profiles, we also propose a heuristic.
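
The quadratic cost mentioned above can be seen in a minimal sketch of kernel change-point detection restricted to a single change point. This is an illustration only: the Gaussian kernel, the bandwidth, and the function names are assumptions for this example, and it implements neither the adapted kernel nor the heuristic proposed in the talk. The n-by-n Gram matrix alone already requires memory and time quadratic in the profile length n.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """n-by-n Gram matrix of a Gaussian kernel: the source of the quadratic cost."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def best_single_change_point(x, bandwidth=1.0):
    """Return t minimizing the kernel within-segment cost of x[:t] and x[t:]."""
    n = x.size
    gram = gaussian_gram(x, bandwidth)
    # prefix[i, j] holds the sum of gram[:i, :j], so any block sum is O(1).
    prefix = np.zeros((n + 1, n + 1))
    prefix[1:, 1:] = gram.cumsum(axis=0).cumsum(axis=1)
    diag = np.concatenate([[0.0], np.cumsum(np.diag(gram))])

    def cost(a, b):
        # Kernel analogue of the within-segment sum of squares on x[a:b].
        block = prefix[b, b] - prefix[a, b] - prefix[b, a] + prefix[a, a]
        return (diag[b] - diag[a]) - block / (b - a)

    costs = [cost(0, t) + cost(t, n) for t in range(1, n)]
    return int(np.argmin(costs)) + 1

# Toy profile: 50 points at level 0 followed by 50 points at level 5.
profile = np.concatenate([np.zeros(50), np.full(50, 5.0)])
change_point = best_single_change_point(profile)
```

Because the cost depends on the data only through kernel evaluations, the same code can respond to changes in variance or in the number of modes, not only in the mean, depending on the kernel chosen.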

We illustrate the performance of the kernel based segmentation and of its heuristic on the copy number and allele B fraction profiles, for which we designed an adapted kernel. We assess the competitive performance of our approach using realistic profiles simulated with the acnr R package.


Thursday, April 9th


Accounting for Technical Noise in Single-Cell RNA-Seq Experiments
Dr. Philip Brennecke
Department of Genetics, Stanford University

Single-cell RNA-seq can yield valuable insights into the variability within a population of cells. I will describe a quantitative statistical method that distinguishes true biological variability from the high levels of technical noise in single-cell RNA-seq experiments. Our approach quantifies the statistical significance of observed cell-to-cell variability in expression strength on a gene-by-gene basis and thus allows the assessment of “within population” heterogeneity.

I will also discuss recent work in which we apply our method in the context of T cell negative selection: using single-cell RNA-seq, we look at the regulation of gene expression of self-antigens, which have to be ectopically expressed in medullary thymic epithelial cells (mTECs) to ensure successful self-tolerance induction.


Thursday, April 16th


A Convex Formulation for Joint RNA Isoform Detection and Quantification from Multiple RNA-Seq Samples
Professor Jean-Philippe Vert
Mines ParisTech and Institut Curie, Paris, France

Detecting and quantifying isoforms from RNA-seq data is an important and challenging task. The problem is often ill-posed since different combinations of isoforms may correctly explain the observed read counts, particularly at low coverage. Assuming that some isoforms are shared between samples, simultaneously detecting isoforms from multiple samples can yield better estimation by increasing the total number of reads available and the diversity in relative abundances between different transcripts. We propose a new method for solving this isoform deconvolution problem jointly across several samples. The method is an extension of the FlipFlop technique, which was initially proposed to identify and quantify isoforms from a single sample, and is formulated as a convex optimization problem. We demonstrate the benefits of combining several samples for isoform detection, and show that our approach outperforms simple pooling strategies and other methods based on mixed integer programming. Source code is freely available as an R package from the Bioconductor web site (http://www.bioconductor.org) and more information is available at http://cbio.ensmp.fr/flipflop.

(Joint work with Elsa Bernard, Laurent Jacob, Julien Mairal and Eric Viara)


Thursday, April 23rd


Statistical and Algorithmic Challenges at Whole Biome
Dr. James Bullard
Whole Biome

Whole Biome Inc. is a two-year-old startup focused on the development of novel diagnostics and therapeutics targeting the human microbiome. In this talk, I will focus on the different computational and statistical techniques used when analyzing metagenomic samples. In addition to challenges related to the analysis of metagenomics data, I will discuss building an informatics-based company on the public cloud and some best practices.


Thursday, April 30th


The Early Response to Ecdysone in 41 Diverse Cell Lines
Marcus Stoiber
Graduate Group in Biostatistics, UC Berkeley

Endocrine signals transduced by nuclear hormone receptors elicit major biological responses that include differentiation, cell growth, cell death and metamorphosis. Responses to these signals alter gene expression, and each cell type responds differently to a single, common endocrine signal. In Drosophila, the molting hormone 20-hydroxyecdysone (20E) directs major developmental transitions during the life cycle. Here we survey the early ecdysone responses of 41 Drosophila cell lines, representing diverse transcriptional and cytological states. We observe genes that are widespread in their responsiveness, those responding in most lines, and many more whose responsiveness is restricted to one or a few lines. Genes in the widespread class include those previously identified in ecdysone response studies in a few tissues and in genetic analyses. Many restricted genes are induced in some cell lines, repressed in others and fail to respond in still others. The expression level of the ecdysone receptor (EcR) predicts both the extent of global transcriptional response and the kinetic progression of cellular responses, and hence EcR titer appears to be rate limiting for ecdysone transduction. Promoter motif compositions combined with transcription factor titers provide significant predictive power for the identification of restricted responses. We characterize the conditional responsiveness for genes with shared promoter architecture and find that transcripts initiating from a bidirectional promoter can be independently controlled in ecdysone response. These findings provide the basis for decoding the specificity of ecdysone responses, and for understanding the pathways of nuclear hormone receptors.