PB HLTH 292, Section 020
Statistics and Genomics Seminar

Spring 2013

Thursday, January 24th

Metabolic Reconstruction and the Evolution of Plant Metabolism
Dr. Lee H. Chae
Department of Plant Biology, Carnegie Institution for Science, Stanford, CA

The appearance and evolution of more than a quarter-million angiosperm species represent one of the most successful terrestrial radiations in natural history. To date, little is known about the biological capacities that allowed plants to diversify so prodigiously and occupy the variety of niches they inhabit. Here, we are interested in elucidating the role that metabolism has played in the evolution and diversification of plant life. To enable our investigation, we developed a machine-learning-based framework for the functional classification of enzymes, and we used it to reconstruct and compare the genome-scale metabolic networks of 11 plant species. In this talk, I’ll present an overview of the approach and discuss the mechanisms and dynamics underlying the evolution of plant metabolism as revealed by our studies.

Thursday, January 31st

Subset Based Assessment of High-Throughput Gene Expression Data Quality in Time Course Experiments
Dr. Julia Brettschneider
Department of Statistics, The University of Warwick, UK

Gene expression time course experiments can be used for assessing the role of genes in developmental (e.g. embryo growth), cyclic (e.g. circadian rhythms) and other processes. The key to reproducible findings is accurate and precise measurement. For short oligonucleotide microarrays, NUSE and RLE distributions have turned out to provide useful insight into outliers, biases from different study groups and artefacts caused by experimental conditions, and to provide a guide to removing them and preventing them in future experiments.

We shed light on the appropriate use and interpretation of these measures in the context of time course experiments. This includes looking at the validity of the assumptions underlying the use of NUSE and RLE for quality assessment. For cases where the assumptions are violated, we develop an alternative strategy involving subsets of genes judged to be stable over time by suitable criteria. We further study the dependency between quality measures and their dependency on preprocessing choices. The data used in this work are from short oligonucleotide microarrays, but most of the concepts and methods can be transferred to other platforms.

Thursday, February 7th

Mitochondrial DNA Organization in Trypanosomatid Parasites
Professor Javier Arsuaga
Department of Mathematics, San Francisco State University

Trypanosomes and leishmania, two trypanosomatid parasites, are protozoa which cause fatal diseases such as sleeping sickness, Chagas disease and Leishmaniasis. Despite their negative impact on the health and economies of many third world countries, these diseases continue to be catalogued as "neglected" by the World Health Organization. A distinctive feature of trypanosomatid parasites is that their mitochondrial DNA, known as kinetoplast DNA, is organized into thousands of minicircles and a few dozen maxicircles. Minicircles in these organisms are topologically linked, forming a gigantic chainmail-like network. The topological structure of the minicircles is of great significance both because it is species-specific and because it is a promising target for the development of drugs. However, the biophysical factors that led to the formation of the network during evolution and that maintain the network remain poorly understood.

In this talk I will discuss a mathematical and computational model that helps us quantify possible growth mechanisms of the network. We find that the linking probability of two minicircles decreases linearly with the distance between the centers of the minicircles, that minicircle networks rapidly form, following a percolation-saturation pathway, when the density of minicircles is increased, and that the valence (the number of minicircles linked to any given minicircle) grows linearly with the density of minicircles. Limitations and extensions of these models and results will also be discussed.

Thursday, February 14th

Applications of Metagenomic Epidemiology: From Social Networks to Childhood Leukemia
Stephen Francis
School of Public Health, UC Berkeley

The last 10 years have seen a rapid increase in the understanding of the human microbiome. Next generation sequencing and new bioinformatic tools now allow the microbiome to be investigated and compared between individuals. This presentation will discuss applications of metagenomic epidemiology, from deep sequencing of leukemic bone marrow for viral discovery to a novel study examining social contact and the transmission of commensal organisms. We will discuss the biology, technology and analytic tools that will hopefully shed light on the etiology of many complex diseases.

Thursday, February 21st

An Introduction to Brain Imaging Genetics: Bioinformatics and Biostatistical Challenges
Dr. Jean-Baptiste Poline
CEA-Neurospin and UC Berkeley

In this talk, I will first introduce the field of neuroimaging genetics, briefly explain the brain phenotypes studied, and report on their heritability. I will describe some bioinformatics and statistical challenges faced by new neuroimaging genetics studies that search through a very high number of associations between brain regions and genetic variations, using the example of a large European multicenter study (the IMAGEN project). I will review some current solutions proposed in the literature and show an example where multivariate techniques combined with variable selection can yield significant results when massive univariate analyses do not. Last, I will conclude with some thoughts on the reproducibility of brain imaging genetics studies.

Thursday, March 14th

Detecting Substrates of Proteolysis via SILAC Assays
Dr. William Forrest

SILAC (Stable Isotope Labeling by Amino acids in Cell culture) screens in proteomics study parallel "heavy" and "light" cell cultures. Proteins from the heavy culture contain amino acids engineered to have a higher nominal molecular weight than their counterpart amino acids in the light culture. By tracking paired discrepancies in peptide masses across varying experimental conditions, a range of questions about the fates of proteins within living cells can be explored.

We consider a case study with an NCI-60 colon cancer cell line and an isogenic line lacking Bax, a regulator of apoptosis. A SILAC assay was conducted to learn about the effects of apoptotic stimulus on proteins within the cell line both in its wild-type form and in its Bax-deficient daughter line.

We review technologies (e.g., LC-MS/MS) applied to quantify protein peptide levels. Strategies for visualization and normalization are considered. Researchers' questions are addressed using variants of established statistical tools. Links are drawn to prior approaches in molecular biology, notably cDNA microarrays.

Thursday, March 21st

Robust and Fast Linear Mixed Models for Genome-wide Association Studies
Dr. Jennifer Listgarten

Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. Genome-wide association studies, wherein genetic markers are systematically scanned for association with disease, are one window into disease processes. Naively, these associations can be identified using a simple statistical test on each hypothesized association. However, a variety of confounding variables lie hidden in the data, leading to both spurious and missed associations if not properly addressed. My talk will focus on our recent collection of work on improving linear mixed models (LMMs) for correcting for confounding variables in GWAS. Although LMMs are among the richest classes of models used today for genome-wide association studies, their use on contemporary data sets is limited because the required computations are prohibitive when many individuals are analyzed. I'll describe a new approach we've developed that scales linearly (instead of cubically) with cohort size in both run time and memory use, enabling us to analyze data sets of unprecedented size (e.g. 120,000 individuals), whereas previous approaches couldn't analyze more than 20,000 individuals. Furthermore, I'll show why the conventional use of LMMs for tackling the GWAS problem is sub-optimal, and how to improve the approach to achieve more power and better calibration of p-values.

Thursday, April 4th

Statistical Challenges in the Analysis of Resequencing Data
Professor Chiara Sabatti
Department of Statistics, Stanford University

We consider the analysis of a data set with a structure that is becoming increasingly common. A set of genomic regions, whose variation had been previously associated with a group of phenotypes, has been resequenced in approximately 6,000 subjects. Investigators are interested in which one, among the many genetic variants, is functionally associated with the phenotypes. Statisticians should provide a ranking of variants in terms of their potential causal effect. We consider model selection approaches in the context of multivariate regression, a Bayesian framework, and false discovery rate controlling procedures. We compare the outcomes and reflect on the advantages of the different procedures.

Thursday, April 11th

Can One Increase the Power of a GWAS Study Using a Combination of Resampling DNA Samples and Pooled DNA Genotyping?
Dr. Terry Neeman
Statistical Consulting Unit, Australian National University

Genotyping pooled DNA has been described for GWAS as an effective means of significantly reducing costs over individual genotyping. This strategy has the potential to detect modest associations of genomic regions in complex disorders using tens, instead of thousands, of microarrays or SNP chips. In order to account for the increased noise in pooled data, one typically runs replicates and one may even do some individual genotyping on selected markers to "truth-test" the pooled sample. Today I'll review a paper that combines genotyping pooled DNA with resampling and claims to augment the power, over and above what one would get with individual genotyping. The authors apply this strategy (pooled/bootstrap-based GWAS or pbGWAS) to a pedigree with an E280A mutation, a fully penetrant mutation for Alzheimer's disease. The goal is to identify new loci that will influence the age of onset.

Thursday, April 18th

Bringing out New Information from Microarray Data: Quantile Inference for Differential Expression
Lorenzo Maragoni
Department of Statistical Sciences, University of Padua, Italy

With the advent of new high-throughput sequencing technologies, research on microarray data has apparently suffered a slight setback. Yet microarray data are a gold mine of information that possibly has not yet been completely exploited. Also, the implications of routine operations, such as data transformation or corrections for multiple testing, have not yet been thoroughly investigated. Quantile-based inference methods could be helpful in providing robust and powerful statistical tools of analysis, useful both for discovering new information and for taking into account the peculiar nature of microarray data. This talk introduces a novel quantile-based statistic for the identification of differential expression, proves the theoretical results behind it, and opens the way to promising applications and possible extensions to sequencing data.

Thursday, April 25th

Efficient RNA Isoform Identification and Quantification from RNA-Seq Data with Network Flows
Dr. Laurent Jacob
Department of Statistics, UC Berkeley

Several state-of-the-art methods for isoform identification and quantification are based on sparse probabilistic models, such as Lasso regression. However, explicitly listing the -- possibly exponentially -- large set of candidate transcripts is intractable for genes with many exons. For this reason, existing approaches using sparse models are either restricted to genes with few exons, or only run the regression algorithm on a small set of pre-selected isoforms. We introduce in this paper a new technique, called FlipFlop, based on network flow optimization, which can efficiently tackle the sparse estimation problem on the full set of candidate isoforms. By removing the need for a preselection step, we obtain better isoform identification while keeping a low computational cost. Experiments with synthetic and real single-end RNA-Seq data confirm that our approach is more accurate than alternative methods and one of the fastest available.

Thursday, May 2nd

Genetic Divergence and Therapy-Driven Evolution between Primary and Recurrent Glioma
Professor Joseph F. Costello
Department of Neurological Surgery, UC San Francisco


Thursday, May 9th

Integrative Cancer Genomics
Dr. Steffen Durinck

We recently published two integrative cancer genomics papers, one focused on colon cancer (Seshagiri et al. Nature 2012) and the other on small cell lung cancer (Rudin et al. Nature Genetics). In this talk I will discuss the biological findings and the integrative analysis approaches which led to these two publications.

Thursday, May 16th

Statistical Methods and Tools for Quantitative Mass Spectrometry-Based Proteomics
Professor Olga Vitek
Department of Statistics, Purdue University

Mass spectrometry-based proteomics studies proteins in complex biological mixtures. It enables a global quantification of relatively abundant proteins with liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS), and a targeted quantification of lower-abundance proteins with selected reaction monitoring (SRM). The accuracy of the workflows can be enhanced by chemical or metabolic labeling.

Stochastic variation and uncertainty are hallmarks of proteomic experiments, and statistical experimental design and analysis are key. Our goal is to develop statistical methodology for quantitative proteomics to (i) accurately quantify proteins that change in abundance between conditions, (ii) design cost-effective experiments that do not compromise the accuracy, and (iii) maximize biologically relevant interpretations. While some of these goals can be achieved with standard statistical tools, others require new and more complex solutions. This talk will discuss the methods that we recently developed, as well as the open-source software available at www.stat.purdue.edu/~ovitek.

Slides: [PDF]
Article: L. Kall and O. Vitek (2011). Computational mass spectrometry-based proteomics. PLoS Computational Biology. [PDF] [WWW]