PB HLTH 295, Section 001
Statistics and Genomics Seminar


Fall 2013


Thursday, August 29th

The Dynamical Genome: Preliminaries Toward Multi-Scale Models of Animal Physiology
Dr. Ben Brown
Lawrence Berkeley National Laboratory

Genomes are dynamic across time and space. The information they encode is deployed differently in each organ system, tissue, and cell within an animal. In the last few years, the genomic sciences have developed the technology to probe not just the sequence of a reference genome, but the dynamical processes via which genetic information gives rise to the astoundingly complex and diverse biology of metazoans. I will discuss a project at the forefront of modern genome dynamics, focused on transcriptomics: to probe and decipher the dynamics of the transcriptome of Drosophila melanogaster, we generated poly(A)+ RNA sequencing data from dissected organ systems, environmental perturbations, and cell lines, with a cumulative depth of 44,000-fold coverage of the poly(A)+ transcriptome. We identified new genes, transcripts, and proteins, but also new transcriptional phenomena previously unobserved in invertebrates. I will highlight the major conclusions and biological insights of this study, how these have changed our view of animal genomes, and the role of statistical analysis in this project. I will then point to some of the major outstanding statistical and computational problems in my field, and the sea change that will occur if we can conquer them. Finally, I will outline some of the major data production efforts driven by local labs over the next three to five years, which should be of substantial interest to first-year students in particular.


Thursday, September 5th

Timing Chromosomal Abnormalities using Mutation Data
Professor Elizabeth Purdom
Department of Statistics, UC Berkeley

Tumors accumulate large numbers of mutations and other chromosomal abnormalities due to the breakdown in genomic repair mechanisms that is a hallmark of cancer. However, not all of these abnormalities are believed to be crucial for tumor growth and progression. One important indicator of an abnormality's importance is the order in which it occurred relative to other abnormalities: early events may be critical abnormalities, and possibly targets for drug treatment or early diagnosis. Outside of animal models, we generally will not have tumors from multiple time points in the progression of the disease, but only from the single time point at which the tumor was removed. Therefore we cannot directly observe the temporal ordering of genomic abnormalities.

However, the distribution of allele frequencies within regions with copy number aberrations provides information about when the chromosomal abnormality occurred, relative to other abnormalities in the tumor. Using sequencing data, we develop a probabilistic model for the observed allele frequency of a mutation (defined as the proportion of the number of reads covering the nucleotide position that contain the mutation) that allows us to order abnormalities within a tumor. Our method gives a novel insight into the biology of tumor progression through a quantitative evaluation of temporal ordering of chromosomal abnormalities. Moreover it gives a quantitative measure to compare across samples for highlighting driver mutations and events.
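The key intuition can be sketched with a toy calculation (illustrative only: a hypothetical one-copy gain in a 100%-pure tumor; the model presented in the talk additionally handles normal contamination and the statistical uncertainty of the ordering):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_vaf(copies_with_mut, total_copies, purity=1.0):
    """Expected variant allele frequency for a mutation carried on
    `copies_with_mut` of `total_copies` tumor copies, diluted by
    normal cells (2 unmutated copies) when purity < 1."""
    mutated = purity * copies_with_mut
    total = purity * total_copies + (1.0 - purity) * 2
    return mutated / total

# Region with a one-copy gain (3 copies total).  A mutation acquired
# BEFORE the gain was duplicated along with its chromosome and sits on
# 2 of 3 copies; one acquired AFTER the gain sits on only 1 of 3.
early = expected_vaf(2, 3)   # 2/3
late = expected_vaf(1, 3)    # 1/3

# Observed allele frequencies at 100x coverage scatter binomially
# around these expectations, which is what a probabilistic model exploits.
depth = 100
early_obs = rng.binomial(depth, early, size=500) / depth
late_obs = rng.binomial(depth, late, size=500) / depth
print(round(early_obs.mean(), 2), round(late_obs.mean(), 2))
```

The separation between the two clusters of observed frequencies is what makes the "before or after the gain" question statistically answerable from a single sample.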


Thursday, September 12th

Assigning Statistical Significance in High-Dimensional Problems
Professor Peter Bühlmann
Department of Mathematics, ETH Zurich, Switzerland

High-dimensional data, where the number of variables is much larger than sample size, occur in many applications nowadays. During the last decade, remarkable progress has been achieved in terms of point estimation and computation. However, one of the core statistical tasks, namely to quantify uncertainty or to assign statistical significance, is still in its infancy for many problems and models. We present examples from genomics (motif regression, gene-phenotype associations), two approaches for assigning significance and confidence, and aspects of corresponding statistical theory.
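One classical approach in this spirit is sample splitting: select variables on one half of the data, then compute ordinary significance tests on the other half. A toy sketch (simulated data; marginal correlation screening stands in for the lasso, and none of this is the speaker's specific estimator):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 500                       # many more variables than samples
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 2.0                        # only variables 0, 1, 2 are active
y = X @ beta + rng.standard_normal(n)

# Step 1: select candidate variables on the first half of the data
# (marginal correlation screening as a toy stand-in for the lasso).
half = n // 2
score = np.abs(X[:half].T @ (y[:half] - y[:half].mean()))
selected = np.sort(np.argsort(score)[-10:])

# Step 2: classical OLS t-statistics on the second half; this is valid
# because the selection step never saw these observations.
Xt, yt = X[half:][:, selected], y[half:]
coef, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
resid = yt - Xt @ coef
sigma2 = resid @ resid / (half - selected.size)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xt.T @ Xt)))
flagged = selected[np.abs(coef / se) > 2.0]
print(sorted(int(v) for v in flagged))   # the active variables 0, 1, 2 survive
```

A single split wastes half the data and gives p-values that depend on the random split; refinements aggregate over many random splits.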


Thursday, September 19th

Investigating the Molecular Basis of Neuronal Circuit Formation
Dr. Woj M. Wojtowicz
Bowes Research Fellow, Department of Molecular and Cell Biology, UC Berkeley

The human brain comprises some 10^12 neurons that form a precise circuit with an estimated 10^15 synaptic connections. Proper wiring requires neurites to first navigate their way through a crowded, spaghetti-like milieu of neuronal processes to their correct target region and then, once there, identify their appropriate synaptic partners. How these extraordinary processes are accomplished provides a fascinating problem in molecular recognition. We are studying this process in a region of the mouse retina called the inner plexiform layer (IPL), where three major classes of neurons (bipolar, amacrine, and retinal ganglion cells) comprising ~60 different neuronal subtypes form connections. The IPL is a laminated structure, and different subtypes of neurons grow to distinct layers and form stereotyped connections within them. Just as we rely on differences in physical traits to recognize one another, neurons perform this recognition at the molecular level using differentially expressed proteins. When navigating neurites encounter one another during development, if they express cognate ligand-receptor proteins, they engage in transient physical interactions - what might be thought of as a molecular conversation. This conversation provides instructions to neurites, which are translated into directed growth (attractive or repulsive) or the assembly of synaptic structures. We are investigating these molecular conversations using bioinformatics, molecular biology, biochemistry, cell biology, and bioengineering, with the goal of understanding how neuronal guidance, targeting, and synaptic specificity are achieved.


Thursday, September 26th

Characterizing the Genetic Basis of Transcriptome Diversity through RNA-Sequencing
Alexis Battle
Department of Computer Science, Stanford University

Understanding the consequences of regulatory variation in the human genome remains a major challenge, with important implications for understanding gene regulation and interpreting the many disease-risk variants that fall outside of protein-coding regions. Here, we provide a direct window into the regulatory consequences of genetic variation by sequencing RNA from 922 genotyped individuals. We present a comprehensive description of the distribution of regulatory variation – by the specific expression phenotypes altered, the properties of affected genes, and the genomic characteristics of regulatory variants. We detect variants influencing expression of over ten thousand genes, and through the enhanced resolution offered by RNA-sequencing, we identify thousands of variants associated with specific phenotypes including splicing and allelic expression. Evaluating the effects of both long-range intra-chromosomal and trans (cross-chromosomal) regulation, we observe modularity in the regulatory network, with three-dimensional chromosomal configuration playing a particular role in regulatory modules within each chromosome. Further, generalizing beyond observed variants, we have analyzed the genomic properties of variants affecting both expression and splicing, and developed a Bayesian model to predict regulatory consequences of novel variants, applicable to the interpretation of individual genomes and disease studies. Finally, this cohort was interviewed extensively to record medical, behavioral, and environmental variables, offering an opportunity to study their effects at a large scale. We have explored the impact of these environmental factors on transcriptional phenotypes, in addition to their relationship with regulatory variation, observing broad changes correlated with time of day, substance use, and medication, including changes in pathways relevant to disease risk.
Together, these results represent a critical step toward characterizing the complete landscape of human regulatory variation.


Thursday, October 3rd

Gene Isoform Identification of Human ESC Transcriptome by Second/Third Generation Sequencing
Dr. Kin Fai Au
Department of Statistics, Stanford University

Although transcriptional and post-transcriptional events are detected in RNA-seq data from second-generation sequencing (SGS), full-length mRNA isoforms are not captured. On the other hand, third generation sequencing (TGS), which yields much longer reads, has current limitations of lower raw accuracy and throughput. Here, we combine SGS and TGS with a custom-designed method, IDP, for isoform identification and quantification to generate a high confidence isoform data set for human embryonic stem cells (hESC). We report 8,084 RefSeq-annotated isoforms detected as full length, and 5,459 isoforms predicted through statistical inference. Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified. Further characterization of the novel loci indicates that a subset is expressed in pluripotent cells but not in diverse fetal and adult tissues; moreover, reduced expression perturbs the network of pluripotency-associated genes. Results suggest that gene identification, even in well-characterized human cell lines and tissues, is likely far from complete.


Thursday, October 10th

The Extent and Impact of Rare Non-Coding Variants in Humans
Professor Stephen B. Montgomery
Stanford University School of Medicine

Recent and rapid human population expansion has led to an excess of rare genetic variants that are expected to contribute to an individual’s genetic burden of disease risk. To date, large-scale exome sequencing studies have highlighted the abundance of rare and deleterious variants within protein-coding sequences. However, in addition to protein-coding variants, rare non-coding variants are likely to be enriched in functional consequences. I will discuss our effort to characterize the impact of rare non-coding variation in a large human family and an isolated population. Further, I will discuss our effort to understand the systemic (multi-tissue) impact of highly-deleterious coding variants (or variants of unknown significance). To address this, we have developed a multiplex, microfluidics-based method for assessing the interaction of regulatory variation on deleterious protein-coding alleles identified through exome sequencing. Finally, I will discuss our efforts to understand rare and common regulatory variants underlying complex disease and will highlight new analytical approaches for the analysis of RNA sequencing data that we have applied to understanding cardiovascular and lung disease.


Thursday, October 17th

Improved Performance Evaluation of DNA Copy Number Analysis Methods in Cancer Studies
Dr. Pierre Neuvial
Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071 -- USC INRA, France

Changes in DNA copy numbers are a hallmark of cancer cells. Therefore, the accurate detection and interpretation of such changes are two important steps toward improved diagnosis and treatment. The analysis of copy number profiles measured from high-throughput technologies such as SNP microarray and DNAseq data raises a number of statistical and bioinformatic challenges. Evaluating existing analysis methods is particularly challenging in the absence of gold standard data sets.

We have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known parent-specific copy-number states. This talk illustrates some of the benefits of this approach in a practical use case: a comparison study between methods for segmenting SNP array data into regions of constant parent-specific copy number. This study helps identify the pros and cons of the compared methods in terms of biologically informative parameters, such as the signal length, the number of breakpoints, the fraction of tumor cells in the sample, and the chip type.
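A minimal sketch of the generation idea (hypothetical states and noise levels; the actual framework additionally models tumor purity, chip-specific noise, and other parameters listed above):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_profile(states, seg_len=500, sd=0.3):
    """Piecewise-constant (major, minor) parent-specific copy numbers with
    Gaussian measurement noise; returns signals plus the true breakpoints."""
    major = np.concatenate([np.full(seg_len, m) for m, _ in states])
    minor = np.concatenate([np.full(seg_len, mi) for _, mi in states])
    total = (major + minor) + rng.normal(0.0, sd, major.size)
    # B-allele fraction: share of the minor parental copy at each locus
    baf = minor / np.maximum(major + minor, 1) + rng.normal(0.0, sd / 4, major.size)
    breakpoints = [seg_len * i for i in range(1, len(states))]
    return total, baf, breakpoints

# Normal (1,1) -> one-copy gain (2,1) -> copy-neutral LOH (2,0)
total, baf, bkps = simulate_profile([(1, 1), (2, 1), (2, 0)])
print(bkps)   # [500, 1000]
```

Because the breakpoints and parent-specific states are known by construction, any segmentation method's output can be scored against them directly.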


Thursday, October 24th

The Role of Spike-In Standards in the Normalization of RNA-Seq
Dr. Davide Risso
Department of Statistics, UC Berkeley

Normalization of RNA-Seq data has proven to be an essential step to ensure accurate inference of expression levels, by correcting for sequencing depth and other distributional differences within and between replicate samples. Recently, the External RNA Control Consortium (ERCC) has developed a set of 92 synthetic spike-in standards that are now commercially available and relatively easy to add to a standard library preparation. In this talk, we evaluate the performance of the ERCC spike-ins and we investigate the possibility of directly using spike-in expression measures to normalize the data. We show that although spike-in standards are a useful resource for evaluating accuracy in RNA-Seq experiments, their expression measures are not stable enough to be used to estimate even a global scaling parameter to normalize the data. We propose a novel normalization strategy that aims at removing unwanted variation from the data by performing a factor analysis on a suitable set of control genes and that can exploit spike-in controls when they are present in the library, without relying exclusively on them. Our novel approach leads to more accurate estimates of expression fold-changes and tests for differential expression, compared with state-of-the-art normalization methods.
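A schematic of the remove-unwanted-variation idea on simulated log-expression values (illustrative dimensions and a single unwanted factor; this is the spirit of the factor-analysis approach described above, not the authors' exact estimator):

```python
import numpy as np

rng = np.random.default_rng(3)
genes, samples, k = 200, 10, 1

# Simulated log-expression: gene baselines plus one sample-level
# unwanted factor (e.g. library preparation batch) loading on every gene.
baseline = rng.normal(5.0, 1.0, size=(genes, 1))
unwanted = rng.normal(0.0, 1.0, size=(1, samples))
loadings = rng.normal(1.0, 0.2, size=(genes, 1))
Y = baseline + loadings @ unwanted + rng.normal(0.0, 0.1, (genes, samples))

controls = np.arange(50)   # genes assumed unaffected by the biology of interest

# Factor analysis on the centered control genes via SVD; the leading
# right singular vectors estimate the unwanted factors.
Yc = Y[controls] - Y[controls].mean(axis=1, keepdims=True)
W = np.linalg.svd(Yc, full_matrices=False)[2][:k].T   # samples x k

# Regress every gene on W and keep the residuals (plus gene means).
Z = (Y - Y.mean(axis=1, keepdims=True)).T
alpha = np.linalg.lstsq(W, Z, rcond=None)[0]
Y_norm = Y - (W @ alpha).T

print(abs(np.corrcoef(W[:, 0], unwanted[0])[0, 1]) > 0.9)   # True: factor recovered
```

The point of using control genes (or spike-ins, when present) is that their variation across samples can be attributed to unwanted technical effects rather than biology.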


Thursday, November 7th

GC-Content Bias in RNA-seq: A Single Base Model
Dr. Yuval Benjamini
Department of Statistics, Stanford University

GC-content bias is a primary confounder of sequencing analysis in both DNA-seq and ChIP-seq. This bias describes the varying coverage rates associated with the local number of G and C bases, both across regions and between technical replicates. RNA-seq differs in two important ways: there are many additional biases introduced by the biological pipeline, and there is no obvious "background" on which to estimate the bias.

We propose a refined model for the GC bias in RNA-seq that can be fit on within-transcript coverage variability. We compare our approach to other models for bias in RNA-seq, and discuss implications for differential-expression testing.
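The basic estimate-then-rescale idea behind GC corrections can be shown with a toy stratification (simulated coverage with an assumed quadratic bias shape; the single-base model in the talk is a refinement that conditions on the GC composition around each base within transcripts):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy transcript positions: local GC fraction, and coverage drawn with a
# unimodal rate in GC (a commonly observed empirical bias shape).
gc = rng.uniform(0.2, 0.8, size=20000)
rate = np.exp(2.0 - 8.0 * (gc - 0.5) ** 2)     # coverage peaks near 50% GC
counts = rng.poisson(rate)

# Estimate the bias curve by stratifying positions into GC bins.
bins = np.linspace(0.2, 0.8, 13)
idx = np.digitize(gc, bins) - 1
curve = np.array([counts[idx == b].mean() for b in range(bins.size - 1)])

# The fitted curve can then rescale counts: observed / predicted.
adjusted = counts / curve[idx]
print(int(curve.argmax()))   # a middle bin: estimated coverage peaks near 50% GC
```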

This is joint work with Davide Risso and Terry Speed.


Thursday, November 21st

Signatures of Error-Prone Polymerase Activity in Human Genomic Variation Data
Kelley Harris
Department of Mathematics, UC Berkeley

About 2% of human genetic polymorphisms have been hypothesized to arise via multinucleotide mutations (MNMs), complex events that generate SNPs at multiple sites in a single generation. MNMs have the potential to accelerate the pace at which single genes evolve and to confound studies of demography and selection that assume all SNPs arise independently. However, little is known about the mechanisms that govern where and when MNMs arise. In this work, we examine clustered mutations that are segregating in human whole-genome sequencing data and demonstrate the presence of MNMs using multiple lines of evidence. We estimate the percentage of linked SNP pairs that were generated by simultaneous mutation as a function of the distance between the affected sites and show that the multinucleotide mutational process generates a high percentage of transversions relative to transitions. These findings are reproducible in data from multiple sequencing platforms and cannot be attributed to sequencing errors. Among tandem mutations that occur simultaneously at adjacent sites, we find an especially skewed distribution of ancestral and derived dinucleotides, with GC -> AA, GA -> TT and their reverse complements making up 36% of the total. These same mutations dominate the spectrum of tandem mutations produced by the upregulation of low-fidelity Polymerase ζ in "mutator" strains of Saccharomyces cerevisiae. This suggests that low-fidelity DNA replication by Pol ζ is at least partly responsible for the MNMs that are segregating in the human population, and that further information about the biochemistry of MNM can be extracted from the spectrum of linked SNPs in ordinary population genomic data.
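The excess-of-close-pairs signal that makes MNMs detectable can be illustrated on simulated positions (a toy stand-in for the authors' estimator, which works with linked SNP pairs in population data; all counts and distances here are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
L = 10_000_000
n_single = 20_000      # independent point mutations
n_mnm = 500            # multinucleotide events: a pair of SNPs <= 20 bp apart

pos = list(rng.integers(0, L, n_single))
for p in rng.integers(0, L - 20, n_mnm):
    pos += [int(p), int(p) + int(rng.integers(1, 21))]
pos = np.sort(np.array(pos))

# Under independent mutation, gaps between adjacent SNPs are roughly
# exponential; simultaneous pairs show up as an excess of tiny gaps.
gaps = np.diff(pos)
expected_close = (1.0 - np.exp(-20.0 / gaps.mean())) * gaps.size
observed_close = (gaps <= 20).sum()
excess = observed_close - expected_close
print(round(excess / n_mnm, 2))   # roughly 1: each MNM adds one extra tiny gap
```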


Tuesday, December 3rd

On Segmentation of DNA Copy Number Profiles
Professor Jean-Philippe Vert
Mines ParisTech and Institut Curie, Paris, France

DNA reorganization, including amplification and deletion of particular genomic loci, is a hallmark of most cancers. Microarray- and sequencing-based technologies now make it possible to capture genome-wide profiles of DNA copy numbers, and in particular give information about the locations of DNA breakpoints. In this talk, I will discuss several methods to identify breakpoints in noisy signals, and highlight in particular a method involving partial expert annotation to boost the performance of existing techniques and automatically tune the number of breakpoints called.
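As a baseline for what "identifying breakpoints in noisy signals" means, here is a minimal binary-segmentation sketch for mean shifts (a textbook method, not any of the speaker's specific techniques; the stopping threshold is an assumption):

```python
import numpy as np

def best_split(x):
    """Best single mean-shift breakpoint, scored by the reduction in
    residual sum of squares: gain(k) = k(n-k)/n * (mean_L - mean_R)^2."""
    n = x.size
    cs = np.cumsum(x)
    k = np.arange(1, n)
    mean_l = cs[:-1] / k
    mean_r = (cs[-1] - cs[:-1]) / (n - k)
    gain = k * (n - k) / n * (mean_l - mean_r) ** 2
    j = int(gain.argmax())
    return j + 1, float(gain[j])

def binary_segmentation(x, min_gain):
    """Recursively split segments while the RSS improvement is large enough."""
    if x.size < 2:
        return []
    k, gain = best_split(x)
    if gain < min_gain:
        return []
    left = binary_segmentation(x[:k], min_gain)
    right = [k + b for b in binary_segmentation(x[k:], min_gain)]
    return left + [k] + right

# Noisy piecewise-constant "copy number" signal with breaks at 300 and 500
rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(2.0, 0.3, 300),
                    rng.normal(3.0, 0.3, 200),
                    rng.normal(1.0, 0.3, 300)])
print(binary_segmentation(x, min_gain=20.0))   # breakpoints near [300, 500]
```

Choosing `min_gain` (equivalently, the number of breakpoints) is exactly the kind of tuning problem the partial expert annotations mentioned above are designed to solve.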


Thursday, December 5th

Semi-Parametric Robust Methods for Biomarker Discovery Among Potential Confounders in Small but High Dimensional Data Sets: A Marriage of Targeted Maximum Likelihood Estimation and LIMMA
Dr. Sara Kherad
Division of Biostatistics, UC Berkeley

Exploratory analysis of high dimensional data has received much attention since the explosion of high-throughput technology allows simultaneous screening of tens of thousands of characteristics (genomics, metabolomics, proteomics, etc.). Though some of the general approaches, such as GWAS, are transferable, what has received less focus is 1) how to estimate independent associations in the context of many competing causes without resorting to a misspecified model, and 2) how to derive accurate small-sample inference when data adaptive techniques are used in this context. We present the method in the context of a study of miRNA expression for an environmental exposure. Specifically, the analysis is faced with not just a large number of comparisons, but also with teasing the association of miRNA expression with an exposure apart from confounders such as age, race, smoking status, BMI, etc. Our goal is to propose a method that is reasonably robust in small samples but does not rely on misspecified (arbitrary) parametric assumptions, and thus will be based on data adaptive methods. The methodology proposed is a powerful combination of existing semi-parametric statistical methods and theory, as well as a simple framework for use of commonly used empirical Bayes approaches to aid in small-sample inference. We propose using targeted maximum likelihood estimation (TMLE) for estimating variable importance measures, along with a general adaptation of the commonly used limma approach, which relies on specification of the so-called influence curve of the proposed estimator. The result is a machine-based approach that can estimate independent associations in high dimensional data, but protects against the unreliability of small-sample inference that can result when using data adaptive estimation in relatively small samples.
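The empirical-Bayes ingredient can be sketched in isolation (only the limma-style variance moderation is shown, on simulated per-gene variances; the TMLE and influence-curve machinery is not represented, and the prior parameters are treated as known here although limma estimates them from the data):

```python
import numpy as np

rng = np.random.default_rng(8)
genes, n = 1000, 10
d = n - 1                              # residual degrees of freedom per gene

# Per-gene sampling variances (e.g. of influence-curve-based estimates),
# with true variances drawn from a scaled inverse chi-square prior.
d0, s02 = 8.0, 0.5                     # prior degrees of freedom and prior value
true_var = d0 * s02 / rng.chisquare(d0, genes)
s2 = true_var * rng.chisquare(d, genes) / d

# Empirical Bayes moderation: shrink each raw variance toward the prior,
# weighting by the prior and residual degrees of freedom.
s2_tilde = (d0 * s02 + d * s2) / (d0 + d)

print(s2_tilde.std() < s2.std())       # True: variances are shrunk together
```

Borrowing strength across the thousands of genes in this way is what stabilizes test statistics when each gene has only a handful of observations.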


Thursday, December 12th

Genealogies in Rapidly Adapting Populations
Professor Oskar Hallatschek
Department of Physics, UC Berkeley

The genetic diversity of a species is shaped by its recent evolutionary history and can be used to infer demographic events or selective sweeps. Most inference methods are based on the null hypothesis that natural selection is a weak or infrequent evolutionary force. However, many species, particularly pathogens, are under continuous pressure to adapt in response to changing environments. A statistical framework for inference from diversity data of such populations is currently lacking. Towards this goal, we have explored the properties of genealogies in a model of continual adaptation in asexual populations. We found that lineages trace back to a small pool of highly fit ancestors, in which almost simultaneous coalescence of more than two lineages frequently occurs. Whereas such multiple mergers are unlikely under the neutral (Kingman) coalescent, they create a unique genetic footprint in adapting populations. The site frequency spectrum of derived neutral alleles, for example, is non-monotonic and has a peak at high frequencies. We argue that multiple merger coalescents generically arise in populations that are dominated by a small pool of distinguished individuals. Beyond rapid adaptation, this occurs for instance in spatial range expansions, where it can lead to the phenomenon of gene surfing.
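For contrast with the multiple-merger behavior described above, the neutral Kingman baseline can be simulated in a few lines (a standard textbook construction, not the speaker's model); its expected site frequency spectrum theta/i decreases monotonically in the derived allele count i, whereas the rapidly adapting populations in the talk show a peak at high frequencies:

```python
import numpy as np

rng = np.random.default_rng(7)

def kingman_sfs(n, theta, reps):
    """Monte Carlo average site frequency spectrum of derived neutral
    alleles under the standard (Kingman) coalescent; E[xi_i] = theta/i."""
    sfs = np.zeros(n)                    # index i = number of carriers
    for _ in range(reps):
        lineages = [1] * n               # descendant-leaf count per lineage
        while len(lineages) > 1:
            k = len(lineages)
            # epoch length while k lineages remain (pair-merge rate k(k-1)/2)
            t = rng.exponential(2.0 / (k * (k - 1)))
            for leaves in lineages:      # mutations dropped on each branch
                sfs[leaves] += rng.poisson(theta / 2.0 * t)
            i, j = sorted(rng.choice(k, size=2, replace=False))
            lineages[i] += lineages[j]   # merge a uniformly random pair
            lineages.pop(j)
        # (mutations above the root are not polymorphic, so none are added)
    return sfs / reps

sfs = kingman_sfs(n=10, theta=10.0, reps=2000)
print(np.round(sfs[1:4], 1))   # close to theta/i = [10, 5, 3.3], decreasing
```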

Largely based on R. A. Neher and O. Hallatschek, Genealogies of rapidly adapting populations, PNAS 110(2): 437-442, 2013.