PB HLTH 295, Section 001
Statistics and Genomics Seminar
Fall 2013
Thursday, August 29th
The Dynamical Genome: Preliminaries Toward Multi-Scale Models of Animal Physiology
Dr. Ben Brown
Lawrence Berkeley National Laboratory
Genomes are dynamic across time and space. The information they encode is deployed differently in each organ system, tissue, and cell within an animal. In the last few years, the genomic sciences have developed the technology to probe not just the sequence of a reference genome, but also the dynamical processes by which genetic information gives rise to the astoundingly complex and diverse biology of metazoans. I will discuss a project at the forefront of modern genome dynamics, in transcriptomics: to probe and decipher the dynamics of the transcriptome of Drosophila melanogaster, we generated poly(A)+ RNA sequencing data from dissected organ systems, environmental perturbations, and cell lines, with a cumulative depth of 44,000-fold coverage of the poly(A)+ transcriptome. We identified new genes, transcripts, and proteins, as well as transcriptional phenomena previously unobserved in invertebrates. I will highlight the major conclusions and biological insights of this study, how these have changed our view of animal genomes, and the role of statistical analysis in the project. I will then point to some of the major outstanding statistical and computational problems in my field, and the sea change that will occur if we can conquer them. Finally, I will outline some of the major data production efforts driven by local labs over the next three to five years, which should be of substantial interest to first-year students in particular.
Thursday, September 5th
Timing Chromosomal Abnormalities using Mutation Data
Professor Elizabeth Purdom
Department of Statistics, UC Berkeley
Tumors accumulate large numbers of mutations and other chromosomal abnormalities due to the breakdown of genomic repair mechanisms that is a hallmark of cancer. However, not all of these abnormalities are believed to be crucial for tumor growth and progression. One important indicator of an abnormality's importance is the order in which it occurred relative to other abnormalities: early events may be critical abnormalities, and possibly targets for drug treatment or early diagnosis. Outside of animal models, we generally do not have tumor samples from multiple time points in the progression of the disease, but only a single sample from the time the tumor was removed. Therefore we cannot directly observe the temporal ordering of genomic abnormalities.
However, the distribution of allele frequencies within regions with copy number aberrations provides information about when the chromosomal abnormality occurred relative to other abnormalities in the tumor. Using sequencing data, we develop a probabilistic model for the observed allele frequency of a mutation (defined as the proportion of the reads covering the nucleotide position that contain the mutation) that allows us to order abnormalities within a tumor. Our method gives novel insight into the biology of tumor progression through a quantitative evaluation of the temporal ordering of chromosomal abnormalities. Moreover, it gives a quantitative measure to compare across samples for highlighting driver mutations and events.
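To illustrate the intuition behind allele-frequency-based timing (a minimal sketch, not the speakers' actual model; it assumes a pure tumor sample and a region with a single-copy gain): a mutation that arose before a duplication is carried by two of the three copies, one that arose after by only one, and a binomial likelihood on the read counts distinguishes the two scenarios.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def mutation_timing(alt_reads, total_reads):
    """Compare likelihoods of a mutation's read counts in a pure tumor
    region where one parental copy was duplicated (total copy number 3).
    A mutation that arose BEFORE the gain lies on 2 of 3 copies
    (expected allele frequency 2/3); AFTER, on 1 of 3 (1/3)."""
    return {
        "before_gain": binom_pmf(alt_reads, total_reads, 2 / 3),
        "after_gain": binom_pmf(alt_reads, total_reads, 1 / 3),
    }

likelihoods = mutation_timing(62, 90)   # 62 of 90 reads carry the mutation
timing = max(likelihoods, key=likelihoods.get)
```

In practice the model must also account for normal-cell contamination and uncertainty in the copy-number state, which is where the probabilistic machinery of the talk comes in.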
Thursday, September 12th
Assigning Statistical Significance in High-Dimensional Problems
Professor Peter Buhlmann
Department of Mathematics, ETH Zurich, Switzerland
High-dimensional data, where the number of variables is much larger than
sample size, occur in many applications nowadays. During the last decade,
remarkable progress has been achieved in terms of point estimation and
computation. However, one of the core statistical tasks, namely to quantify
uncertainty or to assign statistical significance, is still in its infancy
for many problems and models. We present examples from genomics (motif
regression, gene-phenotype associations), two approaches for assigning
significance and confidence, and aspects of corresponding statistical theory.
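One generic route to honest p-values when p >> n is sample splitting: select variables on one half of the data and do inference on the other half, so selection and testing are independent. The sketch below is a hypothetical illustration of that idea (screening by marginal correlation, permutation p-values, Bonferroni over the selected set), not the speaker's exact procedures.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 1000                       # more variables than samples
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 2.0                         # three truly active variables
y = X @ beta + rng.standard_normal(n)

# Split the samples: screen on half A, test on half B, so that
# variable selection and inference use independent data.
A, B = slice(0, n // 2), slice(n // 2, n)

# Screening on half A: keep the k variables most associated with y.
k = 5
score = np.abs(X[A].T @ (y[A] - y[A].mean()))
selected = np.argsort(score)[-k:]

def perm_pvalue(x, yy, n_perm=999):
    """Permutation p-value for the absolute correlation of x with yy."""
    obs = abs(np.corrcoef(x, yy)[0, 1])
    hits = sum(abs(np.corrcoef(rng.permutation(x), yy)[0, 1]) >= obs
               for _ in range(n_perm))
    return (1 + hits) / (1 + n_perm)

# Inference on half B, Bonferroni-adjusted over the selected set.
pvals = {j: perm_pvalue(X[B, j], y[B]) for j in selected}
significant = sorted(j for j, pv in pvals.items() if pv * k < 0.05)
```

Multi-splitting (aggregating over many random splits) and de-biased estimators refine this basic scheme considerably; those refinements are the subject of the talk.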
Thursday, September 19th
Investigating the Molecular Basis of Neuronal Circuit Formation
Dr. Woj M. Wojtowicz
Bowes Research Fellow, Department of Molecular and Cell Biology, UC Berkeley
The human brain comprises some 10^12 neurons that form a precise circuit of an estimated 10^15 synaptic connections. Proper wiring requires neurites to first navigate their way through a crowded, spaghetti-like milieu of neuronal processes to their correct target region and then, once there, to identify their appropriate synaptic partners. How these extraordinary processes are accomplished provides a fascinating problem in molecular recognition. We are studying this process in a region of the mouse retina called the inner plexiform layer (IPL), where three major classes of neurons (bipolar, amacrine and retinal ganglion cells) comprising ~60 different neuronal subtypes form connections. The IPL is a laminated structure: different subtypes of neurons grow to distinct layers and form stereotyped connections within them. Just as we rely on differences in physical traits to recognize one another, neurons perform this recognition at the molecular level using differentially expressed proteins. When navigating neurites encounter one another during development, if they express cognate ligand-receptor proteins, they engage in transient physical interactions - what might be thought of as a molecular conversation. This conversation provides instructions to neurites, which are translated into directed growth (attractive or repulsive) or the assembly of synaptic structures. We are investigating these molecular conversations using bioinformatics, molecular biology, biochemistry, cell biology and bioengineering, with the goal of understanding how neuronal guidance, targeting and synaptic specificity are achieved.
Thursday, September 26th
Characterizing the Genetic Basis of Transcriptome Diversity through RNA-Sequencing
Alexis Battle
Department of Computer Science, Stanford University
Understanding the consequences of regulatory variation in the human genome remains a major challenge, with important implications for understanding gene regulation and interpreting the many disease-risk variants that fall outside of protein-coding regions. Here, we provide a direct window into the regulatory consequences of genetic variation by sequencing RNA from 922 genotyped individuals. We present a comprehensive description of the distribution of regulatory variation – by the specific expression phenotypes altered, the properties of affected genes, and the genomic characteristics of regulatory variants. We detect variants influencing expression of over ten thousand genes, and through the enhanced resolution offered by RNA-sequencing, we identify thousands of variants associated with specific phenotypes including splicing and allelic expression. Evaluating the effects of both long-range intra-chromosomal and trans (cross-chromosomal) regulation, we observe modularity in the regulatory network, with three-dimensional chromosomal configuration playing a particular role in regulatory modules within each chromosome. Further, generalizing beyond observed variants, we have analyzed the genomic properties of variants affecting both expression and splicing, and developed a Bayesian model to predict regulatory consequences of novel variants, applicable to the interpretation of individual genomes and disease studies. Finally, this cohort was interviewed extensively to record medical, behavioral, and environmental variables, offering an opportunity to study their effects at a large scale. We have explored the impact of these environmental factors on transcriptional phenotypes, in addition to their relationship with regulatory variation, observing broad changes correlated with time of day, substance use, and medication, including changes in pathways relevant to disease risk.
Together, these results represent a critical step toward characterizing the complete landscape of human regulatory variation.
Thursday, October 3rd
Gene Isoform Identification of Human ESC Transcriptome by Second/Third Generation Sequencing
Dr. Kin Fai Au
Department of Statistics, Stanford University
Although transcriptional and post-transcriptional events are detected in RNA-seq data from second-generation sequencing (SGS), full-length mRNA isoforms are not captured. On the other hand, third-generation sequencing (TGS), which yields much longer reads, currently suffers from lower raw accuracy and throughput. Here, we combine SGS and TGS with a custom-designed method, IDP, for isoform identification and quantification to generate a high-confidence isoform data set for human embryonic stem cells (hESC). We report 8,084 RefSeq-annotated isoforms detected as full length, and 5,459 isoforms predicted through statistical inference. Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified. Further characterization of the novel loci indicates that a subset is expressed in pluripotent cells but not in diverse fetal and adult tissues; moreover, reduced expression perturbs the network of pluripotency-associated genes. Results suggest that gene identification, even in well-characterized human cell lines and tissues, is likely far from complete.
Thursday, October 10th
The Extent and Impact of Rare Non-Coding Variants in Humans
Professor Stephen B. Montgomery
Stanford University School of Medicine
Recent and rapid human population expansion has led to an excess of rare genetic variants that are expected to contribute to an individual’s genetic burden of disease risk. To date, large-scale exome sequencing studies have highlighted the abundance of rare and deleterious variants within protein-coding sequences. However, in addition to protein-coding variants, rare non-coding variants are likely to be enriched in functional consequences. I will discuss our effort to characterize the impact of rare non-coding variation in a large human family and an isolated population. Further, I will discuss our effort to understand the systemic (multi-tissue) impact of highly-deleterious coding variants (or variants of unknown significance). To address this, we have developed a multiplex, microfluidics-based method for assessing the interaction of regulatory variation on deleterious protein-coding alleles identified through exome sequencing. Finally, I will discuss our efforts to understand rare and common regulatory variants underlying complex disease and will highlight new analytical approaches for the analysis of RNA sequencing data that we have applied to understanding cardiovascular and lung disease.
Thursday, October 17th
Improved Performance Evaluation of DNA Copy Number Analysis Methods in Cancer Studies
Dr. Pierre Neuvial
Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071 -- USC INRA, France
Changes in DNA copy numbers are a hallmark of cancer cells.
Therefore, the accurate detection and interpretation of such changes
are two important steps toward improved diagnosis and treatment. The
analysis of copy number profiles measured from high-throughput
technologies such as SNP microarray and DNAseq data raises a number of
statistical and bioinformatic challenges. Evaluating existing
analysis methods is particularly challenging in the absence of gold
standard data sets.
We have designed and implemented a framework to generate realistic DNA
copy number profiles of cancer samples with known parent-specific
copy-number state. This talk illustrates some of the benefits of this
approach in a practical use case: a comparison study between methods
for segmenting SNP array data into regions of constant parent-specific
copy number. This study helps identify the pros and cons of the
compared methods in terms of biologically informative parameters, such
as the signal length, the number of breakpoints, the fraction of tumor
cells in the sample, or the chip type.
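As a back-of-the-envelope illustration of how parent-specific copy number and tumor cell fraction jointly shape the observed signals (a sketch under the standard assumption of a diploid normal contaminant, not the authors' simulation framework):

```python
import math

def expected_signals(major, minor, tumor_fraction):
    """Expected total copy-number log-ratio and B-allele fraction for a
    heterozygous SNP in a tumor/normal mixture, where the tumor carries
    (major, minor) parent-specific copy numbers and the contaminating
    normal cells are diploid (1 + 1)."""
    rho = tumor_fraction
    total = (1 - rho) * 2 + rho * (major + minor)
    log_ratio = math.log2(total / 2)
    # Place the B allele on the minor parental chromosome.
    baf = ((1 - rho) * 1 + rho * minor) / total
    return log_ratio, baf

# Three regions: normal (1,1), single-copy gain (2,1), and copy-neutral
# LOH (2,0), each diluted with 30% normal cells.
profile = [expected_signals(maj, mn, 0.7)
           for maj, mn in [(1, 1), (2, 1), (2, 0)]]
```

Note that copy-neutral LOH leaves the total log-ratio at zero and is visible only in the allele fractions, which is one reason parent-specific evaluation matters when comparing segmentation methods.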
Thursday, October 24th
The Role of Spike-In Standards in the Normalization of RNA-Seq
Dr. Davide Risso
Department of Statistics, UC Berkeley
Normalization of RNA-Seq data has proven to be an essential step to
ensure accurate inference of expression levels, by correcting for
sequencing depth and other distributional differences within and
between replicate samples. Recently, the External RNA Control
Consortium (ERCC) has developed a set of 92 synthetic spike-in
standards that are now commercially available and relatively easy to
add to a standard library preparation.
In this talk, we evaluate the performance of the ERCC spike-ins and we
investigate the possibility of directly using spike-in
expression measures to normalize the data. We show that although
spike-in standards are a useful resource for evaluating accuracy in
RNA-Seq experiments, their expression measures are not stable enough
to be used to estimate even a global scaling parameter to normalize
the data.
We propose a novel normalization strategy that aims at removing
unwanted variation from the data by performing a factor analysis on a
suitable set of control genes and that can exploit spike-in controls
when they are present in the library, without relying exclusively on
them. Our novel approach leads to more accurate estimates of
expression fold-changes and tests for differential expression,
compared with state-of-the-art normalization methods.
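The following sketch is in the spirit of the factor-analysis strategy described (estimate unwanted factors from genes assumed biologically constant, then regress them out of every gene); it is a simplified stand-in, not the authors' exact estimator.

```python
import numpy as np

def remove_unwanted_variation(log_counts, control_rows, k=1):
    """Estimate k factors of unwanted variation from control genes and
    regress them out of every gene. Rows are genes, columns are samples."""
    Y = log_counts - log_counts.mean(axis=1, keepdims=True)
    # Sample-level factors: top right singular vectors of the controls.
    _, _, vt = np.linalg.svd(Y[control_rows], full_matrices=False)
    W = vt[:k].T                                   # samples x k
    # Project each gene onto the unwanted factors and subtract the fit.
    alpha, *_ = np.linalg.lstsq(W, Y.T, rcond=None)
    return Y - (W @ alpha).T

rng = np.random.default_rng(1)
genes, samples = 100, 12
log_expr = rng.normal(5.0, 0.1, size=(genes, samples))
log_expr[:, 6:] += 2.0            # a batch shift affecting every gene
controls = np.arange(20)          # genes assumed biologically constant
cleaned = remove_unwanted_variation(log_expr, controls, k=1)
```

Because the batch shift dominates the control genes' variation, the first factor captures it, and regressing it out removes the shift from all genes, controls and non-controls alike.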
Thursday, November 7th
GC-Content Bias in RNA-seq: A Single Base Model
Dr. Yuval Benjamini
Department of Statistics, Stanford University
GC-content bias is a primary confounder of sequencing analysis in both DNA-seq and ChIP-seq. This bias describes the varying coverage rates associated with the local number of G and C bases, both across regions and between technical replicates. RNA-seq differs in two important ways: there are many additional biases introduced by the biological pipeline, and there is no obvious "background" on which to estimate the bias.
We propose a refined model for the GC bias in RNA-seq that can be fit on within-transcript coverage variability. We compare our approach to other models for bias in RNA-seq, and discuss implications for differential-expression testing.
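A crude, assumption-laden stand-in for such a bias estimate is the conditional mean of coverage given GC fraction, computed by binning windows on their GC content; the single-base model of the talk refines this considerably.

```python
import numpy as np

def gc_bias_curve(gc_fractions, read_counts, n_bins=10):
    """Mean read count per GC-content bin: the empirical coverage rate
    as a function of local GC fraction. Empty bins are left as NaN."""
    bins = np.minimum((gc_fractions * n_bins).astype(int), n_bins - 1)
    curve = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            curve[b] = read_counts[in_bin].mean()
    return curve

# Simulated windows whose coverage peaks at a GC fraction of 0.45,
# a unimodal preference of the sort reported for sequencing data.
rng = np.random.default_rng(2)
gc = rng.uniform(0.2, 0.8, size=5000)
rate = np.exp(-((gc - 0.45) ** 2) / 0.02)
counts = rng.poisson(20 * rate)
curve = gc_bias_curve(gc, counts)
```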
This is joint work with Davide Risso and Terry Speed.
Thursday, November 21st
Signatures of Error-Prone Polymerase Activity in Human Genomic Variation Data
Kelley Harris
Department of Mathematics, UC Berkeley
About 2% of human genetic polymorphisms have been hypothesized to arise via multinucleotide mutations (MNMs), complex events that generate SNPs at multiple sites in a single generation. MNMs have the potential to accelerate the pace at which single genes evolve and to confound studies of demography and selection that assume all SNPs arise independently. However, little is known about the mechanisms that govern where and when MNMs arise. In this paper, we examine clustered mutations that are segregating in human whole-genome sequencing data and demonstrate the presence of MNMs using multiple lines of evidence. We estimate the percentage of linked SNP pairs that were generated by simultaneous mutation as a function of the distance between the affected sites and show that the multinucleotide mutational process generates a high percentage of transversions relative to transitions. These findings are reproducible in data from multiple sequencing platforms and cannot be attributed to sequencing errors. Among tandem mutations that occur simultaneously at adjacent sites, we find an especially skewed distribution of ancestral and derived dinucleotides, with GC -> AA, GA -> TT and their reverse complements making up 36% of the total. These same mutations dominate the spectrum of tandem mutations produced by the upregulation of low-fidelity Polymerase ζ in "mutator" strains of Saccharomyces cerevisiae. This suggests that low-fidelity DNA replication by Pol ζ is at least partly responsible for the MNMs that are segregating in the human population, and that further information about the biochemistry of MNM can be extracted from the spectrum of linked SNPs in ordinary population genomic data.
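Tallying tandem mutation classes as in the abstract requires collapsing each ancestral/derived dinucleotide pair with its reverse complement; a small sketch of that bookkeeping (the counts below are toy data, not the study's):

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def revcomp(s):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(s))

def canonical_tandem(ancestral, derived):
    """Collapse a tandem (dinucleotide) mutation with its reverse
    complement into a single class, e.g. GC->TT is counted with
    GC->AA, and TC->AA with GA->TT."""
    forward = (ancestral, derived)
    reverse = (revcomp(ancestral), revcomp(derived))
    return min(forward, reverse)

observed = [("GC", "AA"), ("GC", "TT"), ("GA", "TT"),
            ("TC", "AA"), ("CA", "TG")]
spectrum = Counter(canonical_tandem(a, d) for a, d in observed)
```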
Tuesday, December 3rd
On Segmentation of DNA Copy Number Profiles
Professor Jean-Philippe Vert
Mines ParisTech and Institut Curie, Paris, France
DNA reorganization, including amplification and deletion of particular
genomic loci, is a hallmark of most cancers. Microarray- or
sequencing-based technologies now allow to capture genome-wide
profiles of DNA copy numbers, and give in particular information about
locations of DNA breakpoints. In this talk, I will discuss several
methods to identify breakpoints in noisy signals, and highlight in
particular a method involving partial expert annotation to boost the
performance of existing techniques and automatically tune the number
of breakpoints called.
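One of the simplest breakpoint-detection strategies for a piecewise-constant signal in noise is binary segmentation: repeatedly split at the point that most reduces the residual sum of squares. The sketch below illustrates the idea only; it is not claimed to be among the methods compared in the talk, and the stopping threshold min_gain is an ad hoc stand-in for the tuning problem the talk addresses.

```python
import numpy as np

def best_split(x):
    """Return the split index minimizing the two-segment residual sum
    of squares, and the gain over leaving the segment unsplit."""
    total = ((x - x.mean()) ** 2).sum()
    best_i, best_cost = None, total
    for i in range(1, len(x)):
        left, right = x[:i], x[i:]
        cost = (((left - left.mean()) ** 2).sum()
                + ((right - right.mean()) ** 2).sum())
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i, total - best_cost

def binary_segmentation(x, min_gain):
    """Recursively split at the most gainful change point until no
    split improves the fit by at least min_gain."""
    i, gain = best_split(x)
    if i is None or gain < min_gain:
        return []
    return (binary_segmentation(x[:i], min_gain)
            + [i]
            + [i + j for j in binary_segmentation(x[i:], min_gain)])

rng = np.random.default_rng(3)
signal = np.concatenate([rng.normal(m, 0.3, 50) for m in (0.0, 1.5, 0.5)])
breakpoints = sorted(binary_segmentation(signal, min_gain=5.0))
```

Choosing min_gain (equivalently, the number of breakpoints) is exactly the kind of decision the expert-annotation approach of the talk is designed to calibrate.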
Thursday, December 5th
Semi-Parametric Robust Methods for Biomarker Discovery Among Potential Confounders in Small but High Dimensional Data Sets: A Marriage of Targeted Maximum Likelihood Estimation and LIMMA
Dr. Sara Kherad
Division of Biostatistics, UC Berkeley
Exploratory analysis of high dimensional data has received much attention since the explosion of high-throughput technology allows simultaneous screening of tens of thousands of characteristics (genomics, metabolomics, proteomics, etc.). Though some of the general approaches, such as GWAS, are transferable, what has received less focus is 1) how to estimate independent associations in the context of many competing causes without resorting to a misspecified model, and 2) how to derive accurate small-sample inference when data-adaptive techniques are used in this context. We present the method in the context of a study of miRNA expression for an environmental exposure. Specifically, the analysis faces not just a large number of comparisons, but also the need to tease apart the association of miRNA expression with an exposure from confounders such as age, race, smoking status, BMI, etc. Our goal is to propose a method that is reasonably robust in small samples but does not rely on misspecified (arbitrary) parametric assumptions, and thus is based on data-adaptive methods. The methodology proposed is a powerful combination of existing semi-parametric statistical methods and theory, as well as a simple framework for using common empirical Bayes approaches to aid small-sample inference. We propose using targeted maximum likelihood estimation (TMLE) for estimating variable importance measures, along with a general adaptation of the commonly used limma approach, which relies on specification of the so-called influence curve of the proposed estimator. The result is a machine-based approach that can estimate independent associations in high dimensional data while protecting against the unreliability of small-sample inference that can result from data-adaptive estimation in relatively small samples.
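The empirical-Bayes ingredient borrowed from limma is variance moderation: each gene's sample variance is shrunk toward a prior variance, weighted by the respective degrees of freedom. A minimal sketch of that shrinkage step (in limma the prior parameters are estimated from the ensemble of genes; here they are supplied by hand):

```python
def moderated_variance(s2, df, s2_prior, df_prior):
    """Shrink a gene-wise sample variance s2 (with df degrees of
    freedom) toward a prior variance s2_prior carrying df_prior
    degrees of freedom, as in limma's moderated statistics."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)
```

With df_prior = 0 there is no shrinkage and the ordinary variance is returned; large df_prior pulls small-sample variance estimates strongly toward the prior, which is what stabilizes inference when n is small.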
Thursday, December 12th
Genealogies in Rapidly Adapting Populations
Professor Oskar Hallatschek
Department of Physics, UC Berkeley
The genetic diversity of a species is shaped by its recent evolutionary history and can be used to infer demographic events or selective sweeps. Most inference methods are based on the null hypothesis that natural selection is a weak or infrequent evolutionary force. However, many species, particularly pathogens, are under continuous pressure to adapt in response to changing environments. A statistical framework for inference from diversity data of such populations is currently lacking. Towards this goal, we have explored the properties of genealogies in a model of continual adaptation in asexual populations. We found that lineages trace back to a small pool of highly fit ancestors, in which almost simultaneous coalescence of more than two lineages frequently occurs. Whereas such multiple mergers are unlikely under the neutral (Kingman) coalescent, they create a unique genetic footprint in adapting populations. The site frequency spectrum of derived neutral alleles, for example, is non-monotonic and has a peak at high frequencies. We argue that multiple merger coalescents generically arise in populations that are dominated by a small pool of distinguished individuals. Beyond rapid adaptation, this occurs for instance in spatial range expansions, where it can lead to the phenomenon of gene surfing.
Largely based on
R. A. Neher and O. Hallatschek, Genealogies in rapidly adapting populations, PNAS, 110(2): 437-442, 2013.