PB HLTH 292, Section 020
Statistics and Genomics Seminar

Spring 2012


Thursday, January 19th

The Landscape of the Complex Transcriptome
Dr. Wu Wei
Stanford Genome Technology Center

Genome-wide pervasive transcription has been reported in many eukaryotic organisms, revealing a highly interleaved transcriptome organization that involves hundreds of non-coding RNAs. We have shown that bidirectionality is an inherent feature of promoters, where most non-coding RNAs initiate from nucleosome-free regions (NFRs) associated with the promoters of other genes. We have further provided evidence that antisense expression initiated from bidirectional promoters enables the spreading of regulatory signals from one locus to neighbouring genes. The interleaved organization of transcription and the interactions between overlapping transcripts introduce a new level of regulation. In addition to the complexity that exists across the genome in terms of expressed regions on each strand, there is an extensive hidden layer of complexity surrounding alternative overlapping transcripts at each individual gene. Our findings have implications for how this diverse transcript usage benefits organism function.


Thursday, January 26th

A Systems Biology Approach to Inform Human Health Risk Assessment of Benzene
Dr. Reuben Thomas
Division of Biostatistics, UC Berkeley

Benzene, a ubiquitous environmental hematotoxicant, causes acute myeloid leukemia (AML) and myelodysplastic syndromes and has been associated with lymphoproliferative disorders. Through its metabolites, benzene induces multiple alterations that likely contribute to the leukemogenic process. Biological plausibility for a causal role of benzene in these diseases comes from benzene’s genotoxic effects and toxicity to hematopoietic stem cells or progenitor cells, from which leukemias arise. This is manifested as lowered blood counts (hematotoxicity), even in individuals occupationally exposed to benzene below 1 ppm (the current US permissible exposure limit). Our effort used a systems biology approach, encompassing endpoints thought to be relevant to the leukemogenic process. The analyses utilized blood samples from workers exposed to relatively low levels of benzene to examine the dose-response relationship for blood cell counts, biochemical pathways, and expression of genes in the AML pathway. Multiple statistical analyses were explored, including both parametric models and non-parametric models that make minimal assumptions about model structure. The findings revealed dose-dependent responses below 1 ppm for the AML pathway genes, at both the pathway and gene expression levels. In addition, the pattern of response appears to change around the 1 ppm region. These responses are considered in view of hypothesized pathways of metabolism that may only be operative below this level. In all, these analyses aid in elucidating: 1) the adverse effects of benzene at low exposures on pathways and mechanisms relevant to hematotoxicity and leukemia; 2) susceptible populations; and 3) quantitative approaches to estimate the low-dose human health risks of benzene exposure.


Thursday, February 2nd

Analysis of DNA Methylation in the High-Throughput Sequencing Era
Meromit Singer
Computer Science Division, UC Berkeley

DNA methylation is a dynamic chemical modification that is abundant on DNA sequences and plays a central role in the regulatory mechanisms of cells. In recent years, high-throughput sequencing technologies have enabled genome-wide annotation of DNA methylation. Coupled with novel computational machinery, these developments shed new light on the biological function of this phenomenon.

In this talk we will present novel methods for the study of DNA methylation on a genome-wide scale, and their use in comparative studies. We will first present a novel statistical algorithm that produces corrected site-specific methylation states, along with the annotation of unmethylated islands, given data from a cost-effective but biased experimental method. We will then discuss the application of this method in a comparative study of genome-wide DNA methylation in three primate species: human, chimpanzee, and orangutan, revealing that these species can be distinguished based on differences in their DNA methylation that are independent of the underlying DNA sequence. We will conclude with recent results from a comparative study of DNA methylation in human intragenic regions and will discuss the characterization and correction of a bias present in such analyses.


Thursday, February 9th

Group Lasso for Genomic Data
Professor Jean-Philippe Vert
Mines ParisTech and Institut Curie

The group lasso is an extension of the popular lasso regression method that allows predefined groups of features to be selected jointly in the context of regression or supervised classification. I will discuss two extensions of the group lasso, motivated by applications in genomic data analysis. First, I will present a new fast method for multiple change-point detection in multidimensional signals, which boils down to a group lasso regression problem and allows frequent breakpoint locations to be detected in DNA copy number profiles with millions of probes. Second, I will discuss the latent group lasso, an extension of the group lasso to the case where groups can overlap, which enjoys interesting consistency properties and can be helpful for structured feature selection in high-dimensional gene expression data analysis for cancer prognosis. (Joint work with Kevin Bleakley, Guillaume Obozinski, and Laurent Jacob.)
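
For reference, the standard group lasso objective (a textbook formulation, not notation taken from the talk) penalizes the unsquared Euclidean norm of each predefined group of coefficients, which zeroes out whole groups at once:

\[
\hat{\beta} \;=\; \arg\min_{\beta}\ \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda \sum_{g \in \mathcal{G}} w_g\,\lVert \beta_g \rVert_2 ,
\]

where \(\mathcal{G}\) partitions the features into groups, \(\beta_g\) is the coefficient sub-vector for group \(g\), and \(w_g\) is a group weight (commonly \(\sqrt{|g|}\)). The latent group lasso handles overlapping groups by writing \(\beta\) as a sum of latent vectors, each supported on a single group.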


Thursday, February 16th

Approaches to Genome Annotation
Professor Mark Gerstein
Biomedical Informatics, Molecular Biophysics and Biochemistry, and Computer Science
Yale University

A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the bulk of the genome that does not code for canonical genes, concentrating on intergenic features such as TF binding sites, non-coding RNAs (ncRNAs), and pseudogenes (protein fossils). I will describe an overall framework for data integration that brings together different kinds of evidence to annotate features such as binding sites and ncRNAs. Much of this work has been carried out within the ENCODE and modENCODE projects, and I will describe my approach in both human and various model organisms (e.g., worm). I will further explain how many different annotations can be inter-related to characterize the intergenic space, build regulatory networks, and construct predictive models of gene expression from chromatin features and the activity at binding sites.


Thursday, February 23rd

Reconstructing Sparse Genomic Signals
Dr. Or Zuk
Broad Institute

Sparse reconstruction and compressed sensing are popular tools for reconstructing signals in various domains; they exploit the fact that many natural signals are sparse when represented in an appropriate basis.

Although many biological signals are sparse, this framework has been used in genomics in only a few cases. I will describe two applications of sequencing-based sparse reconstruction to genomics.

i. In the first, the goal is to efficiently identify the set of unknown carriers of rare alleles from a population.

ii. In the second, the goal is to reconstruct the identities and frequencies of bacterial species present in a sample.

I will describe the biological problems, the formulation as a sparse recovery problem, and the specific statistical and computational issues arising in each application.
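
As a generic illustration of the sparse-recovery machinery behind application (i), here is a minimal sketch, assuming a toy random pooling design and a plain l1 solver (iterative soft-thresholding, ISTA); it is not the speaker's method, and the design matrix, carrier positions, and parameters are invented for the example.

    import numpy as np

    def ista(A, y, lam=0.5, n_iter=500):
        """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by proximal gradient."""
        x = np.zeros(A.shape[1])
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz constant
        for _ in range(n_iter):
            z = x - step * (A.T @ (A @ x - y))   # gradient step on the quadratic
            x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # shrink
        return x

    # Toy pooled design: 40 pools over 200 individuals, 3 unknown carriers.
    rng = np.random.default_rng(0)
    A = rng.integers(0, 2, size=(40, 200)).astype(float)  # pool membership
    x_true = np.zeros(200)
    x_true[[5, 77, 150]] = 1.0                            # hidden carriers
    x_hat = ista(A, A @ x_true)
    print(np.argsort(x_hat)[-3:])   # indices of the three recovered carriers

The same template carries over to application (ii), with the columns of A indexed by candidate bacterial species and x their (sparse) frequencies.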


Thursday, March 1st

Heterozygote Advantage as a Natural Consequence of Adaptation in Diploids
Dr. Philipp Messer
Department of Biology, Stanford University

Molecular adaptation is typically assumed to proceed by sequential fixation of beneficial mutations. In diploids, this picture presupposes that for most adaptive mutations, the homozygotes have a higher fitness than the heterozygotes. We show that contrary to this expectation, a substantial proportion of adaptive mutations should display heterozygote advantage. This feature of adaptation in diploids emerges naturally from the primary importance of the fitness of heterozygotes for the invasion of new adaptive mutations. We formalize this result in the framework of Fisher's influential geometric model of adaptation. We find that in diploids, adaptation should often proceed through a succession of short-lived balanced states that maintain substantially higher levels of phenotypic and fitness variation in the population compared with classic adaptive walks. In fast-changing environments, this variation produces a diversity advantage that allows diploids to remain better adapted compared with haploids despite the disadvantage associated with the presence of unfit homozygotes. The short-lived balanced states arising during adaptive walks should be mostly invisible to current scans for long-term balancing selection. Instead, they should leave signatures of incomplete selective sweeps, which do appear to be common in many species. Our results also raise the possibility that balancing selection, as a natural consequence of frequent adaptation, might play a more prominent role among the forces maintaining genetic variation than is commonly recognized.
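
The key step can be restated with textbook diploid selection formulas (standard population-genetics results, not the authors' derivation). With genotype fitnesses

\[
w_{aa} = 1, \qquad w_{Aa} = 1 + hs, \qquad w_{AA} = 1 + s,
\]

a new mutation A initially occurs almost exclusively in heterozygotes, so its establishment probability is governed by the heterozygous effect, roughly \(P_{\text{est}} \approx 2hs\) for small \(hs > 0\) (Haldane's approximation), regardless of homozygote fitness. If \(w_{Aa} > w_{AA}\) and \(w_{Aa} > w_{aa}\), the mutation invades but settles at the balanced polymorphic frequency

\[
\hat{p}_A \;=\; \frac{w_{Aa} - w_{aa}}{(w_{Aa} - w_{AA}) + (w_{Aa} - w_{aa})}
\]

rather than sweeping to fixation.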


Thursday, March 8th

Transcript-Level Differential Analysis of RNA-Seq Experiments
Professor Lior Pachter
Departments of Mathematics, Molecular and Cell Biology, and Electrical Engineering and Computer Sciences, UC Berkeley

Current RNA-Seq differential analysis methods focus on tackling one of two major challenges. The first addresses the main issue that has been studied in microarray analysis: accounting for variability across replicates in the signal from the experiments. The second is dissecting the raw data to recover information at transcript-level (as opposed to gene-level) resolution. To our knowledge, no single analysis framework has rigorously addressed both of these problems simultaneously. Most methods for estimating transcript abundances from ambiguously mapped reads are based on multinomial models and use the Expectation-Maximization algorithm or combinatorial optimization methods to maximize the likelihood of the observed reads. Although in some cases these methods do address the differential analysis problem, they ignore the issue of overdispersion due to biological variability, because the multinomial model implies that counts should be (approximately) distributed according to Poisson distributions. This can lead to over-prediction of differentially abundant transcripts and high false positive rates. Attempts to estimate variability in gene expression levels across replicates have focused mainly on the use of the negative binomial distribution to model fragment count variability, but they fail to account for the uncertainty in gene expression levels due to ambiguity in mapping reads among multiple splice variants of each gene.
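
To make the overdispersion point concrete (standard distributional facts; the notation is mine, not the talk's): a multinomial read-sampling model implies approximately Poisson counts, whereas biological replicate-to-replicate variability inflates the variance, which the negative binomial captures through a dispersion parameter \(\alpha > 0\):

\[
\text{Poisson: } \operatorname{Var}(X) = \mu, \qquad \text{negative binomial: } \operatorname{Var}(X) = \mu + \alpha\,\mu^2 .
\]

Testing against a Poisson null when \(\alpha > 0\) understates the true variability and therefore over-calls differential expression.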

Here, we link these two research threads by showing how to model overdispersion in the estimated number of fragments generated by each transcript across replicates. Our methods overcome significant limitations of previous approaches: they not only generate more accurate gene expression estimates but also enable robust transcript-level differential analysis. We explain why ambiguously mapping reads must be taken into account in differential analysis, and we show that our methods are accurate over a wide range of RNA-Seq designs, including those performed on benchtop sequencers such as the MiSeq (Illumina). We also demonstrate the effectiveness of our approach through an experiment exploring the role of a key developmental transcription factor in maintaining adult cell viability. Our methods are implemented in the freely available tool Cuffdiff.

This is joint work with Cole Trapnell, David Hendrickson, Martin Sauvageau and John Rinn.


Thursday, March 22nd

Galaxy Project: Bringing Computation Closer to Biologists
Dr. Madhavan Ganesh
Director, QB3 - Computational Genomics Resource Laboratory, UC Berkeley

With the advent of high-throughput methods, including sequencing, biological research is becoming data intensive. Several tools are being developed, both by instrument vendors and in the open domain, to aid in the analysis of this incoming data stream. Galaxy is an open-source project that allows various tools to be accessed through a graphical interface to run computational tasks and manage data. In this talk, I will look at the usefulness of Galaxy from a biologist’s perspective and discuss some of its drawbacks.


Thursday, April 5th

Relating microRNA and mRNA Expression in Cancer Genomics
Professor Terry Speed
Department of Statistics, UC Berkeley, and Division of Bioinformatics, Walter and Eliza Hall Institute

MicroRNAs (miRNAs) are a class of small non-coding RNAs (~22 nt) that normally function as negative regulators of target mRNA expression at the post-transcriptional level. They bind to the 3'UTR of target mRNAs through base pairing, resulting in target mRNA cleavage or translational inhibition. It has also recently been demonstrated that miRNAs may function as positive regulators in some cases. The human genome may encode over 1,000 miRNAs, which may target about 60% of mammalian genes and are abundant in many human cell types. miRNAs play crucial roles in a variety of biological processes, such as cellular metabolism, cellular signaling, immune response, and development, including proliferation, differentiation, and death.

Cancer is associated with very complex genetic alterations in oncogenes and tumor suppressors. Emerging evidence shows that dysfunction of miRNAs is associated with various cancers. miRNA-mediated tumorigenesis results from either down-regulation of tumor-suppressor genes or up-regulation of oncogenes. Over-expressed miRNAs (oncogenic miRNAs) could potentially target and suppress tumor-suppressor genes, whereas down-regulated miRNAs (tumor-suppressive miRNAs) would theoretically up-regulate oncogenes. This scenario suggests that loss of expression of tumor-suppressor miRNAs may lead to elevated levels of the protein products of target oncogenes. Conversely, over-expression of oncogenic miRNAs may reduce the levels of the protein products of target tumor-suppressor genes. Recent evidence shows that miRNAs contribute to tumor formation and development, suggesting that miRNAs can function as oncogenes or tumor suppressors.

In cancer genomics studies, it is common to measure both mRNA and miRNA abundance, usually together with high-density SNP genotyping and perhaps also methylation and mutation calling, on scores to hundreds of tumors. MicroRNA regulation is likely to underlie much of the variation in mRNA abundance observed across tumor subtypes or between tumor and normal tissue from the same organ. Each miRNA can regulate many genes, and any given gene may be regulated by multiple miRNAs. As a result, the task of determining from mRNA and miRNA expression data which genes are regulated by which miRNAs (i.e., miRNA:mRNA pairs) is challenging. Since several computational algorithms exist that predict which genes are likely to be regulated by any given miRNA, a natural approach to finding miRNA:mRNA pairs is to predict the target genes of each miRNA and see whether these are over-represented among the genes that are differentially expressed in a given context.

There are two difficulties with this approach. First, the target prediction algorithms all have sizeable false positive and false negative rates. One can get around this to some extent by using a number of such algorithms and restricting attention to their common predictions. Second, this approach makes no direct use of the miRNA abundance data. Direct correlation of each of ~20,000 mRNAs with each of ~1,000 miRNAs is not very helpful, as the sheer number of miRNA:mRNA pairs being examined (some 2 x 10^7) raises the background noise level to the point where few genuine pairs will stand out. Our approach makes no use of the target prediction algorithms, but rather puts equal importance on the variation of miRNA and mRNA levels across a set of tumor samples.

We begin by identifying miRNAs that are either down- or up-regulated in tumor compared to the corresponding normal tissue. These are candidate tumor-suppressor or oncogenic miRNAs, respectively. We then recognise the many-to-many relationships between miRNAs and mRNAs in the numbers of miRNA:mRNA pairs we identify, and we use ontology categories and pathways to interpret our results. When we apply this strategy to the datasets on glioblastoma multiforme and serous ovarian cancer generated by The Cancer Genome Atlas, we find many known and several novel candidate oncogenic or tumor-suppressor miRNAs.

This is joint work with Erica Seoae Cho.


Thursday, April 12th

Design and Coverage of Genotyping Arrays Using Imputation and a Hybrid SNP Selection Algorithm
Professor Thomas Hoffmann
Division of Biostatistics, UCSF

Four custom Axiom genotyping arrays were designed for a genome-wide association study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. These four arrays were optimized for individuals of European, East Asian, African American, and Latino race/ethnicity. The former two arrays were designed using a greedy pairwise single nucleotide polymorphism (SNP) selection algorithm. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a hybrid SNP selection method for the latter two arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. We show coverage of the arrays on the 1000 Genomes data.
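
To illustrate the control flow of such a hybrid scheme, here is a minimal sketch; the inputs (tags, a map from each candidate SNP to the set of target SNPs it tags at the chosen r^2 threshold, and covered_by_imputation, a stand-in coverage oracle) and the round sizes are hypothetical, not the actual Axiom design pipeline.

    def hybrid_select(tags, targets, covered_by_imputation,
                      rounds=5, per_round=100):
        """Alternate greedy pairwise tagging with imputation-based pruning."""
        chosen, remaining = set(), set(targets)
        for _ in range(rounds):
            for _ in range(per_round):     # greedy pairwise tagging step
                best = max(tags, key=lambda s: len(tags[s] & remaining),
                           default=None)
                if best is None or not (tags[best] & remaining):
                    break
                chosen.add(best)
                remaining -= tags[best]
            # Drop targets already recoverable by imputation from `chosen`,
            # so later rounds spend probes only on what is still uncovered.
            remaining = {t for t in remaining
                         if not covered_by_imputation(t, chosen)}
            if not remaining:
                break
        return chosen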


Thursday, April 19th

A Flexible Estimating Equations Approach for Mapping Function-Valued Traits
Dr. Hao Xiong
Department of Statistics, UC Berkeley

In genetic studies, many interesting traits, including growth curves and skeletal shape, have temporal or spatial structure. They are better treated as curves or function-valued traits. Identification of genetic loci contributing to such traits is facilitated by specialized methods that explicitly address the function-valued nature of the data. Current methods for mapping function-valued traits are mostly likelihood-based, requiring specification of the distribution and error structure. However, such specification is difficult or impractical in many scenarios. We propose a general functional regression approach based on estimating equations that is robust to misspecification of the covariance structure. Estimation is based on a two-step least-squares algorithm, which is fast and applicable even when the number of time points exceeds the number of samples. It is also flexible due to a general linear functional model; changing the number of covariates does not necessitate a new set of formulas and programs. In addition, many meaningful extensions are straightforward. For example, we can accommodate incomplete genotype data, and the algorithm can be trivially parallelized. The framework is an attractive alternative to likelihood-based methods when the covariance structure of the data is not known. It provides a good compromise between model simplicity, statistical efficiency, and computational speed. We illustrate our method and its advantages using circadian mouse behavioral data.
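
A minimal sketch of a two-step least-squares fit in this spirit (my own toy rendering, assuming a fixed spline or polynomial basis; it is not the speaker's implementation):

    import numpy as np

    def two_step_fit(Y, B, G):
        """Y: (n_samples, n_times) observed trait curves.
           B: (n_times, n_basis) basis functions at the time points.
           G: (n_samples, n_covariates) genotype/covariate design.
           Returns (n_covariates, n_basis) effect curves in basis coords."""
        # Step 1: least-squares projection of each curve onto the basis.
        C = np.linalg.lstsq(B, Y.T, rcond=None)[0].T   # (n_samples, n_basis)
        # Step 2: least-squares regression of coefficients on covariates.
        return np.linalg.lstsq(G, C, rcond=None)[0]    # (n_covariates, n_basis)

    # The fitted effect of covariate j, as a function of time, is B @ beta[j].

Because step 1 operates on each curve separately, the fit remains well defined even when the number of time points exceeds the number of samples.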


Thursday, April 26th

Optimized Oligomer Libraries to Screen Short Synthetic Enhancers in Vivo
Dr. Samantha Riesenfeld
Gladstone Institutes, UC San Francisco

Large-scale annotation efforts have greatly improved our ability to predict regulatory elements throughout the genome. However, it is still not well understood how short, transcription-factor-binding sequences combine to specify complex spatiotemporal patterns. Many top-down approaches, e.g., ChIP-Seq, that are used to predict vertebrate regulatory sequences have drawbacks, such as providing only coarse locations of functional sequences. To address these challenges, we developed a novel bottom-up approach to reverse-engineer regulatory elements and learn how short sequences cooperate to regulate genes. We systematically screened the regulatory potential of all 6-bp sequences in vivo and used the results to design synthetic tissue-specific enhancers. This talk will focus on the combinatorial theory and computational methods that we developed to construct an ultra-compact library of DNA oligomers, which were then screened in 15 tissues and two time points by zebrafish transgenics. I will also highlight a few experimental results and challenges in their statistical analysis.
This work is a collaboration between the Pollard and Ahituv labs at UCSF.
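
The "ultra-compact" packing of all 6-bp sequences is reminiscent of de Bruijn sequence constructions; whether the authors used exactly this device is my assumption, but it conveys the combinatorial idea: a de Bruijn sequence of order 6 over {A, C, G, T} contains every 6-mer exactly once in 4^6 = 4,096 positions (cyclically), versus 6 x 4,096 bases if each 6-mer were synthesized separately.

    def de_bruijn(alphabet, n):
        """Classical recursive (FKM-style) de Bruijn sequence construction."""
        k = len(alphabet)
        a = [0] * (k * n)
        seq = []
        def db(t, p):
            if t > n:
                if n % p == 0:
                    seq.extend(a[1:p + 1])   # emit the next Lyndon word
            else:
                a[t] = a[t - p]
                db(t + 1, p)
                for j in range(a[t - p] + 1, k):
                    a[t] = j
                    db(t + 1, t)
        db(1, 1)
        return "".join(alphabet[i] for i in seq)

    s = de_bruijn("ACGT", 6)
    print(len(s))   # 4096; append s[:5] to read all 6-mers off a linear string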


Thursday, May 3rd

Automated RNA Structure Characterization from High-Throughput Chemical Mapping Experiments
Dr. Sharon Aviran
Department of Mathematics, UC Berkeley

New regulatory roles continue to emerge for both natural and engineered noncoding RNAs, many of which have specific secondary and tertiary structures essential to their function. This highlights a growing need to develop technologies that enable rapid and accurate characterization of structural features within complex RNA populations. Yet reliable structure characterization techniques remain severely limited by technological constraints, while the accuracy of popular computational methods is generally poor. These limitations pose a major barrier to comprehensive determination of structure from sequence, and thereby to the development of a mechanistic understanding of gene regulatory dynamics. To address this need, we have developed a high-throughput structure characterization assay, called SHAPE-Seq, which simultaneously measures quantitative, nucleotide-resolution structural information for hundreds of distinct RNA molecules. SHAPE-Seq combines the novel selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE) chemical mapping technique with next-generation sequencing of its products. We then extract the desired structural information from the sequenced reads using a fully automated algorithmic pipeline that we developed. In this talk, I will review recent developments in RNA structure characterization as well as key advances in sequencing technologies. I will then present SHAPE-Seq, focusing on its automated data analysis methodology, which relies on a novel probabilistic model of the SHAPE-Seq experiment combined with maximum likelihood estimation. I will demonstrate the accuracy and simplicity of our approach, as well as its applicability to a general class of chemical mapping techniques and to more traditional SHAPE experiments that use capillary electrophoresis to identify and quantify primer extension products.
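
As a toy illustration of the kind of estimate such a pipeline produces (a sketch under a simple sequential-stopping model that I am assuming for exposition; the actual SHAPE-Seq likelihood also combines treated and untreated channels):

    import numpy as np

    def stop_probabilities(stop_counts):
        """ML estimate of per-site stop probabilities when reverse
        transcription visits sites in order from the primer and stops
        at site k with probability theta_k. stop_counts[k] = number of
        reads whose extension ended at site k (ordered from the primer
        outward); the estimate is a per-site hazard rate."""
        c = np.asarray(stop_counts, dtype=float)
        reaching = np.cumsum(c[::-1])[::-1]    # reads that got at least to k
        return c / np.maximum(reaching, 1.0)   # stops at k / reads reaching k

    print(stop_probabilities([10, 5, 0, 5]))   # -> [0.5 0.5 0.  1. ]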


Thursday, May 10th

Analyzing Human Illumina 450k Methylation Arrays
Dr. Alicia Oshlack
Bioinformatics, Murdoch Childrens Research Institute, Australia

DNA methylation is the most widely studied epigenetic mark. Methylation is essential for normal development, and disruption of normal epigenetic patterns has been associated with numerous diseases. Recently, Illumina developed a microarray that measures methylation at more than 480,000 loci in the human genome. These arrays come at a fraction of the cost of other genome-wide methylation platforms and therefore enable the study of the human methylome at population scale. Hence, these arrays are being rapidly taken up by the research community. However, the design of these arrays is quite unique, and the best methods for analysis have not yet been identified. Here we present a new method to normalize these arrays and show that it improves results in a variety of ways. We will also briefly outline our current best-practice analysis pipeline for these arrays and point out several areas where new statistical methods need to be developed.