PB HLTH 292, Section
The Landscape of Complex Transcriptome
Dr. Wu Wei
Stanford Genome Technology Center
Genome-wide pervasive transcription has been reported in many eukaryotic organisms, revealing a highly interleaved transcriptome organization that involves hundreds of non-coding RNAs. We have shown that bidirectionality is an inherent feature of promoters, where most non-coding RNAs initiate from nucleosome-free regions (NFRs) associated with the promoters of other genes. We have further provided evidence that antisense expression initiated from bidirectional promoters enables the spreading of regulatory signals from one locus to neighbouring genes. The interleaved organization of transcription and the interactions between overlapping transcripts introduce a new level of regulation. In addition to the complexity that exists across the genome in terms of expressed regions on each strand, there is an extensive hidden layer of complexity surrounding alternative overlapping transcripts at each individual gene in the genome. Our findings have implications for how this diverse transcript usage benefits organism function.
A Systems Biology Approach to Inform Human Health Risk Assessment of Benzene
Dr. Reuben Thomas
Division of Biostatistics, UC Berkeley
Benzene, a ubiquitous environmental hematotoxicant, causes acute myeloid leukemia (AML) and myelodysplastic syndromes and has been associated with lymphoproliferative disorders. Through its metabolites, benzene induces multiple alterations that likely contribute to the leukemogenic process. Biological plausibility for a causal role of benzene in these diseases comes from benzene’s genotoxic effects and toxicity to hematopoietic stem cells or progenitor cells, from which leukemias arise. This is manifested as lowered blood counts (hematotoxicity), even in individuals occupationally exposed to benzene below 1 ppm (the current US permissible exposure limit). Our effort used a systems biology approach, encompassing endpoints thought to be relevant to the leukemogenic process. The analyses utilized blood samples from workers exposed to relatively low levels of benzene to examine the dose-response relationship for blood cell counts, biochemical pathways, and expression of genes in the AML pathway. Multiple statistical analyses, including both parametric and non-parametric models that make minimal assumptions about model structure, were explored. The findings revealed dose-dependent responses below 1 ppm for the AML pathway genes, at both the pathway and gene expression level. In addition, the pattern of response appears to change around the 1 ppm region. These responses are considered in view of hypothesized pathways of metabolism that may only be operative below this level. In all, these analyses aid in elucidating: 1) the adverse effects of benzene at low exposures on pathways and mechanisms relevant to hematotoxicity and leukemia; 2) susceptible populations; and 3) quantitative approaches to estimate the low-dose human health risks of benzene exposure.
Analysis of DNA Methylation in the High-Throughput Sequencing Era
Computer Science Division, UC Berkeley
DNA methylation is a dynamic chemical modification that is abundant on DNA sequences and plays a central role in the regulatory mechanisms of cells. In recent years, high-throughput sequencing technologies have enabled genome-wide annotation of DNA methylation. Coupled with novel computational machinery, these developments shed new light on the biological function of this phenomenon.
In this talk we will present novel methods for the study of DNA methylation on a genome-wide scale, and their use in comparative studies. We will first present a novel statistical algorithm that produces corrected site-specific methylation states, along with the annotation of unmethylated islands, given data from a cost-effective but biased experimental method. We will then discuss the application of this method in a comparative study of genome-wide DNA methylation in three primate species: human, chimpanzee and orangutan, revealing that these species can be distinguished based on differences in their DNA methylation that are independent of the underlying DNA sequence. We will conclude with recent results from a comparative study of DNA methylation in human intragenic regions and will discuss the characterization and correction of a bias present in such analyses.
Group Lasso for Genomic Data
Professor Jean-Philippe Vert
Mines ParisTech and Institut Curie
The group lasso is an extension of the popular lasso regression method that selects predefined groups of features jointly in the context of regression or supervised classification. I will discuss two extensions of the group lasso, motivated by applications in genomic data analysis. First, I will present a new fast method for multiple change-point detection in multidimensional signals, which boils down to a group lasso regression problem and allows the detection of frequent breakpoint locations in DNA copy number profiles with millions of probes. Second, I will discuss the latent group lasso, an extension of the group lasso to the case where groups can overlap, which enjoys interesting consistency properties and can be helpful for structured feature selection in high-dimensional gene expression data analysis for cancer prognosis. (Joint work with Kevin Bleakley, Guillaume Obozinski and Laurent Jacob.)
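To make the idea concrete, the following is a minimal sketch of group lasso fitting by proximal gradient descent (an illustration of the standard technique, not the speaker's implementation; all parameter values are assumptions). The key ingredient is a proximal operator that soft-thresholds each predefined group as a unit, so a whole group of coefficients is either kept or zeroed together:

```python
import numpy as np

def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrink the whole group toward zero,
    setting it exactly to zero when its norm falls below the threshold."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent for
    min_beta 0.5 * ||y - X beta||^2 + lam * sum_g ||beta_g||_2,
    with non-overlapping index groups."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = largest eigenvalue of X'X
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))   # gradient step
        for g in groups:                            # blockwise proximal step
            beta[g] = group_soft_threshold(z[g], step * lam)
    return beta
```

Because the penalty is block-separable, the proximal step decomposes over groups, which is what makes the method scale to large genomic problems.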
Approaches to Genome Annotation
Professor Mark Gerstein
Biomedical Informatics, Molecular Biophysics and Biochemistry, and Computer Science
A central problem for 21st century science is annotating the human
genome and making this annotation useful for the interpretation of
personal genomes. My talk will focus on annotating the bulk of the
genome that does not code for canonical genes, concentrating on
intergenic features such as TF binding sites, non-coding RNAs
(ncRNAs), and pseudogenes (protein fossils). I will describe an
overall framework for data integration that brings together different
evidence to annotate features such as binding sites and ncRNAs. Much
of this work has been carried out within the ENCODE and modENCODE
projects, and I will describe my approach in both human and various
model organisms (e.g. worm). I will further explain how many different
annotations can be inter-related to characterize the intergenic space,
build regulatory networks, and construct predictive models of gene
expression from chromatin features and the activity at binding sites.
Reconstructing Sparse Genomic Signals
Dr. Or Zuk
Sparse Reconstruction and Compressed Sensing are
very popular tools for reconstructing signals in various domains,
by utilizing the fact that many natural signals are sparse, when
represented in an appropriate basis.
Although many biological signals are sparse, this framework
has been used in genomics in only a few cases. I will describe two
applications of sequencing-based sparse reconstruction to genomics.
i. In the first, the goal is to efficiently identify the set of unknown
carriers of rare alleles from a population.
ii. In the second, the goal is to reconstruct the identities and
frequencies of bacterial species present in a sample.
I will describe the biological problems, the formulation as a sparse recovery problem,
and the specific statistical and computational issues arising in each application.
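A toy version of the first application can be phrased as standard l1 sparse recovery: a small number of pooled measurements each mix many individuals, and the sparse vector indicating the unknown carriers is reconstructed from far fewer measurements than individuals. The sketch below is a generic illustration of this formulation, not the speaker's method; the Gaussian pooling design and all parameter values are assumptions. It uses iterative soft-thresholding (ISTA):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=3000):
    """Iterative soft-thresholding for min_x 0.5*||y - Ax||^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + step * (A.T @ (y - A @ x)), step * lam)
    return x

# toy carrier-identification setup: 200 individuals, 3 unknown carriers,
# only 40 pooled measurements
rng = np.random.default_rng(0)
n, m, k = 200, 40, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)    # random pooling weights
x_true = np.zeros(n)
carriers = rng.choice(n, size=k, replace=False)
x_true[carriers] = 1.0                           # carrier indicator vector
y = A @ x_true                                   # noiseless pooled signal

x_hat = ista(A, y, lam=0.02)
```

With 40 measurements of 200 individuals, the 3-sparse carrier vector is recovered essentially exactly; this is the compressed-sensing regime where m need only scale with the sparsity, not with the population size.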
Heterozygote Advantage as a Natural Consequence of Adaptation in Diploids
Dr. Philipp Messer
Department of Biology, Stanford University
Molecular adaptation is typically assumed to proceed by sequential fixation of beneficial mutations. In diploids, this picture presupposes that for most adaptive mutations, the homozygotes have a higher fitness than the heterozygotes. We show that contrary to this expectation, a substantial proportion of adaptive mutations should display heterozygote advantage. This feature of adaptation in diploids emerges naturally from the primary importance of the fitness of heterozygotes for the invasion of new adaptive mutations. We formalize this result in the framework of Fisher's influential geometric model of adaptation. We find that in diploids, adaptation should often proceed through a succession of short-lived balanced states that maintain substantially higher levels of phenotypic and fitness variation in the population compared with classic adaptive walks. In fast-changing environments, this variation produces a diversity advantage that allows diploids to remain better adapted compared with haploids despite the disadvantage associated with the presence of unfit homozygotes. The short-lived balanced states arising during adaptive walks should be mostly invisible to current scans for long-term balancing selection. Instead, they should leave signatures of incomplete selective sweeps, which do appear to be common in many species. Our results also raise the possibility that balancing selection, as a natural consequence of frequent adaptation, might play a more prominent role among the forces maintaining genetic variation than is commonly recognized.
Thursday, March 8th
Transcript-Level Differential Analysis of RNA-Seq Experiments
Professor Lior Pachter
Departments of Mathematics, Molecular and Cell Biology, and Electrical Engineering and Computer Sciences, UC Berkeley
Current RNA-Seq differential analysis methods focus on tackling one of
two major challenges. The first addresses the main issue that has been
studied in microarray analysis: accounting for variability across
replicates in the signal from the experiments. The second is
dissecting the raw data to recover information at transcript-level
(as opposed to gene-level) resolution. To our knowledge, no single
analysis framework has rigorously addressed both of these problems
simultaneously. Most methods for estimating transcript abundances from
ambiguously mapped reads are based on multinomial models and use the
Expectation Maximization algorithm or combinatorial optimization
methods to maximize the likelihood of the observed reads. Although in
some cases these methods do address the differential analysis problem,
they ignore the issue of overdispersion due to biological variability
because the multinomial model implies that counts should be
(approximately) distributed according to Poisson distributions. This
can lead to the over-prediction of differentially abundant transcripts
and high false positive rates. Attempts to estimate variability in
gene expression levels across replicates have focused mainly on the
use of the negative binomial distribution to model fragment count
variability, but they fail to account for the uncertainty in gene
expression levels due to ambiguity in mapping reads among multiple
splice variants of each gene.
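The overdispersion issue is easy to demonstrate numerically: a negative binomial can be simulated as a Gamma-Poisson mixture, and its variance exceeds the Poisson's mean-equals-variance relationship by a term quadratic in the mean. A small numpy sketch (illustrative parameter values only, not data from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, size, n = 100.0, 5.0, 100_000   # mean, NB dispersion ("size"), replicates

# Poisson counts: variance equals the mean
pois = rng.poisson(mu, n)

# negative binomial via a Gamma-Poisson mixture: variance = mu + mu**2 / size
lam = rng.gamma(shape=size, scale=mu / size, size=n)
nb = rng.poisson(lam)
```

Here the Poisson sample variance sits near the mean of 100, while the negative binomial's sits near mu + mu^2/size = 2100; biological replicates exhibit exactly this kind of extra-Poisson spread, which is why a pure multinomial/Poisson model over-predicts differential abundance.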
Here, we link these two research threads by showing how to model
overdispersion in the estimated number of fragments generated by each
transcript across replicates. Our methods overcome significant
limitations of previous methods and generate not only more accurate
gene expression estimates, but enable robust transcript-level
differential analysis. We explain why ambiguously mapping reads must
be taken into account in differential analysis, and we show that our
methods are accurate over a wide range of RNA-Seq designs, including
those performed on benchtop sequencers such as the MiSeq (Illumina).
We also demonstrate the effectiveness of our approach through an
experiment exploring the role of a key developmental transcription
factor in maintaining adult cell viability. Our methods are
implemented in the freely available tool Cuffdiff.
This is joint work with Cole Trapnell, David Hendrickson, Martin
Sauvageau and John Rinn.
Galaxy Project: Bringing Computation Closer to Biologists
Dr. Madhavan Ganesh
Director, QB3 - Computational Genomics Resource Laboratory, UC Berkeley
With the advent of high-throughput methods, including sequencing, biological research is becoming data intensive. Many tools are being developed, both by instrument vendors and in the open-source community, to aid in the analysis of this incoming data stream. Galaxy is an open-source project that allows various tools to be accessed through a graphical interface to run computational tasks and manage data. In this talk, I will look at the usefulness of Galaxy from a biologist’s perspective, as well as some of its drawbacks.
Relating microRNA and mRNA Expression in Cancer Genomics
Professor Terry Speed
Department of Statistics, UC Berkeley, and Division of Bioinformatics, Walter and Eliza Hall Institute
MicroRNAs (miRNAs) are a class of small non-coding RNAs (~22 nt)
which normally function as negative regulators of target mRNA expression
at the post-transcriptional level. They bind to the 3'UTR of target mRNAs
through base pairing, resulting in target mRNA cleavage or translation
inhibition. It has also recently been demonstrated that miRNAs may
function as positive regulators in some cases. The human genome may encode
over 1,000 miRNAs, which may target about 60% of mammalian genes and are
abundant in many human cell types. The miRNAs play crucial roles in a
variety of biological processes, such as cellular metabolism, cellular
signaling, immune response and development, including proliferation,
differentiation, and death.
Cancer is associated with very complex genetic alterations in oncogenes
and tumor suppressors. Emerging evidence shows that dysfunction of miRNAs
is associated with various cancers. The miRNA-mediated tumorigenesis
results from either down-regulation of tumor suppressor genes or
up-regulation of oncogenes. Over-expressed miRNAs (oncogenic miRNAs) could
potentially target or suppress tumor suppressor genes whereas
down-regulated miRNAs (tumor suppressive miRNAs) would theoretically
up-regulate oncogenes. This scenario suggests that loss of expression of
tumor-suppressor miRNAs may lead to elevated levels of the protein
products of target oncogenes. Conversely, over-expression of oncogenic
miRNAs may reduce the levels of protein products of target
tumor-suppressor genes. Recent evidence shows that miRNAs contribute to
tumor formation and development, suggesting that miRNAs can function as
oncogenes or tumor suppressors.
In cancer genomics studies, it is common to measure both mRNA and miRNA
abundance, usually together with high-density SNP genotyping and perhaps
also methylation and mutation calling, on scores to hundreds of tumors.
MicroRNA regulation is likely to underlie much of the variation in mRNA
abundance observed across tumour subtypes or between tumour and normal
tissue from the same organ. Each miRNA can regulate many genes, and any
given gene may be regulated by multiple miRNAs. As a result, the task of
determining from mRNA and miRNA expression data which genes are regulated
by which miRNAs (i.e. miRNA:mRNA pairs) is challenging. Since several
computational algorithms exist that predict which genes are likely to be
regulated by any given miRNA, a natural approach to finding miRNA:mRNA
pairs is to predict the target genes of each miRNA, and see if these are
over-represented in the genes that are differentially expressed in a given
comparison.
There are two difficulties with this approach. First, the target
prediction algorithms all have sizeable false positive and false negative
rates. One can get around this to some extent by using a number of such
algorithms, and restricting attention to the common predictions. But
secondly, this approach makes no direct use of the miRNA abundance data.
Direct correlation of each of ~20,000 mRNAs with each of ~1,000 miRNAs is
not helpful, as the sheer number of miRNA:mRNA pairs being examined raises
the background noise level to the point where few genuine pairs will stand
out. Our approach makes no use of the target prediction algorithms, but rather
puts an equal importance on the variation of miRNA and mRNA levels across
a set of tumors.
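The background-noise point can be made concrete with pure noise: even when every true correlation is zero, screening hundreds of thousands of miRNA:mRNA pairs produces large spurious correlations. A scaled-down sketch (dimensions chosen for illustration, not the TCGA data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_mrna, n_mirna = 50, 2000, 100   # scaled down from ~20,000 x ~1,000

# independent noise profiles: every true miRNA:mRNA correlation is zero
mrna = rng.standard_normal((n_mrna, n_samples))
mirna = rng.standard_normal((n_mirna, n_samples))

# z-score each profile, then one matrix product gives every Pearson correlation
mrna_z = (mrna - mrna.mean(1, keepdims=True)) / mrna.std(1, keepdims=True)
mirna_z = (mirna - mirna.mean(1, keepdims=True)) / mirna.std(1, keepdims=True)
corr = mrna_z @ mirna_z.T / n_samples        # shape (n_mrna, n_mirna)
```

Across these 200,000 null pairs with 50 samples each, the largest absolute correlations comfortably exceed 0.4 by chance alone, so a genuine pair of similar strength would not stand out from the background.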
We begin by identifying miRNAs which are either down- or up-regulated in
tumor compared to the corresponding normal tissue. These are candidate
tumor suppressor or oncogenic miRNAs, respectively. We then account for the
many-to-many relationship between miRNAs and mRNAs in the numbers of
miRNAs and mRNAs we identify, and we use ontology categories and pathways
to interpret our results. When we apply this strategy to the datasets
on glioblastoma multiforme and serous ovarian cancer generated by The
Cancer Genome Atlas, we find many known and several novel candidate
oncogenic or tumor-suppressor miRNAs.
This is joint work with Erica Seoae Cho.
Design and Coverage of Genotyping Arrays Using Imputation and a Hybrid SNP Selection Algorithm
Professor Thomas Hoffmann
Division of Biostatistics, UCSF
Four custom Axiom genotyping arrays were designed for a genome-wide association study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. These four arrays were optimized for individuals of European, East Asian, African American, and Latino race/ethnicity. The former two arrays were designed using a greedy pairwise single nucleotide polymorphism (SNP) selection algorithm. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a hybrid SNP selection method for the latter two arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. We show coverage of the arrays on the 1000 Genomes data.
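A minimal sketch of the greedy pairwise tagging idea underlying the array designs (a generic illustration; the actual Axiom design pipeline is more involved): given a pairwise r² matrix, repeatedly pick the SNP that covers the most still-uncovered target SNPs at a chosen r² threshold.

```python
import numpy as np

def greedy_tag(r2, thresh=0.8):
    """Greedy pairwise tag-SNP selection on a symmetric r^2 matrix
    (diagonal = 1, so every SNP covers itself and the loop terminates)."""
    p = r2.shape[0]
    covered = np.zeros(p, dtype=bool)
    tags = []
    while not covered.all():
        # count, for each candidate, how many uncovered SNPs it would tag
        counts = ((r2 >= thresh) & ~covered).sum(axis=1)
        best = int(np.argmax(counts))
        tags.append(best)
        covered |= r2[best] >= thresh
    return tags
```

The hybrid method described in the abstract alternates rounds of this kind of selection with removal from the target set of SNPs that imputation can already recover, which is more efficient than pairwise tagging alone.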
A Flexible Estimating Equations Approach for Mapping Function-Valued Traits
Dr. Hao Xiong
Department of Statistics, UC Berkeley
In genetic studies, many interesting traits, including growth curves
and skeletal shape, have temporal or spatial structure.
They are better treated as curves or function-valued traits.
Identification of genetic loci contributing to such traits requires
specialized methods that explicitly address the function-valued nature
of the data. Current methods for mapping function-valued traits
are mostly likelihood-based, requiring specification of the
distribution and error structure. However, such specification is
difficult or impractical in many scenarios. We propose a general functional
regression approach based on estimating equations that is robust to
misspecification of the covariance structure. Estimation is based on a
two-step least-squares algorithm, which is fast and applicable
even when the number of time points exceeds the number of samples. It
is also flexible due to a general linear functional model;
changing the number of covariates does not necessitate a new set of
formulas and programs. In addition, many meaningful extensions
are straightforward. For example, we can accommodate incomplete
genotype data, and the algorithm can be trivially parallelized. The
framework is an attractive alternative to likelihood-based methods
when the covariance structure of the data is not known. It provides
a good compromise between model simplicity, statistical efficiency,
and computational speed. We illustrate our method and its
advantages using circadian mouse behavioral data.
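In the simplest setting, the least-squares idea reduces to ordinary least squares applied at every time point, yielding a time-varying effect curve without specifying any distribution or covariance structure. The sketch below is a simplified illustration of that spirit on simulated data, not the authors' full estimating-equations procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 100, 40                        # samples and time points
t = np.linspace(0.0, 1.0, T)
g = rng.integers(0, 3, n)             # genotype at one locus, coded 0/1/2
beta_t = np.sin(2 * np.pi * t)        # true time-varying genetic effect
y = 1.0 + np.outer(g, beta_t) + 0.5 * rng.standard_normal((n, T))

# pointwise least squares: solve [1, g] @ coef = y[:, j] at every time point j
X = np.column_stack([np.ones(n), g.astype(float)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # coef has shape (2, T)
beta_hat = coef[1]                             # estimated effect curve
```

Because the fit is a single multi-response least-squares solve, it remains applicable when the number of time points exceeds the number of samples, and adding covariates only widens the design matrix.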
Optimized Oligomer Libraries to Screen Short Synthetic Enhancers in Vivo
Dr. Samantha Riesenfeld
Gladstone Institutes, UC San Francisco
Large-scale annotation efforts have greatly improved our ability to predict regulatory elements throughout the genome. However, it is still not well understood how short, transcription-factor-binding sequences combine to specify complex spatiotemporal patterns. Many top-down approaches, e.g., ChIP-Seq, that are used to predict vertebrate regulatory sequences have drawbacks, such as providing only coarse locations of functional sequences. To address these challenges, we developed a novel bottom-up approach to reverse-engineer regulatory elements and learn how short sequences cooperate to regulate genes. We systematically screened the regulatory potential of all 6-bp sequences in vivo and used the results to design synthetic tissue-specific enhancers. This talk will focus on the combinatorial theory and computational methods that we developed to construct an ultra-compact library of DNA oligomers, which were then screened in 15 tissues and two time points by zebrafish transgenics. I will also highlight a few experimental results and challenges in their statistical analysis.
This work is a collaboration between the Pollard and Ahituv labs at UCSF.
Automated RNA Structure Characterization from High-Throughput Chemical Mapping Experiments
Dr. Sharon Aviran
Department of Mathematics, UC Berkeley
New regulatory roles continue to emerge for both natural and engineered
noncoding RNAs, many of which have specific secondary and tertiary
structures essential to their function. This highlights a growing need to
develop technologies that enable rapid and accurate characterization of
structural features within complex RNA populations. Yet reliable structure
characterization techniques remain severely limited by technological
constraints, while the accuracy of popular computational methods is
generally poor. These limitations pose a major barrier to
comprehensive determination of structure from sequence and thereby to the
development of mechanistic understanding of gene regulatory dynamics.
To address this need, we have developed a high-throughput structure
characterization assay, called SHAPE-Seq, which simultaneously measures
quantitative nucleotide-resolution structural information for hundreds of
distinct RNA molecules. SHAPE-Seq combines the novel selective 2'-hydroxyl
acylation analyzed by primer extension (SHAPE) chemical mapping technique
with next-generation sequencing of its products. We then extract the desired
structural information from the sequenced reads using a fully automated
algorithmic pipeline that we developed. In this talk, I will review recent
developments in RNA structure characterization as well as key advances in
sequencing technologies. I will then present SHAPE-Seq, focusing on its
automated data analysis methodology, which relies on a novel probabilistic
model of the SHAPE-Seq experiment, coupled with maximum likelihood
estimation. I will demonstrate the accuracy and simplicity of our approach
as well as its applicability to a general class of chemical mapping
techniques and to more traditional SHAPE experiments that use capillary
electrophoresis to identify and quantify primer extension products.
Analyzing Human Illumina 450k Methylation Arrays
Dr. Alicia Oshlack
Bioinformatics, Murdoch Childrens Research Institute, Australia
DNA methylation is the most widely studied epigenetic mark. Methylation is essential for normal development, and disruption of normal epigenetic patterns has been associated with numerous diseases. Recently, Illumina has developed a microarray to measure methylation across more than 480,000 loci in the human genome. These arrays come at a fraction of the cost of other genome-wide methylation platforms and therefore enable the study of the human methylome at a population scale. Hence, these arrays are being rapidly taken up by the research community. However, the design of these arrays is unusual, and the best methods for analysis have not yet been identified. Here we present a new method to normalize these arrays and show that it improves results in a variety of ways. We will also briefly outline our current best-practice analysis pipeline for analyzing these arrays and point out several areas where new statistical methods need to be developed.