PB HLTH 292, Section 020
Statistics and Genomics Seminar

Spring 2010



Thursday, January 21st

Promoter Annotations in D. mel.
Dr. Ben Brown
Department of Statistics, UC Berkeley

CAGE (Cap Analysis of Gene Expression) and RACE (Rapid Amplification of cDNA Ends) assays have been performed in the model organism D. mel. Both of these assays are intended to identify the promoters of actively transcribed genes. The RACE assay is targeted: primers are constructed to detect a particular promoter, and if that promoter is active, each of the transcription start sites (TSSs) within it will be detected. RACE, as it has been performed by the Celniker group at LBL, is also "expression-normalized". This means that approximately the same number of reads will be sequenced for each target promoter, regardless of local transcription frequency. Hence, with RACE, the shape of the promoter, whether there is one punctate TSS or a wide distribution of potential TSSs, is of chief interest. CAGE, on the other hand, is an unbiased exploratory assay, and its signal should, in theory, be proportional to local transcription frequency. In this talk I will discuss an integrative analysis of both assays, along with RNA-seq performed on the same sample. I will outline our efforts to generate a more complete annotation and classification of promoters in D. mel and present preliminary results.


Thursday, January 28th

Toward Meaningful Whole-Genome Interpretation with Open Access Tools From the Genome Commons
Dr. Reece Hart
Chief Scientist, Genome Commons, UC Berkeley

The widespread availability of personal genomic data is imminent, yet we are ill-prepared to reap the full personal, scientific, clinical, and social benefits of these data. Among the many barriers to holistic genome interpretation, four are prominent: 1) isolation of genotype and phenotype data; 2) lack of tools that are easily interoperable; 3) insufficient scientific methods for the analysis of variants; and 4) unsettled ethical and social policy issues. The Genome Commons is a nascent collaboration among faculty from UC Berkeley and UC San Francisco that is developing freely available databases, analytical tools, and scientific methods for the interpretation of human genomic data. Our collaborators bring expertise in computational biology, computer science, statistics, clinical genetics, and ethics. While scientific utility is our immediate goal, we envision that the Genome Commons will provide a foundation for clinical tools and a repository for studies of human variation. In this talk, I will introduce the Genome Commons, describe our preliminary results, and present the outlook for this project.


Thursday, February 4th

Everything you wanted to know about Illumina sequencing (but were afraid to ask)
Dr. Leath Tonkin
Manager, Vincent J. Coates Genomic Sequencing Laboratory, UC Berkeley

Illumina sequencing has taken the scientific world by storm, quickly displacing microarrays for ChIP and RNA expression by delivering unprecedented amounts of short-read sequencing data at low cost and with relative ease of preparation. I'll take you through the inner workings of the Illumina technology, from library prep to final output, discussing pitfalls that require consideration when analyzing the sequencing data.


Thursday, February 11th

Semi-supervised learning from temporal and spatial patterns of gene expression in Drosophila
Professor Alexander Schliep
Department of Computer Science, Rutgers, The State University of New Jersey

Gene expression measurements during the development of the fly Drosophila melanogaster are routinely used to find functional modules of temporally co-expressed genes. Complementary large data sets of in situ RNA hybridization images for different stages of the fly embryo elucidate the spatial expression patterns.

Using a semi-supervised approach, constrained clustering with mixture models, we can find clusters of genes exhibiting spatio-temporal similarities in expression, or syn-expression. The temporal gene expression measurements are taken as primary data for which pairwise constraints are computed in an automated fashion from raw in situ images without the need for manual annotation. We investigate the influence of these pairwise constraints in the clustering and discuss the biological relevance of our results.

The advantage of automatic analysis of images is two-fold: it is more efficient than manual annotation with anatomical terms describing territories of gene expression and it can handle gradients.


Thursday, February 18th

Data mining with biomaRt
Dr. Steffen Durinck
Life Sciences Division, Lawrence Berkeley National Laboratory

Comprehensive analysis of data generated from high-throughput biological experiments involves integration of a variety of information that can be retrieved from public databases. A simple example is to annotate a set of features that are found differentially expressed in a microarray experiment with corresponding gene symbols and genomic locations. Most public databases provide access to their data via web browsers. However, a major remaining bioinformatics challenge is how to access these biological data efficiently from within a data analysis environment. BioMart is a generic, query-oriented data management system, capable of integrating distributed data resources. It is developed at the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL). biomaRt is a software package aimed at integrating data from BioMart systems into R, enabling biological database mining.

Slides: PDF


Thursday, February 25th

Using data driven models to reveal structure in a metazoan transcriptional network
Matthew Davis
Department of Molecular and Cell Biology, UC Berkeley

Gene expression in the early embryo of the fruit fly Drosophila melanogaster is governed primarily by transcriptional regulation. Though hundreds of embryonically transcribed genes are known to be regulated in complex spatiotemporal patterns, only a handful of examples exist to illustrate the regulatory structure underlying these patterns. My research explores a large spatiotemporal atlas of gene expression with a combination of models to learn gene expression relationships in the fly embryo. I will summarize my work to date, discuss some predictions currently being tested, and discuss the development of a general model to predict gene expression given different DNA sequence contexts.


Thursday, March 4th

DNA unknotting, topoisomerases and bacteriophages
Professor Mariel Vazquez
Department of Mathematics, San Francisco State University

Type II topoisomerases simplify DNA knots and links by performing strand-passage on DNA strands. Experimental studies have shown that these enzymes simplify the topology of DNA very efficiently; however, the key to this efficiency is yet to be revealed. Motivated by these experimental observations, we study random transitions of knotted polygonal chains of fixed length. We use Monte Carlo computer simulations and computational knot theory methods to model strand-passage, with and without topological biases. Unknotting patterns can assist with knot identification. We propose to apply these methods in the study of the DNA knots extracted from bacteriophage P4 capsids.

This project is funded by NIH MBRS SCORE grant S06 GM052588.


Thursday, March 11th

Structured priors for supervised learning in computational biology
Dr. Laurent Jacob
Genentech Innovation Fellow, Center for Computational Biology and Department of Statistics, UC Berkeley

Supervised learning methods are used to build functions which accurately predict the behavior of new objects from observed data. They are therefore extremely useful in several computational biology problems, where they can exploit the increasing amount of empirical data generated by high-throughput technologies, or the accumulation of experimental knowledge in public databases.

In several cases however, the amount of training data is not sufficient to deal with the complexity of the learning problem. Fortunately this type of ill-posed problem is not new in statistics and statistical machine learning. It is classically addressed using regularization approaches, or equivalently using a prior on what the function should be like. In this work, we build on this principle and propose new regularization methods based on biological prior knowledge for each problem.

In the context of in silico vaccine and drug design, we show how, using the knowledge that similar targets bind similar ligands, one can dramatically improve the prediction accuracy for targets with few known ligands, and even make predictions for targets with no known ligand. In the context of outcome prediction from molecular data, we propose a regularization function which leads to a sparse vector whose support is typically a union of potentially overlapping groups of genes defined a priori, e.g., pathways, or a set of genes which tend to be connected to each other when a graph reflecting biological information is given.

Keywords: Virtual screening, microarray data analysis, supervised learning, multi-task learning, structured sparsity.


Thursday, April 1st

Statistical Modeling of RNA-Seq Data
Dr. Hui Jiang
Department of Statistics, Stanford University

In mammalian cells, RNAs can have highly similar sequences yet encode proteins with remarkably different functional roles. Isoforms of a gene are an example of a collection of such RNAs. Accumulating evidence suggests that a key factor characterizing cell function in mammals is differential isoform expression. Quantifying differences in cellular abundance of isoforms is therefore of significant biological interest. Ultra High Throughput Sequencing (UHTS) is an emerging technology which promises to become as powerful, popular and cost-effective as current microarray technology, or more so, for estimating gene expression, particularly at the level of isoforms. This talk will introduce a statistical model for estimating isoform abundance from UHTS data, and illustrate its intuitive minimal sufficient statistics and computationally feasible implementation. Time permitting, we will describe extensions of this work including how Fisher information can be used to quantify statistical gains from using a paired-end RNA-Seq protocol. This is joint work with Julia Salzman and Wing Hung Wong.


Thursday, April 8th

Transcriptional Profiling of the Olfactory Stem Cell
Professor John Ngai
Department of Molecular and Cell Biology, UC Berkeley

Tissue regeneration is a complex process that requires the coordination of stem cell activation, proliferation and differentiation to maintain or repair the structure. The olfactory epithelium (OE) is a sensory neuroepithelium whose constituent cell types, including the olfactory sensory neurons, are continuously replaced during the lifetime of the animal. Following severe injury resulting in the loss of mature cell types, the OE is repopulated through the proliferation and differentiation of adult tissue stem cells and other progenitor cells. The regenerative capacity and limited number of cell types in the OE make it an excellent model for investigating stem cell maintenance, proliferation and differentiation in vivo. Previous studies have identified the horizontal basal cell (HBC) as the multipotent neural stem cell of the OE; however, the molecules and pathways regulating this adult tissue stem cell are unknown. We have used microarray-based whole genome transcriptome profiling of FACS-purified HBCs as an approach toward identifying the genes and genetic networks regulating olfactory stem cell function. In parallel we have also analyzed and identified the complement of miRNAs expressed in these cells. Together these studies are expected to provide insights into the transcriptional and post-transcriptional programs underlying the regulation of olfactory stem cell self-renewal, proliferation and differentiation.


Thursday, April 15th

Three Problems in Statistical Genomics
Professor Adam B. Olshen
Department of Epidemiology and Biostatistics and HDFC Cancer Center, University of California, San Francisco

I will discuss three problems in statistical genomics that are connected by the use of high resolution arrays. The first is discovering subtypes based on multiple types of genomic data. The second is distinguishing primary tumors from metastases utilizing copy number data. The third is estimating parent-specific copy number from SNP arrays. The resulting methodologies will be demonstrated on cancer data sets.


Thursday, April 22nd

Differential expression analysis for sequence count data
Dr. Wolfgang Huber
Group Leader, European Molecular Biology Laboratory, Heidelberg, Germany

Motivation: High-throughput nucleotide sequencing provides quantitative readouts in assays for RNA expression (RNA-Seq), protein-DNA binding (ChIP-Seq), and cell counting. Statistical inference of differential signal in these data needs to take into account their natural variability throughout the dynamic range. When the number of replicates is small, error modeling is needed to achieve statistical power.

Results: We propose an error model that uses the negative binomial distribution, with variance and mean linked by local regression, to model the null distribution of the count data. The method controls type-I error and provides good detection power.

Availability: A free open-source R/Bioconductor software package, called "DESeq", is available from http://www-huber.embl.de/users/anders/DESeq
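
As a rough illustration of the error model described under Results, the sketch below shows how a mean and a variance can be turned into a negative binomial distribution for a count. It is not the DESeq implementation: the function names are hypothetical, and the variance is simply supplied by the caller, whereas the abstract describes obtaining it from the mean by local regression.

```python
import math


def nb_params(mean, variance):
    """Convert a (mean, variance) pair to negative binomial (size, prob).

    Hypothetical helper: uses the parameterization with
    mean = size * (1 - prob) / prob and variance = mean / prob.
    """
    if mean <= 0 or variance <= mean:
        raise ValueError("need 0 < mean < variance (overdispersed counts)")
    size = mean * mean / (variance - mean)
    prob = mean / variance
    return size, prob


def nb_pmf(k, size, prob):
    """Probability of observing count k under NB(size, prob)."""
    log_p = (
        math.lgamma(k + size)
        - math.lgamma(size)
        - math.lgamma(k + 1)
        + size * math.log(prob)
        + k * math.log1p(-prob)
    )
    return math.exp(log_p)


def upper_tail(k_obs, mean, variance):
    """P(K >= k_obs) under the null NB with the given mean and variance."""
    size, prob = nb_params(mean, variance)
    below = sum(nb_pmf(k, size, prob) for k in range(k_obs))
    return max(0.0, 1.0 - below)
```

For example, `upper_tail(30, mean=10.0, variance=30.0)` gives the probability of seeing 30 or more reads when the null model expects 10 with a variance of 30. A Poisson model would fix the variance at 10 and assign such a count a much smaller probability.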


Thursday, May 6th

Post-transcriptional Regulation of microRNA Biogenesis
Professor Jun S. Song
Department of Biostatistics and Epidemiology, Institute for Human Genetics, UCSF

MicroRNAs are short non-coding RNAs which inhibit the translation of mRNAs. MicroRNAs are processed by enzymes in multiple steps, and many unknown factors may regulate the biogenesis. This talk will describe two new mechanisms that can regulate the processing of microRNAs. I will also discuss the computational challenges involved in analyzing microRNA expression profiles.