PB HLTH 292, Section 020
Statistics and Genomics Seminar


Spring 2011


Thursday, January 20th

Mutation, Copy Number and LOH in Cancer Genomes via Next Generation Sequencing
Dr. Peter M. Haverty
Genentech

Although previous studies have identified important common somatic mutations in lung cancers, they have primarily focused on a limited set of genes and have thus provided a constrained view of the mutational spectrum. Here we present the complete sequences of a primary lung tumour (60x coverage) and adjacent normal tissue (46x).

Comparing the two genomes, we identify a wide variety of somatic variations, including >50,000 high-confidence single nucleotide variants. While many somatic mutations with oncogenic potential were detected, we also observed a distinct pattern of selection against mutations within expressed genes, compared to non-expressed genes, and in promoter regions up to 5 kilobases upstream of all protein-coding genes. Analysis of sequencing read frequencies across the genome revealed aspects of DNA copy number alterations and Loss-of-Heterozygosity not detectable by SNP Array.



Thursday, January 27th

Using Control Genes to Correct for Unwanted Variation in Microarray Data
Johann Gagnon-Bartsch
Department of Statistics, UC Berkeley

In studies using microarray data, measured gene expression levels are associated both with factors we are interested in (e.g. treatment/control) and also with irrelevant factors (e.g. sources of systematic technical error). We would like to adjust for the unwanted variation. Several authors have proposed variants of factor analysis to identify the irrelevant factors. The main problem with this approach is that we may "over-correct" and remove some of the interesting biology. To avoid this problem we propose various methods that make use of control genes -- genes known a priori to be unassociated with the factor of interest. We present some of our methods, along with their relative strengths and weaknesses.
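
As a rough illustration of the control-gene idea, the sketch below estimates a single unwanted factor from control genes and then adjusts one gene's treatment effect for it. This is a simplified, one-factor toy and not necessarily any of the methods presented in the talk. The function names, the toy data, and the choice to average the centered control genes are all assumptions made for illustration.

```python
def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def estimate_unwanted_factor(control_genes):
    """Assumed estimator: per-sample average of the centered control genes.

    Control genes are taken to be unaffected by the factor of interest,
    so any pattern shared across them is attributed to unwanted variation.
    """
    centered = [center(gene) for gene in control_genes]
    n_samples = len(centered[0])
    return [sum(g[s] for g in centered) / len(centered) for s in range(n_samples)]


def residualize(v, w):
    """Remove from centered v its least-squares projection onto centered w."""
    coef = dot(v, w) / dot(w, w)
    return [a - coef * b for a, b in zip(v, w)]


def naive_effect(y, x):
    """Difference in mean expression between x == 1 and x == 0 samples."""
    treated = [yi for yi, xi in zip(y, x) if xi == 1]
    untreated = [yi for yi, xi in zip(y, x) if xi == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)


def adjusted_effect(y, x, w):
    """Coefficient of x in a least-squares fit of y on x and w (with intercept)."""
    yc, xc, wc = center(y), center(x), center(w)
    ry, rx = residualize(yc, wc), residualize(xc, wc)
    return dot(rx, ry) / dot(rx, rx)


# Hypothetical toy data: 8 samples, batch effect partly confounded with treatment.
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
batch = [0, 0, 0, 1, 1, 1, 1, 0]
control_genes = [
    [5.0 + 1.0 * b for b in batch],
    [2.0 + 2.0 * b for b in batch],
    [7.0 + 0.5 * b for b in batch],
]
# Gene of interest: built-in treatment effect 1.0 plus a batch effect of 3.0.
gene = [3.0 * b + 1.0 * t for b, t in zip(batch, treatment)]

w_hat = estimate_unwanted_factor(control_genes)
print("naive effect:   ", naive_effect(gene, treatment))
print("adjusted effect:", adjusted_effect(gene, treatment, w_hat))
```

In this noise-free toy the naive group difference absorbs part of the batch effect, while the adjusted estimate recovers the built-in effect of 1.0. With real data the unwanted factors are typically multi-dimensional and noisy, which is where the over-correction concern raised in the abstract comes in.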


Thursday, February 3rd

Genome Surveillance by Small RNAs
Professor Kathleen Collins
Department of Molecular and Cell Biology, UC Berkeley

Argonaute proteins carry small RNAs (sRNAs) that confer sequence specificity for gene and genome regulation. The single-celled protozoan Tetrahymena encodes numerous Argonaute proteins exclusively of the Piwi clade otherwise found in animal germline and stem cells. Deep sequencing of Twi-bound and total sRNAs in strains disrupted for various RNA silencing machinery components revealed an unanticipated diversity of sRNA classes. Altogether, Twis distinguish sRNAs derived from loci of pseudogene families, virus-like structured RNAs, or complementary protein-coding transcripts. We are investigating the significance of maintaining these RNA 'codes' for different deleterious types of transcripts, studying their roles in determining gene expression and genome structure.


Thursday, February 17th

Correlations of ChIP-Seq Peaks and Other Genomic Signals
Professor Niels Richard Hansen
Department of Mathematical Sciences, University of Copenhagen

One main question when analyzing a positional genomic signal, such as peaks called from ChIP-seq data, is how the signal correlates with other signals or genome annotations. Judging from the literature, this appears to be a non-trivial question from a methodological point of view, with no existing gold standard. However, turning to the spatial statistics literature, standard measures such as Ripley's K-function are found to be useful.

In this talk we will first show how to establish a useful measure of correlation closely related to Ripley's K-function and a simple, simulation-based assessment of statistical significance. Second, we ask whether we are really interested in the marginal correlations, or whether we would like to measure partial correlations instead. If so, we can in some situations establish a simple partial correlation measure and a similarly simple estimator, but in general we propose a model-based approach and estimation based on penalized MLE.
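
To make the K-function idea concrete, here is a minimal sketch of a naive one-dimensional cross-K estimate between two sets of genomic positions, with a simulation-based p-value. It is not the speaker's measure: the estimator ignores edge effects, the null model (uniformly placed positions on a single sequence of known length) is an assumption, and all names are hypothetical.

```python
import bisect
import random


def cross_k(a, b, r, length):
    """Naive 1D cross-K: length / (|a| * |b|) * #{(x, y) : |x - y| <= r}.

    Under independent, uniformly scattered positions this is roughly 2 * r
    (edge effects are ignored in this sketch).
    """
    b_sorted = sorted(b)
    pairs = 0
    for x in a:
        lo = bisect.bisect_left(b_sorted, x - r)
        hi = bisect.bisect_right(b_sorted, x + r)
        pairs += hi - lo
    return length * pairs / (len(a) * len(b))


def simulated_p_value(a, b, r, length, n_sim=199, seed=0):
    """One-sided Monte Carlo p-value: re-place `a` uniformly, keep `b` fixed."""
    rng = random.Random(seed)
    observed = cross_k(a, b, r, length)
    exceed = 0
    for _ in range(n_sim):
        fake = [rng.uniform(0, length) for _ in a]
        if cross_k(fake, b, r, length) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_sim)


# Hypothetical example: "peaks" placed within 50 bp of each annotated site.
rng = random.Random(42)
genome_length = 1_000_000
sites = [rng.uniform(0, genome_length) for _ in range(200)]
peaks = [s + rng.uniform(-50, 50) for s in sites]

k_obs, p = simulated_p_value(peaks, sites, r=100, length=genome_length)
print("observed K(100):", k_obs, " null expectation ~", 2 * 100, " p =", p)
```

Holding one signal fixed and re-placing the other uniformly is the simplest possible null. Real genomic signals are far from uniform, so a more careful null model would be needed in practice.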



Thursday, February 24th

Simple Gene Estimates from RNA-Seq
Professor Elizabeth Purdom
Department of Statistics, UC Berkeley

Sequencing technology is now the platform of choice for many researchers trying to quantify expression of mRNA. Sequencing data offers a great deal of additional specificity over comparable microarray platforms. However, sequencing data comes with a large computational overhead. Furthermore, the presence of alternative splicing, found in most higher organisms, adds enormous complexity to analyzing the data, invalidating most simple methods of analyzing the sequencing data. Currently the primary approach to dealing with alternative splicing is to explicitly estimate the expression levels of individual isoforms, which requires either a known isoform annotation or one estimated from the data. In addition, the resulting estimates of isoform abundance do not provide a convenient summary of the data for further model checking or quality control. We will discuss our work in formulating estimation procedures that rely on simple count summaries of the data. In particular, we will present approaches for annotation-free estimates of gene expression levels, and if time permits touch on options for addressing alternative splicing.


Thursday, March 3rd

Modeling Diversity in Tumor Populations
Professor Rick Durrett
Department of Mathematics, Duke University

Heterogeneity of cancer cell populations makes treatment difficult because most drugs target one particular mutational change. In this talk I will discuss a branching process model in which mutations make random changes in the birth rates, in order to study the variation among the cells within a single tumor. We have results for the asymptotic growth rates of the population as well as for two commonly used measures of diversity. We get surprisingly explicit conclusions thanks to old results for one-sided stable laws.

(Joint work with J. Foo, K. Leder and F. Michor at the Dana-Farber Cancer Institute, and former Cornell postdoc J. Mayberry, now at U. of the Pacific.)
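
For readers who want to experiment, the following is a minimal simulation sketch of a branching process in which mutations randomly change the birth rate. It is only a toy under stated assumptions and not the model analyzed in the talk: the uniform birth-rate gain, the parameter values, and the use of Simpson's index as the diversity measure are all choices made here for illustration, and only the sequence of birth and death events is simulated, not their timing.

```python
import random


def simulate_clones(n_events=5000, n_start=100, birth=1.0, death=0.5,
                    mu=0.01, max_gain=0.5, seed=1):
    """Simulate the jump chain of a multi-type birth-death process.

    Each clone has its own birth rate; all clones share one death rate.
    At a division, with probability `mu` one daughter founds a new clone
    whose birth rate is the parent's plus a Uniform(0, max_gain) gain
    (an assumed mutation-effect distribution).
    """
    rng = random.Random(seed)
    rates = [birth]      # birth rate of each clone
    sizes = [n_start]    # current number of cells in each clone
    for _ in range(n_events):
        if sum(sizes) == 0:
            break  # population went extinct
        # The next event comes from clone i with probability proportional
        # to (number of cells) * (birth rate + death rate).
        weights = [n * (b + death) for n, b in zip(sizes, rates)]
        i = rng.choices(range(len(sizes)), weights=weights)[0]
        if rng.random() < rates[i] / (rates[i] + death):
            if rng.random() < mu:
                # Mutant daughter starts a new clone; parent clone size is unchanged.
                rates.append(rates[i] + rng.uniform(0.0, max_gain))
                sizes.append(1)
            else:
                sizes[i] += 1
        else:
            sizes[i] -= 1
    return rates, sizes


def simpson_index(sizes):
    """Probability that two cells drawn with replacement share a clone."""
    total = sum(sizes)
    return sum((n / total) ** 2 for n in sizes)


rates, sizes = simulate_clones()
print("clones:", len(sizes), " cells:", sum(sizes),
      " Simpson index:", simpson_index(sizes))
```

Rerunning with different values of `mu` and `max_gain` gives a feel for how the mutation rate and effect size shape the clone-size distribution.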



Thursday, March 10th

Functional Genomics Employing the Vertebrate Model Organism Zebrafish Danio rerio
Professor Su Guo
Department of Bioengineering and Therapeutic Sciences, UCSF

Its small size, high fecundity, rapid development, and transparency have made the zebrafish Danio rerio an important model organism for functional studies of the vertebrate genome. In this talk, I will discuss our work on: 1) functional gene discovery using induced mutations; 2) analysis of the genome to identify tissue- and cell-type specific enhancers.


Thursday, March 31st

Gene Expression Profiles from Formalin Fixed Paraffin Embedded Breast Cancer Tissue Are Largely Comparable to Fresh Frozen Matched Tissue
Dr. Lorenza Mittempergher
Departments of Pathology and Laboratory Medicine, UCSF

Formalin Fixed Paraffin Embedded (FFPE) samples represent a valuable resource for cancer research. However, the discovery and development of new cancer biomarkers often requires fresh frozen (FF) samples. Recently, the Whole Genome (WG) DASL (cDNA-mediated Annealing, Selection, extension and Ligation) assay was specifically developed to profile FFPE tissue. However, a thorough comparison of data generated from FFPE RNA and FF RNA using this platform is lacking. To this end we profiled, in duplicate, 20 FFPE tissues and 20 matched FF tissues and evaluated the concordance of the DASL results from FFPE and matched FF material. We show that after proper normalization, all FFPE and FF pairs exhibit a high level of similarity (Pearson correlation > 0.7), significantly larger than the similarity between non-paired samples. Interestingly, the probes showing the highest correlation had a higher percentage G/C content and were enriched for cell cycle genes. Predictions of gene expression signatures developed on frozen material (Intrinsic subtype, Genomic Grade Index, 70 gene signature) showed a high level of concordance between FFPE and FF matched pairs. Interestingly, predictions based on a 60 gene DASL list (best match with the 70 gene signature) showed very high concordance with the MammaPrint® results. We demonstrate that data generated from FFPE material with the DASL assay, if properly processed, are comparable to data extracted from the FF counterpart. Specifically, gene expression profiles for a known set of prognostic genes for a specific disease are highly comparable between the two conditions. This opens up the possibility of using both FFPE and FF material in gene expression analyses, leading to a vast increase in the potential resources available for cancer research.
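
The paired versus non-paired comparison described above can be sketched as follows. The data here are simulated, not the study's: the number of tissues and probes, the noise level, and the assumption that different tissues have independent profiles are all invented for illustration.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def paired_vs_unpaired(ffpe, ff):
    """Correlate every FFPE profile with every FF profile.

    Profiles at the same index come from the same tissue (paired);
    all other combinations are non-paired.
    """
    paired, unpaired = [], []
    for i, a in enumerate(ffpe):
        for j, b in enumerate(ff):
            (paired if i == j else unpaired).append(pearson(a, b))
    return paired, unpaired


# Simulated stand-in data: 5 tissues, 200 probes, independent measurement noise.
rng = random.Random(0)
n_tissues, n_probes = 5, 200
truth = [[rng.gauss(0, 1) for _ in range(n_probes)] for _ in range(n_tissues)]
ffpe = [[t + rng.gauss(0, 0.4) for t in profile] for profile in truth]
ff = [[t + rng.gauss(0, 0.4) for t in profile] for profile in truth]

paired, unpaired = paired_vs_unpaired(ffpe, ff)
print("lowest paired correlation:     ", min(paired))
print("highest non-paired correlation:", max(unpaired))
```

In real expression data, different tissues share much of their profile, so non-paired correlations would sit well above zero. The comparison of interest is the gap between the paired and non-paired distributions.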


Thursday, April 7th

Characterizing cancer progression from the tumor cells and their microenvironment
Dr. Franck Rapaport
Computational Biology Center, Memorial Sloan Kettering Cancer Center

Cancer progression is often driven by an accumulation of genetic changes but also accompanied by increasing genomic instability. These processes lead to a complicated landscape of copy number alterations (CNAs) within individual tumors and great diversity across tumor samples. High resolution copy number profiling is being used to profile CNAs of ever larger tumor collections, and better computational methods for processing these data sets and identifying potential driver CNAs are needed. We designed new methods that exploit genomic level correlations in copy number profiles to discover subsets of samples that display common CNAs. In addition to alterations in the tumor cells themselves, the host microenvironment can play an important role in tumor development. In order to dissect and model the complex and reciprocal interplay between the tumor and stromal cells of the microenvironment, we devised an experimental and computational strategy to enable the simultaneous analysis of tumor and stromal genes in metastatic tumors from three distinct microenvironments.


Thursday, April 28th

Concurrent sequencing of human cancers
Dr. Barry S. Taylor
Memorial Sloan-Kettering Cancer Center and Visiting Scientist, Helen Diller Family Comprehensive Cancer Center, UCSF

We explore diverse sequence, structural, and chemical alterations contributing to sarcomagenesis by concurrently sequencing the genome, exome, transcriptome, and cytosine methylome of two patients with primary and recurrent liposarcoma and their matched normal adipose tissues. Integrative analyses revealed a modest point mutation rate accompanied by a burden of complex structural rearrangements that occur in different patterns, but arise from a common origin and with varied consequences on their tumor transcriptomes. Liposarcoma methylomes revealed differentiation pathway alterations and both genetic and epigenetic abnormalities point to a diverse small RNA component to liposarcomagenesis. Together, cross-validating multi-modality sequencing, despite the dearth of statistical and computational methodologies to analyze across sequence types, reveals the mutational and evolutionary processes at work in liposarcoma development and progression and a definitive genetic landscape of human tumors.


Thursday, May 5th

Integrative Analysis of Many ChIP-seq and ChIP-chip Experiments
Professor Hongkai Ji
Department of Biostatistics, Johns Hopkins University

ChIP-seq and ChIP-chip are widely used to study gene regulation. In this talk, I will introduce our recent work on integrating large amounts of ChIP data in public domains for improving data analysis and making novel discoveries. I will first illustrate the value of public data by introducing how they can be used to remove systematic bias in the ChIP experiments, and showing what one can learn from exploring 2000+ publicly available human and mouse ChIP samples in our recently developed hmChIP database. Then I will introduce a hierarchical mixture model for joint peak calling from multiple ChIP experiments. This approach not only allows one to study commonality and context-dependency of protein-DNA interactions, but also creates opportunities for borrowing information across datasets to improve statistical inference of noisy data sets. It also avoids exponentially growing parameter space. Finally, I will discuss how to compare multiple ChIP-seq profiles across different biological conditions, and how to integrate the ChIP data with publicly available gene expression data.