PB HLTH 292, Section 020
Statistics and Genomics Seminar
Spring 2011
Thursday, January 20th
Mutation, Copy Number and LOH in
Cancer Genomes via Next Generation Sequencing
Dr. Peter M. Haverty
Genentech
Although previous studies have identified important common somatic
mutations in lung cancers, they have primarily focused on a limited
set of genes and have thus provided a constrained view of the
mutational spectrum. Here we present the complete sequences of a
primary lung tumour (60x coverage) and adjacent normal tissue
(46x).
Comparing the two genomes, we identify a wide variety of somatic
variations, including >50,000 high-confidence single nucleotide
variants. While many somatic mutations with oncogenic potential were
detected, we also observed a distinct pattern of selection against
mutations within
expressed genes, compared to non-expressed genes, and in promoter
regions up to 5 kilobases upstream of all protein-coding genes.
Analysis of sequencing read frequencies across the genome revealed
aspects of DNA copy number alterations and Loss-of-Heterozygosity not
detectable by SNP Array.
Thursday, January 27th
Using Control Genes to Correct for
Unwanted Variation in Microarray Data
Johann Gagnon-Bartsch
Department of Statistics, UC Berkeley
In studies using microarray data,
measured gene expression levels are associated both with factors we
are interested in (e.g. treatment/control) and also with irrelevant
factors (e.g. sources of systematic technical error). We would like to
adjust for the unwanted variation. Several authors have proposed
variants of factor analysis to identify the irrelevant factors. The
main problem with this approach is that we may "over-correct" and
remove some of the interesting biology. To avoid this problem we
propose various methods that make use of control genes -- genes known a priori to be unassociated with the factor of interest. We present some of our methods, along with their relative strengths and weaknesses.
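The control-gene idea can be illustrated with a minimal sketch. This is not the speaker's method; it assumes a single unwanted factor, estimated as the per-sample mean of the control genes, which is then included as a covariate when estimating the treatment effect for each gene. The names `unwanted_factor` and `adjusted_effect` are hypothetical.

```python
def center(values):
    """Subtract the mean from every entry."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))


def unwanted_factor(expr, control_genes):
    """Estimate one unwanted factor per sample as the centered mean of the
    control genes (assumed to be unassociated with the factor of interest)."""
    n_samples = len(expr[control_genes[0]])
    w = [sum(expr[g][j] for g in control_genes) / len(control_genes)
         for j in range(n_samples)]
    return center(w)


def adjusted_effect(y, x, w):
    """Least-squares coefficient of x in the model y ~ 1 + x + w.
    Uses the closed form for two centered regressors; it fails with a
    ZeroDivisionError if x and w are perfectly collinear."""
    y, x, w = center(y), center(x), center(w)
    sxx, sww, sxw = dot(x, x), dot(w, w), dot(x, w)
    sxy, swy = dot(x, y), dot(w, y)
    return (sxy * sww - sxw * swy) / (sxx * sww - sxw ** 2)


# Toy data: four samples, treatment indicator x, two control genes that
# track a batch effect, and one gene of interest affected by both.
x = [0, 0, 1, 1]
expr = {
    "ctrl1": [1, -1, 1, -1],
    "ctrl2": [2, -2, 2, -2],
    "geneA": [3, -3, 5, -1],
}
w = unwanted_factor(expr, ["ctrl1", "ctrl2"])
effect = adjusted_effect(expr["geneA"], x, w)
```

Because the unwanted factor is estimated only from genes assumed to carry no treatment signal, the adjustment in this toy setting cannot absorb the treatment effect itself, which is the "over-correction" concern raised in the abstract.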
Thursday, February 3rd
Genome Surveillance by Small RNAs
Professor Kathleen Collins
Department of Molecular and Cell Biology, UC Berkeley
Argonaute proteins carry small RNAs
(sRNAs) that confer sequence specificity for gene and genome
regulation. The single-celled protozoan Tetrahymena encodes numerous
Argonaute proteins exclusively of the Piwi clade otherwise found in
animal germline and stem cells. Deep sequencing of Twi-bound and
total sRNAs in strains disrupted for various RNA silencing machinery
components revealed an unanticipated diversity of sRNA
classes. Altogether, Twis distinguish sRNAs derived from loci of
pseudogene families, virus-like structured RNAs, or complementary
protein-coding transcripts. We are investigating the significance of
maintaining these RNA 'codes' for different deleterious types of
transcripts, studying their roles in determining gene expression and
genome structure.
Thursday, February 17th
Correlations of ChIP-Seq Peaks and
Other Genomic Signals
Professor Niels Richard Hansen
Department of Mathematical Sciences, University of Copenhagen
One main question when analyzing a positional, genomic signal, such
as peaks called from ChIP-seq data, is how the signal correlates
with other signals or genome annotations. Judging from the literature, this
appears to be a non-trivial question from a methodological point of
view, with no existing gold standard. However, turning to the spatial
statistics literature, standard measures such as Ripley's K-function
are found to be useful.
In this talk we will first show how to establish a useful measure of
correlation closely related to Ripley's K-function and a simple,
simulation-based assessment of statistical significance. Second, we
ask whether we are really interested in the marginal correlations, or
whether we would rather measure partial correlations instead. If so,
we can in some situations establish a simple partial correlation
measure and a similarly simple estimator, but in general we propose a
model-based approach and estimation based on penalized MLE.
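As a rough illustration of a K-function-style measure with a simulation-based significance assessment, the sketch below computes a one-dimensional cross-K estimate between two sets of genomic positions. It is not the measure presented in the talk. The function names, the absence of edge correction, and the choice of random circular shifts as the null model are all assumptions made for the example.

```python
import bisect
import random


def cross_k(a, b, r, genome_length):
    """1D cross-K estimate without edge correction:
    genome_length / (n_a * n_b) * #{(i, j): |a_i - b_j| <= r}."""
    b_sorted = sorted(b)
    pairs = 0
    for pos in a:
        lo = bisect.bisect_left(b_sorted, pos - r)
        hi = bisect.bisect_right(b_sorted, pos + r)
        pairs += hi - lo
    return genome_length * pairs / (len(a) * len(b))


def shift_p_value(a, b, r, genome_length, n_sim=200, seed=0):
    """Monte Carlo p-value for the observed cross-K value. The null
    distribution is simulated by circularly shifting b by a random offset,
    which keeps the spacing within b but breaks its alignment with a."""
    rng = random.Random(seed)
    observed = cross_k(a, b, r, genome_length)
    at_least_as_large = 0
    for _ in range(n_sim):
        shift = rng.randrange(genome_length)
        shifted = [(pos + shift) % genome_length for pos in b]
        if cross_k(a, shifted, r, genome_length) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_sim + 1)


# Toy positions on a hypothetical chromosome of length 10,000.
peaks = [100, 500, 900]
annotations = [110, 490, 2000]
k_value, p_value = shift_p_value(peaks, annotations, r=20, genome_length=10000)
```

For two independent, uniformly scattered signals the cross-K value at distance r is approximately 2r, so values well above that suggest the two signals tend to occur near each other.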
Thursday, February 24th
Simple Gene Estimates from RNA-Seq
Professor Elizabeth Purdom
Department of Statistics, UC Berkeley
Sequencing technology is now the platform of choice for many
researchers trying to quantify expression of mRNA. Sequencing data
offers a great deal of additional specificity over comparable
microarray platforms. However, sequencing data comes with a large
computational overhead. Furthermore, the presence of alternative
splicing, found in most higher organisms, adds enormous complexity to
analyzing the data, invalidating most simple methods of analyzing the
sequencing data. Currently the primary approach to dealing with
alternative splicing is to explicitly estimate the expression levels
of individual isoforms, which requires either a known isoform
annotation or one estimated from the data. In addition, the resulting
estimates of isoform abundance do not provide a convenient summary of
the data for further model checking or quality control. We will
discuss our work in formulating estimation procedures that rely on
simple count summaries of the data. In particular, we will present
approaches for annotation-free estimates of gene expression levels,
and, if time permits, touch on options for addressing alternative splicing.
Thursday, March 3rd
Modeling Diversity in Tumor Populations
Professor Rick Durrett
Department of Mathematics, Duke University
Heterogeneity of cancer cell
populations makes treatment difficult because most drugs target one
particular mutational change. In this talk I will discuss a branching
process model in which mutations make random changes in the birth
rates, in order to study the variation among the cells within a single
tumor. We have results for the asymptotic growth rates of the
population as well as for two commonly used measures of diversity. We
get surprisingly explicit conclusions thanks to old results for
one-sided stable laws.
(Joint work w/ J. Foo, K. Leder and F. Michor at the Dana Farber
Cancer Institute, and former Cornell postdoc J. Mayberry now at U. of
the Pacific.)
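A branching process of this general kind can be explored by simulation. The sketch below is only an illustration under stated assumptions, not the model analyzed in the talk: all parameter values are made up, a mutation converts one cell into a new clone whose birth rate is increased by a uniform random amount, and Simpson's index is used as an example diversity measure (the abstract does not name the two measures studied).

```python
import random


def simulate_tumor(birth0=1.0, death=0.5, mutation_rate=0.05, max_step=0.2,
                   t_end=6.0, max_cells=5000, seed=1):
    """Gillespie-style simulation of a multi-type branching process.
    Each clone is stored as [birth_rate, cell_count]."""
    rng = random.Random(seed)
    clones = [[birth0, 1]]
    t = 0.0
    while True:
        total_cells = sum(count for _, count in clones)
        if total_cells == 0 or total_cells >= max_cells:
            break
        # Total event rate per clone: births + deaths + mutations.
        rates = [count * (birth + death + mutation_rate)
                 for birth, count in clones]
        t += rng.expovariate(sum(rates))
        if t > t_end:
            break
        i = rng.choices(range(len(clones)), weights=rates)[0]
        birth = clones[i][0]
        u = rng.random() * (birth + death + mutation_rate)
        if u < birth:
            clones[i][1] += 1          # cell division
        elif u < birth + death:
            clones[i][1] -= 1          # cell death
        else:
            clones[i][1] -= 1          # one cell mutates into a new clone
            clones.append([birth + rng.uniform(0.0, max_step), 1])
    return [clone for clone in clones if clone[1] > 0]


def simpson_index(clones):
    """Probability that two cells drawn with replacement share a clone."""
    total = sum(count for _, count in clones)
    return sum((count / total) ** 2 for _, count in clones)


tumor = simulate_tumor()
diversity = simpson_index(tumor) if tumor else None  # None if extinct
```

A Simpson's index near 1 means a single clone dominates the population, while values near 0 indicate many clones of comparable size.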
Thursday, March 10th
Functional
Genomics Employing the Vertebrate Model Organism Zebrafish Danio rerio
Professor Su Guo
Department of Bioengineering and Therapeutic Sciences, UCSF
The small size, high fecundity, rapid development, and transparent
nature have made zebrafish Danio rerio an important model organism for
functional studies of the vertebrate genome. In this talk, I will
discuss our work on: 1) functional gene discovery using induced
mutations; 2) analysis of the genome to identify tissue- and cell-type
specific enhancers.
Thursday, March 31st
Gene Expression Profiles from
Formalin Fixed Paraffin Embedded Breast Cancer Tissue Are Largely
Comparable to Fresh Frozen Matched Tissue
Dr. Lorenza Mittempergher
Departments of Pathology and Laboratory Medicine, UCSF
Formalin Fixed Paraffin Embedded (FFPE) samples represent a valuable
resource for cancer research. However, the discovery and development
of new cancer biomarkers often requires fresh frozen (FF)
samples. Recently, the Whole Genome (WG) DASL (cDNA-mediated
Annealing, Selection, extension and Ligation) assay was specifically
developed to profile FFPE tissue. However, a thorough comparison of
data generated from FFPE RNA and Fresh Frozen (FF) RNA using this
platform is lacking. To this end we profiled, in duplicate, 20 FFPE
tissues and 20 matched FF tissues and evaluated the concordance of the
DASL results from FFPE and matched FF material. We show that after
proper normalization, all FFPE and FF pairs exhibit a high level of
similarity (Pearson correlation > 0.7), significantly larger than the
similarity between non-paired samples. Interestingly, the probes
showing the highest correlation had a higher percentage G/C content
and were enriched for cell cycle genes. Predictions of gene expression
signatures developed on frozen material (Intrinsic subtype, Genomic
Grade Index, 70 gene signature) showed a high level of concordance
between FFPE and FF matched pairs. Interestingly, predictions based on
a 60 gene DASL list (best match with the 70 gene signature) showed
very high concordance with the MammaPrint® results. We demonstrate
that data generated from FFPE material with the DASL assay, if
properly processed, are comparable to data extracted from the FF
counterpart. Specifically, gene expression profiles for a known set of
prognostic genes for a specific disease are highly comparable between
the two conditions. This opens up the possibility of using both FFPE and
FF material in gene expression analyses, leading to a vast increase
in the potential resources available for cancer research.
Thursday, April 7th
Characterizing cancer progression
from the tumor cells and their
microenvironment
Dr. Franck Rapaport
Computational Biology Center, Memorial Sloan Kettering Cancer
Center
Cancer progression is often driven by an accumulation of genetic
changes but is also accompanied by increasing genomic instability. These
processes lead to a complicated landscape of copy number alterations
(CNAs) within individual tumors and great diversity across tumor
samples. High-resolution copy number profiling is being used to profile
CNAs of ever larger tumor collections, and better computational methods
for processing these data sets and identifying potential driver CNAs
are needed. We designed new methods that exploit genomic-level
correlations in copy number profiles to discover subsets of samples
that display common CNAs.
In addition to alterations in the tumor cells themselves, the host
microenvironment can play an important role in tumor development. In
order to dissect and model the complex and reciprocal interplay between the
tumor and stromal cells of the microenvironment, we devised an
experimental and computational strategy to enable the simultaneous
analysis of tumor and stromal genes in metastatic tumors from three
distinct microenvironments.
Thursday, April 28th
Concurrent sequencing of human cancers
Dr. Barry S. Taylor
Memorial Sloan-Kettering Cancer Center and Visiting Scientist, Helen
Diller Family Comprehensive Cancer Center, UCSF
We explore diverse sequence,
structural, and chemical alterations contributing to sarcomagenesis by
concurrently sequencing the genome, exome, transcriptome, and cytosine
methylome of two patients with primary and recurrent liposarcoma and
their matched normal adipose tissues. Integrative analyses revealed a
modest point mutation rate accompanied by a burden of complex
structural rearrangements that occur in different patterns, but arise
from a common origin and with varied consequences on their tumor
transcriptomes. Liposarcoma methylomes revealed differentiation
pathway alterations and both genetic and epigenetic abnormalities
point to a diverse small RNA component to
liposarcomagenesis. Together, cross-validating multi-modality
sequencing, despite the dearth of statistical and computational
methodologies to analyze across sequence types, reveals the mutational
and evolutionary processes at work in liposarcoma development and
progression and a definitive genetic landscape of human tumors.
Thursday, May 5th
Integrative Analysis of Many ChIP-seq
and ChIP-chip Experiments
Professor Hongkai Ji
Department of Biostatistics, Johns Hopkins University
ChIP-seq and ChIP-chip are widely used
to study gene regulation. In this talk, I will introduce our recent
work on integrating large amounts of ChIP data in the public domain for
improving data analysis and making novel discoveries. I will first
illustrate the value of public data by introducing how they can be
used to remove systematic bias in the ChIP experiments, and showing
what one can learn from exploring 2000+ publicly available human and
mouse ChIP samples in our recently developed hmChIP database. Then I
will introduce a hierarchical mixture model for joint peak calling
from multiple ChIP experiments. This approach not only allows one to
study commonality and context-dependency of protein-DNA interactions,
but also creates opportunities for borrowing information across
datasets to improve statistical inference of noisy data sets. It also
avoids an exponentially growing parameter space. Finally, I will discuss
how to compare multiple ChIP-seq profiles across different biological
conditions, and how to integrate the ChIP data with publicly available
gene expression data.