PB HLTH 292, Section 013
Statistics and Genomics Seminar

Fall 2008

Thursday, August 28th

Silent but Not Static: Accelerated Base-pair Substitution in Silenced Chromatin of Budding Yeasts
Leonid Teytelman
Department of Molecular and Cell Biology, UC Berkeley

Many plants, fungi, pathogens, and animals have chromosome regions that are silenced. Special proteins change the chromosome structure in these domains, turning genes off or lowering their expression levels. We found an increased frequency of DNA mutations in these silenced regions of closely related yeasts. This increase is likely due to silencing proteins interfering with DNA repair or replication. Accurate replication of genetic information with minimal mutations is usually critical for the survival and fitness of an organism; however, there are known examples where a high mutation rate is beneficial. The silenced chromosome parts are often associated with virus-like transposable elements and with genes that are important in responding to environmental changes. Hence, it is possible that elevated DNA mutations in silenced regions contribute to genome defense against transposable elements or increase genetic diversity to cope with variation in surrounding conditions.

Thursday, September 4th

Adaptive Evolution of Conserved Non-coding Elements
Dr. Su Yeon Kim
Department of Integrative Biology, UC Berkeley

Conservation of DNA sequences across evolutionary history is a highly informative signal for identifying regions with important biological functions. In particular, conserved non-coding regions have been shown to be good candidates for containing regulatory elements that have roles in gene regulation. Recent studies have found that there are many thousands of conserved non-coding elements (CNCs) in vertebrate genomes and have suggested possible functions for some of these elements, but the function of most CNCs remains unknown. To study the evolution of CNCs, we developed a statistical method called the "shared rates test" to identify CNCs that show changes in evolutionary rates on particular branches of the mammalian phylogenetic tree. Those rate changes may indicate changes in the function of a CNC. We applied our method to CNCs of five mammalian genomes, and found that indeed many CNCs have experienced rate changes during their evolution. We also found a subset of CNCs showing accelerations in evolutionary rate that actually exceed the neutral rates, suggesting that adaptive evolution has shaped the evolution of those elements.

Thursday, September 11th

Applications of Statistics in Examining Patient SNP Data
Houston N. Gilbert
Graduate Group in Biostatistics, UC Berkeley, and Intern, Early & Business Development, Genentech, Inc.

Patient genotype data are becoming increasingly available in many different contexts. This talk explores two such applications relating to ongoing biomarker research projects with applications in immunology and tissue growth and repair. In the first setting, a candidate gene approach is undertaken to confirm the presence of risk alleles in age-related macular degeneration (AMD) patients randomized to receive therapy with the FDA-approved treatment Lucentis. An additional goal is to identify possible genotype-treatment interactions which may identify sensitive and/or resistant population subsets. At the other end of the spectrum, we reexamine data from a genome-wide association study (GWAS) of systemic lupus erythematosus (SLE) patients. SLE is a heterogeneous syndrome with various clinical presentations, and recent results (Hom et al., NEJM 2008) suggest the presence of genetically distinct subsets of patients. We examine and discuss methods which seek to use SNP data in identifying (sets of) markers that predict particular clinical and biological subphenotypes associated with SLE (e.g., presence of anti-ENA Abs, IFN signature, etc.).

(This is joint work with Jane Fridlyand, PhD, and Robert Graham, PhD, Genentech, Inc.)

Thursday, September 18th

Beyond Jurassic Park: Remixing the Primordial Soup
Professor Ian Holmes
Department of Bioengineering, UC Berkeley

Reconstructing ancient languages is both the foundation of, and an active modern research area in, computational linguistics. Similar techniques can be used in molecular evolution, to probabilistically "reconstruct" long-dead gene sequences. An advantage that biologists have over linguists is direct experimental investigation: with the commercialization of cheap DNA synthesis, hypothetical ancient genes can be brought back to life, and laboratory tests performed on them using the tools of molecular biology.

Most published work in this field has dug into the history of the fastest kind of molecular evolutionary event in protein gene families: "point substitution" events, where one amino acid is replaced by another. However, we now have growing bodies of statistical theory (and preliminary computational results) for undoing more complex mutation events, such as indels and structural changes, and for working with larger sequences (e.g. whole genomes). These new models allow us to peer further back in time, suggesting new experimental questions.

Ultimately, looking backwards in time leads us to the RNA world. Can we even attempt to reconstruct ancient ribozymes? What should we look for if we do?

Thursday, September 25th

Comparative genome assembly and genome-scale comparison of regulatory networks between Saccharomyces and Zygosaccharomyces yeasts.
Dr. Devin Scannell
Eisen Lab, Lawrence Berkeley National Laboratory

We report the sequencing, assembly, annotation and analysis of the closely related Zygosaccharomyces yeast genomes Z. bailii, Z. bisporus and Z. kombuchaensis. We report both methodological and biological results. First, we use a novel multi-step comparative assembly procedure which we show to be much more efficient than current strategies which aim to sequence and assemble one genome at a time. Specifically, we show that for a fixed total amount of sequence we obtain similar N50s irrespective of whether we distribute the sequence over one, two, three or four species. In doing so, we show that moderate short-read coverage (at a cost of ~$1000 per genome) is sufficient to obtain N50s of ~20Kb in yeast. Our second result pertains to the evolution of gene regulation. We have annotated and aligned our Zygosaccharomyces assemblies and used comparative genomic techniques to identify hundreds of conserved DNA elements that are involved in the regulation of gene expression. By comparing these regulatory elements to those found by performing comparative genomics among several previously sequenced Saccharomyces yeasts, we perform the first genome-scale comparison of regulatory networks in any system. We observe extensive conservation but also striking examples of lineage-specific change. Finally, because the Zygosaccharomyces diverged from the Saccharomyces just prior to a whole-genome duplication event, we show how our data can be used to understand gene regulatory network evolution after gene duplication.
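For readers unfamiliar with the N50 statistic quoted in this abstract, the following is a minimal sketch of how it is commonly computed: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the toy contig lengths are illustrative assumptions and do not come from the talk.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly, given its contig lengths (common definition)."""
    lengths = sorted(contig_lengths, reverse=True)  # longest contigs first
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to half the assembly
        # size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must be non-empty")


# Toy example (hypothetical lengths, in kb)
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends only on the distribution of contig lengths, it says nothing about whether the contigs are assembled correctly, so it is usually reported alongside other quality checks.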

Thursday, October 2nd

A Simultaneous Change-point Model for Cross Sample Analysis of DNA Copy Number
Professor Nancy R. Zhang
Department of Statistics, Stanford University

We examine the problem of detecting recurrent copy number changes in multiple samples of DNA. This can be formulated as a statistical problem of simultaneous detection of shared change-points in multiple sequences. We consider the following general statistical model. For each sequence $i = 1, \ldots, N$ and position $t = 1, \ldots, T$, the random variables $y_{it}$ are mutually independent and normally distributed with mean values $\mu_{it}$ and variances $\sigma_i^2$. The null hypothesis states that for every sample $i$, $\mu_{it} = \mu_i$ for all $t$, whereas under the alternative there exist $J \subseteq \{1, \ldots, N\}$ and $1 \leq \tau_1 < \tau_2 \leq T$ such that for each $i \in J$, $\mu_{it} = \mu_{i0} + \delta_i I_{\{\tau_1 < t \leq \tau_2\}}$, where $\delta_i \neq 0$. We propose several statistics for this testing scenario, and derive approximations for their significance level and power. We discuss computationally efficient schemes for applying these tests iteratively to biological data sets to obtain interpretable, sparse summaries. We conclude by describing the results on two biological data sets. One data set comes from normal individuals, where the main goal is to detect germline polymorphisms. The other data set comes from a cohort of cancer patients, where most of the copy number changes are somatic. We discuss the different computational and statistical issues between cancer and normal data.
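To make the testing problem concrete, here is a minimal sketch of one natural scan statistic for this model: for each candidate interval $(\tau_1, \tau_2]$, standardize each sample's interval sum against its overall mean, then add the squared standardized values across samples. This is an illustration under stated assumptions, not necessarily one of the statistics proposed in the talk; the function name, the brute-force search over all intervals, and the crude per-sample variance estimate are all choices made here for brevity.

```python
import math


def scan_shared_changepoint(y):
    """Scan all intervals (tau1, tau2] and return (statistic, tau1, tau2).

    y is a list of N sequences, each of length T (one list per sample).
    """
    n_pos = len(y[0])
    prefix, means, sds = [], [], []
    for row in y:
        sums = [0.0]  # prefix sums: sums[t] = y_1 + ... + y_t
        for value in row:
            sums.append(sums[-1] + value)
        mean = sums[-1] / n_pos
        # Crude variance estimate from the whole sequence (an assumption).
        var = sum((v - mean) ** 2 for v in row) / (n_pos - 1)
        if var == 0:
            raise ValueError("each sequence needs non-zero variance")
        prefix.append(sums)
        means.append(mean)
        sds.append(math.sqrt(var))

    best = (float("-inf"), None, None)
    for tau1 in range(n_pos):
        for tau2 in range(tau1 + 1, n_pos + 1):
            width = tau2 - tau1
            if width == n_pos:
                continue  # the full sequence carries no change-point information
            scale = math.sqrt(width * (1 - width / n_pos))
            stat = 0.0
            for sums, mean, sd in zip(prefix, means, sds):
                # Standardized deviation of the interval sum from its null mean.
                u = (sums[tau2] - sums[tau1] - width * mean) / (sd * scale)
                stat += u * u
            if stat > best[0]:
                best = (stat, tau1, tau2)
    return best
```

The double loop costs on the order of $N T^2$ operations, which is only practical for short sequences; it is meant to show the shape of the computation rather than an efficient scheme.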

This is joint work with David Siegmund, Hanlee Ji, and Jun Li.

Thursday, October 9th

Hidden Markov Model-based CpG islands definition substantially increases overlap with functional elements in the genome
Professor Rafael Irizarry
Department of Biostatistics, Johns Hopkins University

The DNA of most vertebrates is depleted in CpG dinucleotides. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI) [1]. Knowing CGI locations is important because they mark the locations of functionally relevant epigenetic marks in development and disease [4]. The most widely used list of CGI is available from the UCSC Genome Browser [8]. This list was derived using algorithms that search for regions satisfying the original definition of CGI [2], which imposes requirements on the CpG density, GC content, and length. However, we find two problems with this approach: 1) it leaves out CpG clusters associated with functionally important epigenetic marks and 2) it does not apply to other genomes that exhibit CpG clustering. Here, we propose a model-based approach that produces a data-driven CGI definition and greatly increases overlap with promoters, alternative transcription starts, and differentially methylated regions. The model-based approach has the added advantages that 1) it can be fit to any genome, 2) sensitivity and specificity are controlled via a probabilistic approach, and 3) we can use statistical inference to test for the presence of CpG clusters. We fitted our model to genomes from twelve species. As expected, there is significant evidence for CGI in vertebrates. We also find evidence of CpG clusters in the genomes of various invertebrates, with much stronger evidence for A. mellifera (bee) and C. elegans (worm) than for D. melanogaster (fruit fly). Because these species do not exhibit CpG depletion, we refer to the clusters as CpG Mountains (CGM). We use the general term CpG clusters (CGC) to refer to both CGI and CGM. Lists of all CGC are available at www.rafalab.org.
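As background for the threshold-style definition that this abstract contrasts with its model-based approach, the sketch below checks a sequence window against requirements on length, GC content, and CpG density (observed/expected CpG ratio). The default thresholds (at least 200 bp, GC content of at least 50%, observed/expected ratio of at least 0.6) are the commonly cited values and are an assumption here, since the abstract does not list them. The function names are illustrative, and this is not the hidden Markov model described in the talk.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # number of CpG dinucleotides
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count if C and G occurred independently: c * g / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply threshold criteria (assumed defaults) to one window."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and obs_exp >= min_obs_exp


print(looks_like_cpg_island("CG" * 100))  # CpG-rich toy window
print(looks_like_cpg_island("AT" * 100))  # window with no C or G
```

Fixed cutoffs like these are what make the rule hard to transfer to genomes with different base composition, which is the limitation the abstract raises.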

Thursday, October 16th

Joint analysis of ChIP-seq and expression data to understand the regulatory mechanism of estrogen receptors
Dr. Xiaoyue Zhao

Estrogens play an important role in the reproductive system as well as in many other important parts of the body, including breast, bone, brain and heart. They have been used extensively to treat menopausal symptoms and osteoporosis in postmenopausal women. However, several studies have indicated that hormone therapy may have adverse side effects. Estrogens exert their biological effects by interacting with two known estrogen receptors (ERs), ER alpha and ER beta. Understanding the regulatory mechanism of estrogen receptors is therefore of great importance for developing effective and safe drugs. The recent development of ChIP-chip and ChIP-seq technology allows for genome-wide mapping of transcription factor binding sites. With ChIP-seq, we have mapped the binding regions of estrogen receptors in osteosarcoma U2OS cells. Estrogen responsive elements are significantly enriched in those regions. Intersecting the DNA binding profile with gene expression data obtained under the same conditions, we found that a significantly higher proportion of regulated genes than of random genes have predicted binding regions nearby. We also found that the additional binding of ERs in cells stimulated with estrogen seems to correlate well with the fold change in gene expression. With this joint analysis, we hope to identify the direct targets of ERs, differentiate active from non-active binding, and discover other important co-factors that may be involved.

Thursday, October 23rd

Studying high-throughput short read sequences with hidden Markov models for multiple processes: First results
Professor Antoine Chambaz
MAP 5, Universite Paris Descartes, and Division of Biostatistics, UC Berkeley

I will present in this talk the first results James and I obtained very recently while initiating our study of high-throughput short read sequences in the framework of hidden Markov models for multiple processes.

Joint work with James Bullard, Division of Biostatistics, UC Berkeley.

Thursday, October 30th

Defining true recurrences among ipsilateral breast tumor recurrences using DNA copy number data
Dr. Pierre Neuvial
Department of Statistics, UC Berkeley

Breast cancer patients treated with breast-conserving therapy run the risk of developing another tumor in the same breast. In this case, it is of major importance to determine whether the second cancer is a new primary (NP) or a true recurrence (TR) of the first one: an NP may be treated the same way as the primary tumor, while a TR will need more aggressive treatment.

The aim of this study is to improve on the existing distinction between NP and TR. Using DNA copy number data derived from Affymetrix SNP 50k arrays, we have developed a measure of similarity between two tumors which is based on the number of DNA copy number breakpoints they share. The resulting definition of NP and TR outperformed the clinically based definition in terms of metastasis-free survival.
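The abstract does not give the form of the similarity measure, so the following is only a hypothetical illustration of the general idea of scoring two tumors by the breakpoints they share. It assumes breakpoints are represented as (chromosome, position) pairs, counts exact matches, and normalizes by the size of the union (a Jaccard index); the representation, the normalization, and the function name are all assumptions made here and are not the authors' method.

```python
def breakpoint_similarity(breaks_a, breaks_b):
    """Return (number of shared breakpoints, Jaccard score) for two tumors.

    Each argument is an iterable of (chromosome, position) pairs.
    """
    a, b = set(breaks_a), set(breaks_b)
    shared = a & b  # breakpoints present in both tumors
    union = a | b
    score = len(shared) / len(union) if union else 0.0
    return len(shared), score


# Toy example with made-up breakpoints
primary = [("chr1", 100), ("chr8", 2500), ("chr17", 40)]
second = [("chr1", 100), ("chr17", 40), ("chr20", 7)]
print(breakpoint_similarity(primary, second))
```

Under a score of this kind, a high value would point toward a true recurrence and a low value toward a new primary; where to place the cutoff is a separate question that this sketch does not address.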

Joint work with M. A. Bollet and N. Servant.

Thursday, November 6th

BayesCall: A model-based basecalling algorithm for high-throughput short-read sequencing
Wei-Chun Kao
Department of Electrical Engineering and Computer Sciences, UC Berkeley

One important goal of genetic studies is to identify and characterize DNA sequence variation. To achieve this goal, it is necessary to acquire large datasets of human genome sequences. Indeed, it took 13 years and more than 3 billion dollars to sequence the first human genome in the Human Genome Project. It is virtually impossible to obtain sufficient sequence data without advances in sequencing technology. It is therefore important and urgent to have a low-cost, high-throughput, and high-quality sequencing technology. Recent breakthroughs in sequencing technologies bring new hope for low-cost DNA sequencing. New sequencing platforms, such as Solexa/Illumina, 454/Roche, and SOLiD/ABI, are several orders of magnitude cheaper and faster than the Sanger method. However, new difficulties, such as ultra-short read lengths and much higher error rates, come as a trade-off. In this talk, we will first give a brief introduction to the Solexa/Illumina platform. Then we will shift our focus to a novel model-based basecalling algorithm for it.

Joint work with Kristian Stevens and Yun Song.

Thursday, November 13th

Virtual screening, a chemogenomics approach
Laurent Jacob
Center for Computational Biology, Mines ParisTech

Predicting interactions between small molecules and proteins is a crucial step to decipher many biological processes, and plays a critical role in drug discovery. When no detailed 3D structure of the protein target is available, ligand-based virtual screening allows the construction of predictive models by learning to discriminate known ligands from non-ligands. However, the accuracy of ligand-based models quickly degrades when the number of known ligands decreases, and in particular the approach is not applicable for orphan receptors with no known ligand. We propose a systematic method to predict ligand-protein interactions, even for targets with no known 3D structure and few or no known ligands. Following the recent chemogenomics trend, we adopt a cross-target view and attempt to screen the chemical space against whole families of proteins simultaneously. The lack of known ligand for a given target can then be compensated by the availability of known ligands for similar targets. We test this strategy on three important classes of drug targets, namely enzymes, G-protein-coupled receptors (GPCR) and ion channels, and report dramatic improvements in prediction accuracy over classical ligand-based virtual screening, in particular for targets with few or no known ligands.

Joint work with Brice Hoffmann, Veronique Stoven and Jean-Philippe Vert.

Thursday, November 20th

Recognizing the right partner: Chromosome dynamics during meiosis
Professor Abby Dernburg
Department of Molecular and Cell Biology, UC Berkeley

Sexual reproduction requires the conserved cell division process of meiosis. The eponymous feature of this division is the partitioning of homologous chromosomes to different daughter cells. To accomplish this separation, homologous chromosomes must first establish physical connections through a highly choreographed process of pairing, synapsis, and recombination. Our work has explored these mysterious events, primarily in the nematode Caenorhabditis elegans. We have uncovered roles for special chromosome regions known as "Pairing Centers", which we have shown to mediate chromosome dynamics by connecting to the microtubule cytoskeleton through forces spanning the nuclear envelope. Accumulating evidence indicates that meiotic cells rely on a unique checkpoint mechanism to coordinate homolog pairing and synapsis, and that this mechanism is likely to be broadly conserved among eukaryotes.

Thursday, December 4th

Screening Yeast Transcription Factors for Feedback on Gene Expression
Charles Denby
Brem Lab, Department of Molecular and Cell Biology, UC Berkeley

Direct transcriptional feedback is highly prevalent in the transcriptional networks of organisms ranging from bacteria to vertebrates. In order to better characterize this phenomenon, I performed a screen to identify feedback on gene expression in a subset of Saccharomyces cerevisiae transcription factors in rich and minimal media. I discovered novel feedback effects in both conditions and found evidence for environment-specific feedback. In previous efforts by another group, feedback was indirectly inferred by whole-genome transcription factor binding experiments. Results from my screen showed that I was more likely to detect feedback for a transcription factor inferred to have feedback in their study than for one without inferred feedback. However, I detected feedback in a much larger proportion of transcription factors. I conclude that feedback is more prevalent in the yeast transcription factors than previously thought, and that key differences in my screening method enable further insight into the character of feedback.

Thursday, December 11th

Some Genomic Studies of Aging: from why exercising makes you younger to finding biomarkers of aging (using worms) to homologous patterns of aging among humans and invertebrates
Professor Alan Hubbard
Division of Biostatistics, UC Berkeley

Exercise and Aging
Human aging is associated with skeletal muscle atrophy and functional impairment. Multiple lines of evidence suggest that mitochondrial dysfunction is a major contributor to sarcopenia. We evaluated whether healthy aging was associated with a transcriptional profile reflecting mitochondrial impairment and whether resistance exercise could reverse this signature to that approximating a younger physiological age. Skeletal muscle biopsies from healthy older (N = 25) and younger (N = 26) adult men and women were compared using gene expression profiling, and a subset of these were related to measurements of muscle strength. Prior to the exercise training, the transcriptome profile showed a dramatic enrichment of genes associated with mitochondrial function with age. However, following exercise training the transcriptional signature of aging was markedly reversed back to that of younger levels for most genes that were affected by both age and exercise.

Biomarkers of Aging
Over the last several decades, there have been many attempts to identify potential biomarkers of aging. True biomarkers of aging would be useful to predict potential vulnerabilities in an individual that may arise well before they are chronologically expected, due to idiosyncratic aging rates that occur between individuals. Here, we report whole-genome expression profiles of individual wild-type Caenorhabditis elegans covering the entire nematode life span. Individual nematodes were scored for either age-related behavioral phenotypes, or survival, and then subsequently associated with their respective gene expression profiles. This facilitated the identification of transcriptional profiles that were highly associated with either physiological or chronological age. These profiles were then used to reasonably predict either age-related behavior, or the actual age of the nematodes themselves.

Conservation of aging processes between human and invertebrate species
The longevity network comprises 175 human homologs of proteins known to confer increased longevity through loss of function in yeast, nematode or fly, and 2,163 additional human proteins that interact with these homologs. Overall, the network consists of 3,271 binary interactions among 2,338 unique proteins. To examine the relationship of this network to human aging phenotypes, we compared the genes encoding aging network proteins to genes known to be changed transcriptionally during aging in human muscle. In the case of both the longevity protein homologs and their interactors, we observed dramatic enrichments for differentially expressed genes in the network. To determine whether homologs of human longevity interacting proteins can modulate lifespan in invertebrates, homologs of 18 human FRAP1 interacting proteins showing significant changes in human aging muscle were tested for effects on nematode lifespan using RNAi. Of 18 genes tested, 33 percent extended lifespan when knocked down in C. elegans. These observations indicate that a significant number of longevity genes identified in invertebrate models of aging have relevance to human aging. They also indicate that the longevity protein interaction network presented here is highly enriched for novel conserved longevity proteins.