PB HLTH 292, Section 013
Statistics and Genomics Seminar

Fall 2007



Thursday, August 30th

Database mining with biomaRt
Dr. Steffen Durinck
Division of Biostatistics, UC Berkeley, and Lawrence Berkeley National Laboratory

In recent years a wealth of biological data has become available in public data repositories. Easy access to these valuable data resources and firm integration with data analysis are needed for comprehensive bioinformatics data analysis. In the seminar, I'll present the biomaRt package, an R package that provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, UniProt and HapMap. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R. The biomaRt package is downloadable from the Bioconductor website: www.bioconductor.org.

Slides: PDF


Thursday, September 6th

The geometry of neighbor joining and minimum evolution trees
Peter Huggins
Department of Mathematics, UC Berkeley

For distance-based phylogenetic reconstruction, neighbor joining is fast and practical, while the minimum evolution method is well-principled but computationally hard. Although the two methods seem unrelated, recent work has shown that neighbor joining is actually a greedy heuristic for finding minimum evolution trees. Inspired by this result, we present a unified geometric framework for comparing neighbor joining and minimum evolution methods for small numbers of taxa. Our computational results suggest new conjectures about the theoretical performance of both methods.

This talk is based on joint work with Kord Eickmeyer, Lior Pachter, and Ruriko Yoshida.
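
As a rough illustration of the greedy step in neighbor joining mentioned in the abstract, the sketch below computes the standard Q-criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and picks the pair of taxa that minimizes it. The function name and the distance matrix are illustrative assumptions, not material from the talk, and only the pair-selection step is shown, not the full tree construction.

```python
def nj_pick_pair(d):
    """Return the pair (i, j) minimizing the neighbor-joining Q-criterion,
    together with its Q value, for a symmetric distance matrix d."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Made-up distances between five taxa (hypothetical example data).
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q = nj_pick_pair(dist)
print(pair, q)
```

Note that the selected pair (taxa 0 and 1) is not the pair with the smallest raw distance (taxa 3 and 4): the criterion also rewards pairs that are far from everything else.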


Thursday, September 20th

Ultraconserved nonsense and unproductive mRNA splicing
Liana Lareau
Department of Molecular and Cell Biology, UC Berkeley

The human and mouse genomes share a number of long, perfectly-conserved nucleotide sequences termed ultraconserved elements. While these regions can act as transcriptional enhancers when found upstream of genes, those within genes are less understood. We have found that in every gene of the human SR family of splicing regulators, ultra- or highly-conserved elements are alternatively spliced. These alternative splicing events target the resulting mRNAs for degradation via an RNA surveillance pathway called nonsense-mediated mRNA decay. Our results suggest that this unproductive splicing is important for expression of the entire SR family and illuminate one role for ultraconserved elements. Unproductive splicing seems to have evolved independently in the different SR genes, suggesting that splicing factors may readily acquire this form of regulation and providing an opportunity to examine the complex histories of ultraconserved genomic regions.


Thursday, September 27th

Evolution and Speciation in Saccharomyces yeasts - Insights Yielded By Genomic Approaches
Professor Gavin Sherlock
Department of Genetics, Stanford University

In the classical model of asexual evolution proposed by Muller in 1932, microorganisms undergo periodic selection, by which a beneficial mutant "sweeps" over the population. This series of expansions of beneficial mutants is termed "adaptive sweeps". However, recent evidence has emerged suggesting that "adaptive sweeps" may not be complete in evolving populations, but as yet no population-wide experimental study has determined whether this is indeed true. To answer this question, we evolved fluorescently marked yeast strains under nutrient-limited conditions, such that we are able to track more precisely the expansion of beneficial mutants and determine whether these "adaptive sweeps" are complete. Using genomic tools, we are also able to characterize the molecular mechanisms behind each adaptive event and use such data to reveal the fitness landscape of the evolving populations.

As new species arise, they are often post-zygotically isolated, meaning that they are able to mate, but form sterile or inviable hybrids. One theory of post-zygotic speciation is the Dobzhansky-Muller model, which states that genic incompatibility at two or more loci can lead to post-zygotic isolation between two species. Dobzhansky-Muller genic incompatibility has been shown to contribute to post-zygotic isolation between species of Drosophila. However, only a few speciation genes and only one pair of interacting speciation genes have been identified to date in any species. The Saccharomyces sensu stricto group of yeasts can readily mate with each other, forming sterile hybrids (<1% viable spores), indicating that they are post-zygotically isolated. The availability of genome sequence for members of the Saccharomyces sensu stricto group makes them an attractive model system for identifying speciation genes. We have taken a genomic approach in an effort to identify such Dobzhansky-Muller genic incompatibility loci.


Thursday, October 4th

Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays
Professor Ingo Ruczinski
Department of Biostatistics, Johns Hopkins University

Many chromosomal alterations such as amplifications and deletions have been associated with disease. For the genome-wide assessment of such alterations, high-throughput single nucleotide polymorphism (SNP) arrays are often used, estimating DNA copy number and genotypes at up to 1 million loci. Hidden Markov Models (HMMs) are particularly useful for inferring chromosomal alterations from SNP array data, modeling the spatial dependence between neighboring SNPs. We propose a novel HMM for the assessment of the underlying chromosomal states, simultaneously integrating copy number estimates, genotype calls, and the corresponding measures of uncertainty when available. We also show how parent-of-origin effects can be assessed in family data, and discuss the publicly available software packages we have developed.
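
To make the idea of modeling dependence between neighboring SNPs concrete, here is a minimal Viterbi sketch for a three-state copy-number HMM. It is not the model proposed in the talk: the state names, Gaussian emission means, standard deviation, transition probabilities, and the input values are all assumptions chosen for illustration, and it uses only copy-number log-ratios (no genotype calls or uncertainty measures).

```python
import math

# All parameters below are illustrative assumptions, not values from the talk.
STATES = ["loss", "normal", "gain"]
MEANS = {"loss": -1.0, "normal": 0.0, "gain": 0.58}  # assumed log2-ratio means
SD = 0.25                                            # assumed emission std. dev.
STAY = 0.98                                          # assumed P(stay in state)
START = {"loss": 0.05, "normal": 0.90, "gain": 0.05}


def log_emission(x, state):
    """Log density of a Gaussian emission for observation x in a state."""
    z = (x - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log transition probability between neighboring SNPs."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(obs):
    """Return the most probable state path for a list of log2 ratios."""
    score = {s: math.log(START[s]) + log_emission(obs[0], s) for s in STATES}
    back = []
    for x in obs[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = prev
            new_score[cur] = (score[prev] + log_transition(prev, cur)
                              + log_emission(x, cur))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical log2 ratios along a chromosome, with a short deleted segment.
log_ratios = [0.05, -0.02, 0.0, -0.95, -1.1, -1.0, 0.03, 0.0]
path = viterbi(log_ratios)
print(path)
```

The high probability of staying in the same state is what encodes spatial dependence: a single noisy SNP is unlikely to trigger a state change, while a run of consistently low values is.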


Thursday, October 11th

Gene Expression Network Analysis
Dr. Şerban Nacu
Genentech Inc.

We address the problem of integrating two kinds of biological data: microarray data and information on pathways and gene interactions. Microarray experiments are a rich source of gene expression measurements; standard analysis seeks individual genes that are differentially expressed. Since in practice genes are part of pathways and interact in various ways, it is desirable to look for differentially expressed groups, rather than single genes. Several recent algorithms attempt this approach, which has the potential to be more biologically meaningful and have higher statistical power than the single-gene analysis.

Following an idea of Ideker et al. (2002), we introduced the GXNA algorithm, which uses a gene interaction network to search for differentially expressed groups of related genes. GXNA has several desirable features, such as fast runtimes and the computation of significance levels adjusted for multiple testing. We give an overview of the algorithm and compare it with some of the alternatives. On several data sets related to the immunology of cancer, GXNA identifies interesting pathways that are not found by single-gene analysis.

We also give a brief overview of the publicly available software that implements the algorithm, and of ongoing work to improve its performance and user interface.

This is joint work with Susan Holmes, Rebecca Critchley-Thorne, and Peter Lee.

References

R. Critchley-Thorne, N. Yan, Ş. Nacu, S. Holmes, and P. P. Lee. Inhibition of interferon signaling in lymphocytes in metastatic melanoma patients. PLoS Medicine (2007).

T. Ideker, O. Ozier, B. Schwikowski, and A. Siegel. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics (2002).

Ş. Nacu, R. Critchley-Thorne, P. P. Lee, and S. Holmes. Gene expression network analysis, and applications to immunology. Bioinformatics (2007).


Thursday, October 18th

Statistics in drug development - from instruments, bench sciences and clinical data analysis
Dr. Isaac Cohen
Co-Founder and Chief Scientific Officer, Bionovo

An overview discussion of the statistical questions present in the following disciplines:

1. Biology: molecular biology techniques, genomics, proteomics, and metabolomics.

2. Chemistry: chemical structure identification, analytical chemistry, pharmacology and process chemistry.

3. Clinical trials: how do you answer a clinical question?


Thursday, November 1st

A Genealogical Approach to Studying Asexuality
Dr. Yun S. Song
Section of Evolution and Ecology, UC Davis

Given molecular genetic data from diploid individuals that, at present, reproduce mostly or exclusively asexually, an important problem in evolutionary biology is detecting evidence of past sexual reproduction (i.e., meiosis and mating) and recombination (both meiotic and mitotic). However, currently there is a lack of computational tools for carrying out such a study. In this talk, I will describe a method of testing asexuality by explicitly considering the evolutionary history of asexual diploid individuals.


Thursday, November 8th

Assessing disease predisposition from the genome: Implications and challenges for clinical applications
Dr. Martin G. Reese
CEO, President, and Co-Founder, Omicia

Most common human diseases are multifactorial and caused by a combination of genetic, epigenetic and environmental factors. Furthermore, they are multigenic, meaning that multiple genes or functional sites in the genome are involved. The relative contributions of these components are unknown, but heritabilities of cardiovascular disease and diabetes have been estimated to reach 60%. While heritable disease-related phenotypes and their genetic and molecular causes are catalogued in the Online Mendelian Inheritance in Man (OMIM), its organization is disease-centric, with genetic variants reported primarily as amino acid changes, e.g., ARG123CYS. To obtain a more "genome-centric" view of OMIM and to increase its usefulness to the community, we have mapped OMIM variants onto the latest assembly of the human genome sequence as nucleotide changes leading to protein changes.

In my talk I shall present:

(1) A system to catalogue disease-alleles onto the human genome sequence.

(2) A comprehensive and current "human disease gene" collection, consisting of over 2,200 genes with documented associations predisposing to diverse disease phenotypes.

(3) Implications and challenges for medical applications.
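
The step from an amino acid change such as ARG123CYS to candidate nucleotide changes, described in the abstract above, can be illustrated with a small sketch. It parses the three-letter notation and lists the single-nucleotide codon substitutions in the standard genetic code that would produce the change. This is only an illustration under stated assumptions: the function names are hypothetical, and the actual mapping onto a genome assembly (which needs the reference transcript and coordinates) is not shown.

```python
import re
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def parse_change(text):
    """Split a change like 'ARG123CYS' into (ref_aa, position, alt_aa)."""
    match = re.fullmatch(r"([A-Z]{3})(\d+)([A-Z]{3})", text.upper())
    if match is None:
        raise ValueError(f"unrecognized amino acid change: {text!r}")
    ref, pos, alt = match.groups()
    return THREE_TO_ONE[ref], int(pos), THREE_TO_ONE[alt]


def single_nucleotide_routes(ref_aa, alt_aa):
    """List (ref_codon, alt_codon) pairs differing at exactly one base."""
    routes = []
    for codon, aa in CODON_TABLE.items():
        if aa != ref_aa:
            continue
        for pos, base in product(range(3), BASES):
            if base == codon[pos]:
                continue
            mutated = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutated] == alt_aa:
                routes.append((codon, mutated))
    return sorted(routes)


ref_aa, position, alt_aa = parse_change("ARG123CYS")
routes = single_nucleotide_routes(ref_aa, alt_aa)
print(ref_aa, position, alt_aa, routes)
```

For an arginine-to-cysteine change only two of the six arginine codons admit a single-base route, which shows why knowing the reference codon at the site narrows an amino acid change down to a specific nucleotide change.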


Thursday, November 15th

Towards an automated screening of biorisk-associated DNA and protein sequences
Markus Fischer
Entelechon

A current problem in synthetic biology is the increasing gap between the power of synthesis methods and the ability to identify potentially threatening synthesis projects at oligo and gene synthesis facilities. The high throughput in this sector necessitates automatic screening of incoming DNA and protein sequences and their reliable classification as either risk-associated or harmless genes. Such a classification cannot be based on the source organism alone, but needs to take into account the specific properties of individual coding sequences. It is assumed that the largest class of biorisk-associated sequences comprises genes that are related to the pathogenicity of their host, the so-called virulence factors. To identify virulence factors, a two-component system is implemented. In a first step, a database of complete genomes of pathogens is compiled, with an interface for the annotation of virulence factors. This allows human editors to classify genes based on experimental data. A machine learning algorithm is then trained on the annotated sequences to classify new gene sequences, based on text mining of publications that are associated with these sequences. The interaction of these two components can progressively improve the quality of the database: human contributions will improve the performance of the machine learning, which in turn will provide pre-annotated data that can be edited more easily.


Thursday, November 29th

Detecting Neanderthal admixture from DNA sequence data
Professor Jeff Wall
Division of Biostatistics, UCSF

This talk discusses two different approaches for estimating what contribution (if any) Neanderthals and other 'archaic' human groups made to the current human gene pool. One method directly compares orthologous Neanderthal and modern human DNA sequences, while the other indirectly analyzes contemporary human sequences only. Analyses of recent data suggest that Neanderthals made a small (but nonzero) contribution to the modern gene pool.


Thursday, December 6th

Re-Cracking the Second Genetic Code
Professor Mark Segal
Division of Biostatistics, UCSF

In a recent, widely celebrated computational biology paper, Segal et al. (Nature, 2006) provide extensive evidence supporting the existence of a second genetic code embodied in DNA. This second code pertains to the positioning of nucleosomes (the fundamental repeating subunits of all eukaryotic chromatin), which are responsible for packaging DNA into chromosomes inside the cell nucleus and controlling gene expression. Here, we re-evaluate both the basis for, and performance of, the proposed nucleosome positioning code. Tools employed in this process include the spectral envelope and discriminatory motif finding.


Thursday, December 13th

To Tree or Not To Tree: Phylogeny and Microbial Community Comparisons
Professor Rob Knight
Department of Chemistry and Biochemistry, University of Colorado, Boulder

The explosion of 16S rRNA sequence data in the public databases and the availability of high-throughput sequencing methods enable us to get a global view of microbial diversity for the first time. One key issue is how we should compare communities: should we use a similarity cutoff and treat all taxa at that level equally, as is commonly done for macroorganisms, or should we use information about the relationships among taxa? Similarly, should we focus on presence/absence of specific lineages, or on measures that take abundance into account? In this seminar, I describe some of the advantages and disadvantages of using the phylogenetic approach, as in our UniFrac software, and show the implications of the different approaches for the clustering of samples from hundreds of diverse environments, including a range of free-living and animal-associated environments. I argue that even a very bad tree is more useful for community comparisons than the star phylogeny that tree-independent methods implicitly assume.
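
The contrast between tree-based and tree-independent comparisons can be sketched with a small presence/absence example. The code below computes an unweighted UniFrac-style distance: the fraction of branch length, among branches leading to taxa found in either community, that leads to taxa found in only one of them. It is a minimal sketch and not the UniFrac software itself; the tree encoding (child mapped to parent and branch length), the function names, and the toy trees are assumptions made for illustration.

```python
def leaves_below(tree, node):
    """Return the set of leaf names at or below a node.
    The tree maps each non-root node to (parent, branch_length)."""
    children = [c for c, (parent, _) in tree.items() if parent == node]
    if not children:
        return {node}
    result = set()
    for child in children:
        result |= leaves_below(tree, child)
    return result


def unweighted_unifrac(tree, community_a, community_b):
    """Fraction of covered branch length unique to one community."""
    unique = total = 0.0
    for node, (_, length) in tree.items():
        below = leaves_below(tree, node)
        in_a = bool(below & community_a)
        in_b = bool(below & community_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    if total == 0:
        raise ValueError("neither community contains a taxon in the tree")
    return unique / total


# Hypothetical tree: ((a, b)X, (c, d)Y)root, all branch lengths 1.
tree = {
    "X": ("root", 1.0), "Y": ("root", 1.0),
    "a": ("X", 1.0), "b": ("X", 1.0),
    "c": ("Y", 1.0), "d": ("Y", 1.0),
}
# Star phylogeny over the same taxa: every taxon hangs directly off the root.
star = {taxon: ("root", 1.0) for taxon in "abcd"}

sample_1, sample_2 = {"a", "b"}, {"b", "c"}
d_tree = unweighted_unifrac(tree, sample_1, sample_2)
d_star = unweighted_unifrac(star, sample_1, sample_2)
print(d_tree, d_star)
```

On the star phylogeny with equal branch lengths the measure reduces to one minus the Jaccard index of the two taxon sets, which is the sense in which tree-independent presence/absence methods implicitly assume a star tree.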