PB HLTH 292, Section 013
Statistics and Genomics Seminar

Fall 2006



Thursday, August 31st

A statistical framework to infer functional gene associations from multiple biologically interrelated microarray experiments
Professor Haiyan Huang
Department of Statistics, UC Berkeley

Inferring functional gene relationships is a major step in understanding biological networks. Microarray data from an increasing number of biologically interrelated experiments now allow for more complete portrayals of the functional gene relationships involved in biological processes. Current studies of gene relationships, however, have widely ignored the dependencies between gene expression measurements that arise from biologically interrelated experiments. When not accounted for, these experimental dependencies can result in inaccurate inferences of functional gene relationships, and hence incorrect biological conclusions.

In this talk, I will introduce a statistical framework and a novel gene co-expression measure, named the Knorm correlation, to address this problem. The most important aspect of the proposed model is its ability to decompose the biological variation in gene expression into two mutually independent components, one arising from the genes and one from the experiments, in addition to variation due to random noise. As a result, the Knorm correlation can de-correlate the experimental dependencies before estimating the gene relationships, leading to improved accuracy in inferring functional gene relationships. The Knorm correlation simplifies to the Pearson coefficient when the experiments are uncorrelated. Using simulation studies, a yeast microarray dataset, and a human microarray dataset, we demonstrate the success of the Knorm correlation as a more accurate and reliable measure, and the adverse impact of experimental dependencies on the Pearson coefficient, in inferring functional gene relationships from interrelated and interdependent experiments.
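
The talk will give the exact form of the estimator; as a rough intuition for the de-correlation step, one can whiten two expression profiles with respect to an estimated experiment-level covariance before computing an ordinary correlation. The following is a minimal sketch under that assumption, not the authors' Knorm formula; Sigma_exp is a hypothetical covariance estimate.

    import numpy as np

    def decorrelated_correlation(x, y, Sigma_exp):
        """Correlation between two gene profiles after removing
        experiment-level dependencies (illustrative sketch only).

        x, y      : expression profiles over the same experiments
        Sigma_exp : covariance matrix capturing dependencies among
                    the experiments (assumed known here)
        """
        # Whiten across experiments: if Sigma_exp = L L^T, applying
        # inv(L) makes the experiment-level covariance the identity.
        L = np.linalg.cholesky(Sigma_exp)
        xw = np.linalg.solve(L, x)
        yw = np.linalg.solve(L, y)
        # Ordinary Pearson correlation on the whitened profiles.
        xw, yw = xw - xw.mean(), yw - yw.mean()
        return (xw @ yw) / np.sqrt((xw @ xw) * (yw @ yw))

When Sigma_exp is the identity, the whitening step does nothing and the measure reduces to the Pearson coefficient, consistent with the property stated above.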

Joint work with Siew-leng Melinda Teng (UCB) and X. Jasmine Zhou (USC).


Thursday, September 7th

Supervised detection of conserved motifs in DNA sequences with cosmo
Oliver Bembom
Graduate Group in Biostatistics, UC Berkeley

Identification of transcription factor binding sites is a major interest in contemporary biological research. A number of computational methods have been proposed to identify these regulatory motifs from a set of unaligned sequences that are thought to share the motif in question. Keles et al. (2003) introduced an algorithm called COMODE that allows this search to be supervised by specifying a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may be formulated, for example, on the basis of prior knowledge about the structure of the transcription factor in question.

This talk focuses on a number of methodological improvements and extensions to COMODE, relating mostly to the data-adaptive selection of various model parameters. Keles et al. propose to use likelihood-based cross-validation for this purpose, based primarily on certain finite-sample optimality results derived by van der Laan et al. (2003). I will present detailed simulation results that compare the performance of this approach to that of a number of other model selection techniques. Among the other techniques considered are model selection based on the E-value of the resulting multiple alignment, model selection by AIC or BIC, and cross-validation based on the Euclidean norm between two position weight matrices. The performance of these estimators is examined not only in the context of choosing the motif width and an appropriate constraint set, as proposed by Keles et al., but also in the context of choosing the appropriate model type (OOPS, ZOOPS, or TCM). Our simulation studies show that likelihood-based cross-validation is in fact outperformed in each of these model selection problems by a method that is more directly targeted at the parameter of interest.
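
As a schematic of what likelihood-based cross-validation looks like in this setting, consider the sketch below; fit and loglik are hypothetical stand-ins for the motif-model fitting and held-out likelihood routines, not the COMODE implementation.

    import random

    def cv_select_width(sequences, candidate_widths, fit, loglik, n_folds=5):
        """Choose a motif width by likelihood-based cross-validation.

        fit(train_seqs, w)  -> a fitted motif model of width w
        loglik(model, seqs) -> log-likelihood of held-out sequences
        """
        seqs = list(sequences)
        random.shuffle(seqs)
        folds = [seqs[k::n_folds] for k in range(n_folds)]
        best_w, best_score = None, float("-inf")
        for w in candidate_widths:
            score = 0.0
            for k in range(n_folds):
                # Train on all folds but k, score the held-out fold.
                train = [s for j, f in enumerate(folds) if j != k for s in f]
                score += loglik(fit(train, w), folds[k])
            if score > best_score:
                best_w, best_score = w, score
        return best_w

The same skeleton applies to selecting the constraint set or the model type: only the candidate list and the fitting routine change.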

I will also briefly describe a fast and scalable new implementation of the resulting algorithm, called cosmo, that is made available as a web application, an R package, and a stand-alone C program. More information on cosmo is available at cosmoweb.berkeley.edu.



Thursday, September 14th

A Statistical Learning Approach to Identifying Receptor-Like Kinases
Lee Chae
Department of Plant and Microbial Biology, UC Berkeley

Receptor-like kinase (RLK) proteins are emerging as an important class of cell-surface signaling molecules in plants. With more than 600 predicted members in Arabidopsis and sporting a wide range of sequence motifs, RLKs pose an interesting learning problem with regard to protein sequence classification. Here, we explore the use of a statistical learning algorithm known as a support vector machine (SVM) to classify members of the RLK family. We will discuss our initial implementation of an SVM classifier based on specific, biologically important sequence characteristics of RLKs, such as key catalytic, binding, and structural domains. Additionally, we will present results of cross-validation tests gauging the predictive ability of the classifier and compare its performance with that of a current best method, a hidden Markov model. Finally, we will present plans to expand the investigation both computationally and experimentally.
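
As a rough illustration of the overall approach, the sketch below trains and cross-validates an SVM on generic k-mer composition features using the scikit-learn library; it is not the domain-based feature set or classifier described in the talk.

    from itertools import product
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def kmer_features(seq, k=2):
        """Represent a protein sequence by its k-mer composition."""
        kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
        index = {km: i for i, km in enumerate(kmers)}
        x = np.zeros(len(kmers))
        for i in range(len(seq) - k + 1):
            km = seq[i:i + k]
            if km in index:
                x[index[km]] += 1
        return x / max(1, len(seq) - k + 1)  # length-normalized counts

    def evaluate(sequences, labels):
        """5-fold cross-validated accuracy of an SVM classifier.
        sequences: protein strings; labels: 1 = RLK, 0 = non-RLK."""
        X = np.array([kmer_features(s) for s in sequences])
        clf = SVC(kernel="rbf", C=1.0)
        return cross_val_score(clf, X, np.array(labels), cv=5).mean()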


Thursday, September 21st

Genome-wide evolutionary rates in lab and wild yeast
Professor Rachel Brem
Department of Molecular and Cell Biology, UC Berkeley

Wild organisms, when introduced into the laboratory, often undergo selection for easier growth and reproduction. This initial adaptation, and further genetic changes during propagation in the lab, may affect a wide range of genotypes and phenotypes, such that the biology of a lab organism may no longer reflect that of wild populations. This concern has been discussed in the recent literature for the model organism S. cerevisiae. We studied a common lab strain of yeast in the context of the two wild yeast isolates whose whole-genome sequences are currently available. We found that one of the two wild strains, an isolate from a California vineyard, exhibited a higher rate of protein evolution than the lab strain. Protein evolutionary rates along the recent lineages of closely related yeast strains were similar, consistent with a model in which no single strain has been through a uniquely severe bottleneck or regime of positive selection in the recent past. Our work provides preliminary evidence that the lab strain has not recently accumulated a load of deleterious, protein-coding mutations over and above what is observed in natural yeast habitats. This suggests that on the genomic scale, laboratory strains can be considered a reasonable model for wild yeast, a conclusion which has implications for the genetics of many domesticated and experimental organisms.


Thursday, September 28th

Package aroma.affymetrix - How to analyze huge Affymetrix data sets in R on a notebook
Dr. Henrik Bengtsson
Department of Statistics, UC Berkeley

Affymetrix provides high-density microarrays for analyzing gene expression, genotypes and copy numbers, DNA sequences, exon signals, and more, for a variety of organisms. The large variety of applications, together with decreasing prices, has made the Affymetrix platform more popular than ever before. The average number of arrays per experiment is also growing, from only a few arrays a few years ago to literally thousands of arrays as we speak.

In our research on copy-number analysis we work with Affymetrix Mapping arrays, also known as SNP chips. The current arrays interrogate 500,000 SNPs across the genome, with each SNP queried by 24-40 probes covering both alleles. With more than 130 MB of data per sample, it is clear that current solutions that hold all data in memory are not feasible.

I will present a solution, implemented in R, that guarantees analysis of Affymetrix data with bounded memory, so that disk space is the only limiting factor. This allows us to normalize and fit probe-level models (PLMs) to virtually any number of arrays on a basic notebook (< 1 GB RAM). It takes a modest computer 2-4 minutes per array to quantile normalize and fit PLMs to 250K SNPs.
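
The implementation details are part of the talk, but the bounded-memory idea, keeping the data on file and visiting one array at a time, can be sketched generically. The following is an illustration in Python rather than R, with a hypothetical flat-file layout; it is not the package's actual code.

    import numpy as np

    def quantile_normalize_one_by_one(array_files):
        """Quantile-normalize an arbitrary number of arrays while
        holding only a single array in memory at a time (sketch).

        Each file is assumed to hold one array's probe intensities
        as float32, all of equal length (a hypothetical layout, not
        the CEL format).
        """
        # Pass 1: build the target distribution as the mean of the
        # sorted intensities, streaming one array at a time.
        target = None
        for f in array_files:
            x = np.sort(np.fromfile(f, dtype=np.float32))
            target = x if target is None else target + x
        target /= len(array_files)

        # Pass 2: map each array's ranks onto the target
        # distribution and write the result straight back to disk.
        for f in array_files:
            x = np.fromfile(f, dtype=np.float32)
            ranks = np.argsort(np.argsort(x))
            target[ranks].astype(np.float32).tofile(f + ".norm")

Memory use is bounded by a single array regardless of how many arrays are processed, so disk space, not RAM, becomes the limiting factor.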



Thursday, October 5th

Data fusion for protein classification and function prediction with kernel methods
Guillaume Obozinski
Department of Statistics, UC Berkeley

In this talk I will present various aspects of two closely related projects. The first, the CAPER project, a collaboration with Steven Brenner's group, aims at automatically classifying proteins into SCOP superfamilies based on sequence and structure information; the second, the Kframe project, a collaboration with William Noble at the University of Washington, aims at predicting Gene Ontology terms based on various data types. One of the main difficulties in these classification problems is to correctly integrate heterogeneous sources of information, which correspond to different data types (strings, vectors, densities, phylogenies, ontologies) and whose relative importance and relevance are uncertain. Recent kernel methods, such as the Support Kernel Machine, allow us to tackle these difficulties. A second main difficulty is that these are large-scale multi-class or structured classification problems that cannot easily be solved as one block. Last but not least, in the context of biological prediction, where false positives can have significant consequences, correctly quantifying the confidence of sets of predictions is challenging but sorely needed.
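
As a schematic of kernel-based data fusion, the sketch below combines heterogeneous sources through a fixed-weight sum of kernel matrices; the Support Kernel Machine instead learns the weights jointly with the classifier, so this is only a minimal illustration.

    import numpy as np
    from sklearn.svm import SVC

    def fuse_kernels(kernels, weights):
        """Combine heterogeneous data sources via a weighted sum of
        their kernel (similarity) matrices. Each matrix is n x n and
        positive semidefinite, so the weighted sum (with nonnegative
        weights) is itself a valid kernel.
        """
        return sum(w * K for w, K in zip(weights, kernels))

    # A standard SVM can then be trained on the fused kernel, e.g.:
    #   K = fuse_kernels([K_sequence, K_structure], [0.7, 0.3])
    #   clf = SVC(kernel="precomputed").fit(K, labels)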


Thursday, October 12th

Clustering functionally similar proteins using the Dirichlet process
Dr. Duncan Brown
Sirna Therapeutics

Automatic clustering of protein sequences is an important problem in computational biology. The recent explosion in genome sequences has given biological researchers a vast number of novel protein sequences. However, the majority of these sequences have no experimental evidence for their molecular function in the cell, and the responsibility for correctly annotating these sequences falls upon the bioinformatics community. Ideally, we would prefer to group sequences of similar or identical molecular function in an automatic fashion, without relying on experimental evidence. I present here a novel probabilistic model for performing this task. The model uses Dirichlet mixture densities to model amino acid preferences within each cluster, and places a Dirichlet process prior on the overall set of clusters. The Dirichlet process prior provides a statistical framework for clustering sequences without requiring prior knowledge of the number of clusters, and allows relatively easy sampling from the model posterior via MCMC. I will present results for protein families for which the functional clustering is known in advance, which suggest that the model accurately partitions the data into functional subgroups.
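
The Chinese restaurant process representation of the Dirichlet process makes the basic Gibbs sampling step easy to state. The sketch below uses a generic predictive likelihood as a stand-in for the talk's Dirichlet-mixture amino-acid model; it is a schematic, not the presented sampler.

    import math, random

    def crp_gibbs_sweep(data, z, alpha, marginal_loglik):
        """One Gibbs sweep over cluster assignments under a Dirichlet
        process prior, via its Chinese restaurant process form.

        marginal_loglik(point, members) -> log p(point | members),
        a generic stand-in for the cluster predictive likelihood.
        `z` is the current list of integer cluster labels.
        """
        for i in range(len(data)):
            z[i] = None  # remove point i from its cluster
            clusters = {}
            for j, c in enumerate(z):
                if c is not None:
                    clusters.setdefault(c, []).append(j)
            # CRP prior: an existing cluster is chosen in proportion
            # to its size, a brand-new cluster in proportion to
            # alpha; each weighted by the predictive likelihood.
            labels, logw = [], []
            for c, members in clusters.items():
                labels.append(c)
                logw.append(math.log(len(members)) +
                            marginal_loglik(data[i],
                                            [data[j] for j in members]))
            labels.append(max(clusters, default=-1) + 1)
            logw.append(math.log(alpha) + marginal_loglik(data[i], []))
            # Sample a label from the normalized weights.
            m = max(logw)
            w = [math.exp(v - m) for v in logw]
            r = random.random() * sum(w)
            acc = 0.0
            for c, wi in zip(labels, w):
                acc += wi
                if r <= acc:
                    z[i] = c
                    break
        return z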


Thursday, October 19th

Genotyping using Affymetrix SNP arrays
Dr. Yuanyuan Xiao
Department of Epidemiology and Biostatistics, UC San Francisco

Modern strategies for mapping disease loci require efficient genotyping of a large number of known polymorphic sites in the genome. The sensitive and high-throughput nature of hybridization-based DNA microarray technology provides an ideal platform for such an application by interrogating up to hundreds of thousands of single nucleotide polymorphisms (SNPs) in a single assay. As with the development of expression arrays, these genotyping arrays pose many data analytic challenges that are often platform specific. Affymetrix SNP arrays, for example, use multiple sets of short oligonucleotide probes for each known SNP, and require effective statistical methods to combine these probe intensities in order to generate reliable and accurate genotype calls.

In this talk, I will introduce a new algorithm (MAMS) we have developed, which combines single-array multi-SNP and multi-array single-SNP calls to improve the accuracy of genotype calls, without the need for training data or the computationally intensive normalization procedures of other multi-array methods. Using a set of publicly available HapMap arrays/samples with known genotypes (from other genotyping technologies) as benchmarks, we illustrate the performance of MAMS in comparison with existing genotyping algorithms. If time permits, I will also present exploratory findings regarding hybridization properties of the 500K SNP arrays and their implications for copy number analysis.
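
The decision rules of MAMS are the subject of the talk; as a generic illustration of the underlying calling step, one can summarize a SNP's allele-A and allele-B probe intensities and threshold their contrast. The summaries and cutoffs below are arbitrary placeholders, not the MAMS rule.

    import numpy as np

    def call_genotype(probes_a, probes_b, lo=-0.35, hi=0.35):
        """Call one SNP's genotype from its allele-specific probe
        intensities (generic sketch; not the MAMS decision rule).
        """
        a = np.median(probes_a)       # robust summary, allele-A probes
        b = np.median(probes_b)       # robust summary, allele-B probes
        contrast = (a - b) / (a + b)  # in [-1, 1]; near 0 => heterozygote
        if contrast > hi:
            return "AA"
        if contrast < lo:
            return "BB"
        return "AB"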

Joint work with Ru-Fang Yeh, Mark Segal and Jean Yang.


Thursday, October 26th

What's wrong with computational gene prediction?
Professor Ian Korf
UC Davis Genome Center

Genome sequencing technology continues to advance at an impressive rate. This is leading to an explosion of genomic data from phylogenetically diverse organisms. One of the first questions asked of a new genome is "how many genes does it contain?" and this is usually followed by "how many genes are unique to the species?". These questions require gene finding methods that can accurately define a complete catalog of genes. Since experimental methods can be costly, time-consuming, and restrictive, computational methods are quite useful for deriving these catalogs. The field of computational gene finding is approximately 20 years old. A pessimistic and unfortunately realistic view is that gene finding programs are not very accurate and have not advanced much in the last 10 years. I will describe some of the inherent problems in gene prediction and our work to address some of the more difficult issues.


Thursday, November 2nd

Systematic Approaches for Characterizing Cancer
Dr. Paul Spellman
Lawrence Berkeley National Laboratory

The understanding of biological processes as a set of interconnected events has always been dependent on the availability of tools to interrogate the mechanisms of the system. We are applying newly developed tools to understand the commonalities and differences among individual cancers by identifying molecular and systems-based properties of cancer cell lines and tumors. Our effort employs three distinct approaches to this end: 1) genomic characterization using allele-specific copy number measurements, 2) exon-specific transcriptional profiling, and 3) modeling of signaling systems. I will discuss each of these efforts and their future applications.


Thursday, November 9th

A Discrete Approach for Modeling Population Substructure
Dr. Eran Halperin
International Computer Science Institute, Berkeley

Whole-genome disease association studies are becoming common practice in the search for the etiology of complex disease. Such studies involve genotyping hundreds of thousands of SNP markers for thousands of individuals. One of the main obstacles in performing such studies is that underlying population substructure can artificially inflate the p-values, thereby generating many false positives. Although existing tools cope well with very distinct sub-populations, closely related population groups remain a major cause of concern.

In this talk, I will present a discrete-algorithms approach to detecting population substructure. The algorithm is based on a dissimilarity measure between individuals whose asymptotic behavior we have studied. In particular, we show rigorously that the algorithm converges rapidly to the correct classification as the genetic distance between the sub-populations increases. I will describe some theoretical aspects of this measure, as well as empirical results on population substructure when the method is applied to HapMap data and to simulated data. We compared our method to two state-of-the-art methods (STRUCTURE and EIGENSTRAT). Our results indicate that the suggested algorithm is very efficient and more accurate than the other two.
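
The talk defines its own dissimilarity measure; as a generic example of the same ingredient, a pairwise genotype dissimilarity that a clustering step can operate on, consider the following sketch (illustrative only, not the proposed measure).

    import numpy as np

    def allele_sharing_dissimilarity(g1, g2):
        """Mean absolute difference in allele counts across SNPs,
        with each genotype coded as 0, 1, or 2 copies of the minor
        allele; scaled into [0, 1]. (A generic measure, not the one
        proposed in the talk.)
        """
        g1, g2 = np.asarray(g1), np.asarray(g2)
        return np.abs(g1 - g2).mean() / 2.0

    def dissimilarity_matrix(G):
        """Pairwise dissimilarities for an individuals x SNPs matrix G."""
        n = len(G)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = allele_sharing_dissimilarity(G[i], G[j])
        return D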

This is joint work with Kamalika Chaudhuri, Satish Rao, Shuheng Zhou and Srinath Sridhar.


Thursday, November 16th

Disease-Specific Genomic Analysis: Identifying the Signature of Pathologic Biology
Dr. Monica Nicolau
Department of Mathematics, Stanford University

The genomic era has brought about profound changes in the study of genetic mechanisms, with an infusion of mathematical tools to aid both traditional and novel biological techniques. High-dimensional data such as microarray expression, SNP, array CGH, and proteomic data have been used to study a wide range of problems aimed at achieving a deeper and more global understanding of disease. Identifying the expression patterns relevant to the biological problem under study can, however, be a difficult task. Tests for statistical significance must always make tacit assumptions about the underlying biology, and different tests will highlight distinct aspects of this biology.

I will introduce a method for the analysis of pathologic biology that unravels the disease characteristics of high-dimensional data. The method, Disease-Specific Genomic Analysis (DSGA), is intended to precede standard techniques like clustering or class prediction, and to enhance their performance and ability to detect disease. DSGA measures the extent to which the disease deviates from a continuous range of normal phenotypes, and isolates the aberrant component of the data. In several microarray cancer datasets, I will show that DSGA outperforms standard methods. I will then discuss novel results in breast cancer, highlighted by the use of DSGA. Although these examples focus on microarrays, DSGA generalizes to any high-dimensional genomic/proteomic data.
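
The core decomposition can be sketched with ordinary linear algebra: project a tumor profile onto the span of normal profiles and keep the residual. This is a simplified illustration of the idea; DSGA's actual construction of the normal space is more refined.

    import numpy as np

    def disease_component(tumor, normals):
        """Split a tumor expression profile into the part explained
        by normal profiles and the aberrant residual (sketch).

        tumor   : vector of length n_genes
        normals : n_genes x n_normals matrix of normal profiles
        """
        # Least-squares projection of the tumor profile onto the
        # linear span of the normals ("closest normal phenotype").
        coef, *_ = np.linalg.lstsq(normals, tumor, rcond=None)
        normal_part = normals @ coef
        return tumor - normal_part  # the disease-specific component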

This is joint work with Robert Tibshirani, Anne-Lise Borresen-Dale, and Stefanie Jeffrey.


Thursday, November 30th

Integration of diverse data types: an illustration with genomic data on breast cancer
Dr. Darlene Goldstein
Institut de Mathématiques, École Polytechnique Fédérale de Lausanne (EPFL)

Meta-analytic methods have been applied in the microarray context for combining study results. Such analyses typically combine results for the same data type (e.g. gene expression data), but perhaps generated by different technologies (e.g. single-channel or dual-channel microarrays). Microarray platform comparison studies have shown that combining raw data instead of results is much more problematic.

Diverse information types from different studies may be combined on a number of levels: raw or adjusted data, parameter estimates, test statistics, p-values, or decisions. As no method is universally optimal, the choice of level and technique depends on the available data and the study goals. One overall statistic which detects similar deviations from the null across studies is the (possibly weighted) combined Z-score. This statistic provides a reasonably flexible means of combining results, as it does not require data of similar types or even common questions across studies, yet unlike combined log p-values it accumulates positive and negative evidence.
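
Concretely, with study-level statistics Z_1, ..., Z_k and nonnegative weights w_i, the weighted combined Z-score is Z = (sum_i w_i Z_i) / sqrt(sum_i w_i^2), which is again standard normal under the joint null. A minimal sketch:

    import math

    def combined_z(z_scores, weights=None):
        """Weighted Stouffer combined Z-score: standard normal under
        the joint null, and sensitive to the direction of each
        study's evidence (unlike combined log p-values).
        """
        if weights is None:
            weights = [1.0] * len(z_scores)
        num = sum(w * z for w, z in zip(weights, z_scores))
        den = math.sqrt(sum(w * w for w in weights))
        return num / den

    # Example: three studies with consistent positive evidence
    # combined_z([1.8, 2.1, 1.5]) ~= 3.12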

I will compare the combined Z-score with other methods of combination and illustrate its use on a set of microarray studies of breast cancer.

This is joint work with Pratyaksha Wirapati and Mauro Delorenzi of the Swiss Institute for Experimental Cancer Research and the Swiss Institute of Bioinformatics.