PH 296, Section 33

Statistics and Genomics Seminar

Fall 2003



Thursday, September 11th

Understanding Array CGH Data
Dr. Jane Fridlyand
UCSF Comprehensive Cancer Center

The development of solid tumors is associated with the acquisition of complex genetic alterations, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor evolution. Thus, one expects that the particular types of genomic derangement seen in tumors reflect underlying failures in the maintenance of genetic stability, as well as selection for changes that provide a growth advantage. To investigate genomic alterations we are using microarray-based comparative genomic hybridization (array CGH). The computational task is to map and characterize the number and types of copy number alterations present in the tumors, thereby defining copy number phenotypes and associating them with known biological markers.

To exploit the spatial coherence between nearby clones, we use an unsupervised Hidden Markov Model approach. The clones are partitioned into states that represent the underlying copy number of each group of clones. The method is demonstrated on two cell line datasets, for one of which the copy number alterations are known. The biological conclusions drawn from the analyses are discussed.
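As a rough illustration of the idea, the sketch below decodes a toy sequence of log2 copy-number ratios into loss/normal/gain states with a three-state Gaussian HMM and the Viterbi algorithm. The state means, variance, and transition probabilities are illustrative assumptions, not the values or the implementation from the talk:

```python
import numpy as np

# Toy HMM segmentation of log2 ratios into three copy-number states
# (0 = loss, 1 = normal, 2 = gain). All parameter values are assumed
# for illustration only.
MEANS = np.array([-0.5, 0.0, 0.5])    # emission mean per state
SD = 0.15                             # common emission s.d.
LOG_TRANS = np.log(np.array([
    [0.98, 0.01, 0.01],               # sticky transitions encode the
    [0.01, 0.98, 0.01],               # spatial coherence of nearby clones
    [0.01, 0.01, 0.98],
]))
LOG_INIT = np.log(np.array([1 / 3, 1 / 3, 1 / 3]))

def log_emission(x):
    """Gaussian log-density of one observation under each state."""
    return -0.5 * ((x - MEANS) / SD) ** 2 - np.log(SD * np.sqrt(2 * np.pi))

def viterbi(obs):
    """Most likely state sequence for clones ordered along a chromosome."""
    n, k = len(obs), len(MEANS)
    score = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    score[0] = LOG_INIT + log_emission(obs[0])
    for t in range(1, n):
        cand = score[t - 1][:, None] + LOG_TRANS     # k x k candidates
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emission(obs[t])
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# a normal region followed by a gained region
ratios = np.array([0.02, -0.03, 0.01, 0.48, 0.52, 0.47, 0.50])
states = viterbi(ratios)
```

The sticky transition matrix is what makes the decoded states change only when several consecutive clones support a new copy number, rather than flipping on single noisy observations.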

Joint work with A. Snijders, D. Pinkel, D. G. Albertson, and A. N. Jain.


Thursday, September 18th

Poisson-based clustering methods for studying SAGE data
Professor Haiyan Huang
Department of Statistics, UC Berkeley

Serial Analysis of Gene Expression (SAGE) is one of the most widely used techniques for comprehensive analysis of gene expression. SAGE has several advantages: (1) it requires no prior knowledge of the sequences of the transcripts; (2) it provides absolute transcript numbers (by counting the expressed transcripts), thus allowing quantitative analysis of gene expression.

Since its introduction, SAGE has been instrumental in many important discoveries. However, one important feature of SAGE data, namely that the measurements are counts, has been neglected by previous analysis methods. In this talk, I will present a Poisson-based clustering method that explicitly models and exploits the count nature of the data. The method is shown to be advantageous in analyzing data from developing and mature mouse retina.
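To convey the flavor of count-aware clustering, the sketch below groups tag-count profiles with a k-means-style loop that measures fit by Poisson deviance instead of Euclidean distance. This is only an illustration of the general idea under assumed toy data, not the published algorithm:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance between observed counts y and expected counts mu;
    smaller means the profile is better explained by mu."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

def cluster_counts(X, init_idx, n_iter=10):
    """k-means-style clustering of count profiles under Poisson deviance.
    init_idx picks the starting centers (fixed here for reproducibility)."""
    X = np.asarray(X, float)
    centers = X[list(init_idx)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.array([
            np.argmin([poisson_deviance(x, c + 1e-9) for c in centers])
            for x in X
        ])
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# toy tag counts for 4 tags across 3 SAGE libraries: two obvious groups
X = [[50, 5, 5], [45, 6, 4], [5, 40, 55], [6, 38, 60]]
labels, _ = cluster_counts(X, init_idx=[0, 2])
```

Because the deviance weights deviations by their Poisson variance, a difference of 5 counts matters much more for a rare tag than for an abundant one, which is exactly the property lost when counts are treated as continuous measurements.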


Thursday, September 25th

Data Transformations for Gene-Expression Microarrays
Dr. Blythe Durbin
Division of Biostatistics, UC Berkeley

Data from gene-expression microarrays have proven difficult to analyze, in part because the data fail to satisfy the assumptions on which many standard statistical techniques are based. The right data transformation, however, can bring the data more closely in line with assumptions such as normality of errors and constancy of variance, which can greatly simplify downstream analysis.

We present a family of data transformations, the generalized-log family, that stabilizes the variance of microarray data across the full range of expression values. We present Box-Cox-like maximum-likelihood and robust methods for estimating the transformation parameter. We introduce a two-parameter generalized-log transformation, which extends the generalized-log family via a shift constant. Finally, we present a simulation study that suggests that generalized-log-transformed microarray data follow a distribution consisting of a mixture of normal distributions, rather than a heavy-tailed distribution. Based on this result, we propose that robust estimation of the transformation parameter may not be necessary.
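A minimal sketch of the transformation itself, with an illustrative (not estimated) transformation parameter and shift, is:

```python
import math

def glog(y, lam, alpha=0.0):
    """Generalized-log transform with a shift constant:
    glog(y) = ln(z + sqrt(z^2 + lam)), where z = y - alpha.
    For lam = 0 this is ln(2z), i.e. an ordinary log up to an additive
    constant; for lam > 0 it is nearly linear near zero, which is what
    stabilizes the variance at low intensities. Parameter values here
    are illustrative assumptions, not maximum-likelihood estimates."""
    z = y - alpha
    return math.log(z + math.sqrt(z * z + lam))

# at high intensities the transform behaves like an ordinary log:
# a two-fold change maps to a difference of about ln(2)
hi = glog(10000.0, lam=1000.0) - glog(5000.0, lam=1000.0)

# at low intensities the transform is nearly linear, so the same
# two-fold change is compressed rather than exaggerated by noise
lo = glog(50.0, lam=1000.0) - glog(25.0, lam=1000.0)
```

In practice the transformation parameter (and, for the two-parameter version, the shift) would be estimated from the data, e.g. by the maximum-likelihood or robust methods the talk describes.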


Thursday, October 2nd

New insights on nuclear architecture and DNA repair gained from quantitative analysis of mFISH data
Dr. Javier Arsuaga
UCSF Comprehensive Cancer Center

Chromosome aberrations are illegitimate rearrangements of the genome involving large (>1 Mb) DNA fragments and occurring during the early part of the cell cycle. Such large-scale structural changes are frequently associated with genetic diseases or cancer and are the signature of DNA damaging agents such as ionizing radiation. Many chromosome aberrations can be detected by combinatorial painting assays such as multiplex fluorescence in-situ hybridization (mFISH) or spectral karyotyping (SKY).

In this talk I will present two examples of how computational approaches can help obtain biological information from distributions of radiation-induced chromosome aberrations. In the first example, our group analyzed chromosome aberrations to characterize any systematic spatial clustering of chromosomes. We found two clusters of chromosomes, {1, 16, 19, 21, 22} and {13, 14, 15, 21, 22}, whose members appear on average closer to each other than would be expected by chance. In the second example, we analyzed a similar data set using CAS (chromosome aberration simulator) and the graph-theoretical concept of cycles to investigate chromosome aberration formation pathways. We concluded that cycles can distinguish between different mechanisms of chromosome aberration production and that the breakage-and-reunion pathway dominates homology-based mechanisms of aberration formation.


Thursday, October 9th

Assessing gene expression data quality for Affymetrix GeneChip microarrays
Francois Collin
GeneLogic

Sample preparation protocols include numerous qualitative assessments used to ensure that good-quality RNA is used in microarray hybridization experiments. The chip manufacturer also recommends many post-hybridization assessments to verify that the data produced from chips are reliable. These include general image quality assessment and some analysis of intensity measures from specialized probes. The recommended assessments are primarily qualitative and intended to ensure that chips analyzed together are comparable. It is not clear a priori how the recommended assessments of chip quality relate to the expression values produced by the chips.

Exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip motivated a novel measure of gene expression based on a robust, multichip analysis (RMA) of probe intensity data. The RMA gene expression measures are obtained as estimated parameters in simple additive models fitted robustly to probe set intensity data on the log scale. In this talk I will review the RMA model and show how the fitted models can be used to produce chip quality indices that relate directly to the quality of gene expression measures produced by the chips. Through analysis of public data sets, I will show how these new measures are related to existing ones. I will discuss some analyses that have potential for diagnosing causes of departures from quality standards. If time permits, I will discuss assessment of experiment quality and chip data comparability.
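The summarization step of RMA fits a simple additive model to the log-scale probe intensities; one standard robust fitting method for such a model is Tukey's median polish. The sketch below shows that step only, on assumed toy data; the full RMA pipeline also involves background correction and quantile normalization, which are omitted here:

```python
import numpy as np

def median_polish(x, n_iter=10):
    """Fit x[i, j] ~ overall + row[i] + col[j] by Tukey's median polish.
    Rows are probes, columns are chips; the column effects play the role
    of the chip-level (expression) estimates. A sketch of the RMA
    summarization idea, not the full RMA implementation."""
    x = np.asarray(x, float).copy()
    overall = 0.0
    row = np.zeros(x.shape[0])
    col = np.zeros(x.shape[1])
    for _ in range(n_iter):
        rmed = np.median(x, axis=1)          # sweep out row medians
        x -= rmed[:, None]; row += rmed
        m = np.median(row); row -= m; overall += m
        cmed = np.median(x, axis=0)          # sweep out column medians
        x -= cmed[None, :]; col += cmed
        m = np.median(col); col -= m; overall += m
    return overall, row, col, x              # x holds the residuals

# toy probe set: 3 probes x 2 chips; chip 2 is expressed two-fold
# (one log2 unit) higher than chip 1 for every probe
pm = np.log2([[100, 200], [400, 800], [50, 100]])
overall, probe_eff, chip_eff, resid = median_polish(pm)
```

The residual matrix left over after the fit is what makes quality assessment possible: probes or chips whose residuals are unusually large flag departures from the additive model, which is the basis of the chip quality indices discussed in the talk.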



Thursday, October 23rd

Analysing Agilent arrays for Gene Expression in T-Cells
Professor Susan Holmes
Department of Statistics, Stanford University

We studied differences in gene expression in melanoma patients' T-cells. The T-cells were subdivided into three homogeneous populations. We used variance-stabilizing transformations to detect subtle expression differences between the subsets, and found that the residual distributions can be modelled quite effectively using a distribution that rarely arises in other statistical applications. This provides a convenient framework for parametric bootstrap simulation studies used to calibrate the multiple tests in this example.

Joint work with:
Peter Lee, Hematology, Stanford,
Elizabeth Purdom, Statistics, Stanford.


Thursday, November 6th

Bayesian analysis of gene expression levels for spotted DNA microarrays: Quantification of relative mRNA level across multiple strains or treatments
Dr. Jeffrey P. Townsend
Plant and Microbial Biology, UC Berkeley

Methods of microarray analysis that suit the experimentalists using the technology are vital. Many methodologies cannot be applied flexibly to multifactorial experimental designs. Here, I present a flexible, quantitative Bayesian framework based on a ratio-of-normals distribution. This framework can be used to analyze normalized microarray data acquired by any replicated experimental design in which any number of treatments, genotypes, or developmental states are studied using a continuous chain of comparisons.

I apply this method to Saccharomyces cerevisiae microarray data sets on the transcriptional response to ethanol shock, to SNF2 and SWI1 deletion in rich and minimal media, and to wild-type and zap1 expression in media with high, medium, and low levels of zinc. The method is highly robust to missing data and yields estimates of the magnitude of expression differences and of experimental error variances on a per-gene basis. It reveals genes of interest that are differentially expressed below the two-fold level, genes with high "fold change" that are not statistically significantly different, and genes differentially regulated in quantitatively unanticipated ways.

The detection of small yet statistically significant differences in gene expression between samples in a spotted DNA microarray experiment is an ongoing challenge. This challenge requires careful examination of the performance of statistical models that may be employed to analyze the data, as well as an understanding of the effect of replication on the power to resolve these differences. Models incorporating additive and multiplicative small error terms, and error standard deviations that are proportional to expression level are derived. The power of these models to detect differences in gene expression is compared. The most powerful method, which uses additive small error terms and error standard deviations proportional to expression level, is also found to be the fastest method.
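The behavior that motivates these error models can be seen in a short simulation from a two-component model in which measurement error has both a multiplicative and an additive part. The parameter values below are illustrative assumptions, not estimates from any of the data sets discussed:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(mu, s_eta=0.2, s_eps=20.0, n=20000):
    """Two-component error model for array intensities:
        y = mu * exp(eta) + eps,  eta ~ N(0, s_eta^2),  eps ~ N(0, s_eps^2).
    At high expression mu, the multiplicative term dominates, so the s.d.
    is roughly proportional to mu; at low mu, the additive term dominates
    and the s.d. is roughly constant. Parameter values are illustrative."""
    eta = rng.normal(0.0, s_eta, n)
    eps = rng.normal(0.0, s_eps, n)
    return mu * np.exp(eta) + eps

sd_low = simulate(10.0).std()      # dominated by additive noise (~s_eps)
sd_high = simulate(10000.0).std()  # dominated by multiplicative noise
```

A model with error standard deviation proportional to expression level captures the high-intensity regime well, which is consistent with such a model being both powerful and fast in the comparison described above.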


Thursday, November 13th

Searching for genetic factors in complex traits: Multiple sclerosis as an example
Professor Lisa Barcellos
Division of Epidemiology, UC Berkeley

Multiple sclerosis (MS) is a common inflammatory disorder of the central nervous system characterized by myelin loss, gliosis, varying degrees of axonal pathology, and progressive neurological dysfunction. It is the most common cause of acquired neurological dysfunction arising during early and mid-adulthood, and affects more than one million people worldwide. Like many other common disorders, the etiology of MS has a strong and complex genetic component. The hereditary tendency of this disease is indicated both by an increased relative risk in siblings compared with the general population and by an increased concordance rate in monozygotic compared with dizygotic twins. The strongest and most consistent evidence for a susceptibility gene in MS is for the major histocompatibility complex (MHC) on chromosome 6p21.3. Although the MHC region contributes significantly to MS risk, much of the genetic effect in MS remains to be explained. In this presentation, methodology for the identification of genetic factors in MS and other complex diseases will be described, including the use of both linkage and association strategies. In addition, the advantages of a haplotype map-based approach will be discussed.


Thursday, November 20th

Kernel methods, graphical models and bioinformatics
Professor Michael I. Jordan
Department of Statistics and Computer Science Division, UC Berkeley

A pervasive problem in bioinformatics is that of integrating data from multiple, heterogeneous sources. I discuss two general statistical methodologies for approaching this problem. The first is kernel methods, where recent developments based on semidefinite programming provide a general framework for combining multiple kernels. The second is probabilistic graphical models, a formalism that exploits the conjoined talents of graph theory and probability theory to build complex models out of simpler pieces. I illustrate these methods with applications to the problem of annotating proteins for function and cell localization, the problem of gene finding using data from multiple species, and the problem of haplotype phasing.
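The algebraic fact underlying kernel combination is that any nonnegative combination of kernel (positive semidefinite) matrices is again a valid kernel, which is what makes it meaningful to search over combination weights. The sketch below checks this numerically with random Gram matrices and fixed illustrative weights; in the work discussed, the weights are instead optimized via semidefinite programming:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_kernel(n, d):
    """A random linear-kernel Gram matrix: X @ X.T is always PSD."""
    X = rng.normal(size=(n, d))
    return X @ X.T

# stand-ins for, e.g., sequence, expression, and interaction-network
# kernels evaluated on the same 6 proteins (purely synthetic here)
K1, K2, K3 = (random_kernel(6, 4) for _ in range(3))

mu = [0.5, 0.3, 0.2]                         # assumed, not optimized, weights
K = mu[0] * K1 + mu[1] * K2 + mu[2] * K3     # combined kernel

min_eig = np.linalg.eigvalsh(K).min()        # >= 0 up to rounding error
```

Each individual kernel encodes similarity from one heterogeneous data source; the combined matrix K fuses them into a single similarity measure usable by any kernel method.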

(with Nello Cristianini, Gert Lanckriet, Jon McAuliffe, William Noble, Lior Pachter, Roded Sharan and Eric Xing).


Thursday, December 4th

Into the woods: discovering genotype-phenotype relationships in modest sized data sets
Dr. David O. Nelson
Biotechnology Research Program, Lawrence Livermore National Laboratory (LLNL)

Data sets with many more variables than experimental units are becoming quite common in the post-genomic era, with expression array data being the most analyzed current example. Today I'm going to talk about another example with a great future: data sets that contain a large amount of genotype and phenotype information on a relatively modest number of subjects.

LLNL has developed a data set that enumerates the variation in DNA repair genes in a collection of some 80 human cell lines, and associates each cell line with a set of related phenotypes that attempt to provide an integrated, end-to-end measure of DNA repair capacity in response to ionizing radiation. This data set is being used to try to understand how (if?) large numbers of not-so-common SNPs combine to affect the average person's DNA repair capacity. The eventual goal is to develop a simple scoring system that associates genotypes with diminished DNA repair capacity and, ultimately, with increased risk for clinical outcomes of interest.

We are exploring how well tree-based approaches, such as random forests and boosting, perform in this context. We are especially interested in how they perform relative to currently published approaches for analyzing this kind of data, which fall into two basic camps: linear regression coupled with classical variable selection techniques, and complete combinatorial enumeration coupled with some attempt at cross-validation.

In this talk, I will describe the data set and report on progress to date.


Thursday, December 11th

Supervised Grouping of Genes
Professor Peter Bühlmann
Seminar for Statistics, ETH Zurich

A challenging task with expression measurements from thousands of genes is to reveal groups of genes whose collective expression is strongly associated with an outcome (response) variable of interest, such as an observed tumor type. Unlike unsupervised clustering algorithms, we construct groups of genes by directly incorporating the observed response variable of interest. This is done within the framework of PEnalized LOgistic Regression Analysis, and thus we refer to our method as PELORA. We demonstrate empirically that PELORA identifies relatively few groups of genes whose expression centroids have good predictive potential (often superior to state-of-the-art classification methods based on single genes) while still yielding a drastic reduction in dimensionality. Moreover, PELORA can be used in conjunction with additional clinical variables to build a model yielding competitive predictions for medical prognosis. Statistical inference in such a model is carried out by leave-one-out bootstrap or cross-validation methods.
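The core building block, classifying samples from the centroid (average expression) of a gene group via penalized logistic regression, can be sketched as follows. This toy uses synthetic data and plain gradient descent with an L2 penalty; it is a stand-in for the idea only, not the published PELORA algorithm, which also searches over group memberships:

```python
import numpy as np

def fit_penalized_logistic(x, y, lam=0.1, lr=0.5, n_iter=500):
    """L2-penalized logistic regression with one feature plus intercept,
    fit by gradient descent. A toy stand-in for the penalized fitting
    inside a method like PELORA, not the published algorithm."""
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))      # predicted probabilities
        gw = np.mean((p - y) * x) + lam * w          # penalized gradient
        gb = np.mean(p - y)
        w -= lr * gw
        b -= lr * gb
    return w, b

# synthetic data: expression of a 3-gene group on 6 samples, two classes;
# the group centroid (mean over the group's genes) separates the classes
expr = np.array([
    [1.0, 0.9, 1.1],     # class 0 samples
    [0.8, 1.0, 0.9],
    [1.1, 1.2, 0.8],
    [2.0, 2.1, 1.9],     # class 1 samples
    [1.8, 2.2, 2.0],
    [2.1, 1.9, 2.2],
])
y = np.array([0, 0, 0, 1, 1, 1])

centroid = expr.mean(axis=1)                         # one value per sample
w, b = fit_penalized_logistic(centroid, y)
pred = (1.0 / (1.0 + np.exp(-(w * centroid + b))) > 0.5).astype(int)
```

Replacing thousands of individual gene predictors with a handful of group centroids is what delivers the drastic dimensionality reduction, while the penalty keeps the fitted model from overfitting the small sample sizes typical of expression studies.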