PH 292, Section 013
Statistics and Genomics
Seminar
Fall 2005
Thursday, September 8th
Monitoring the level of alternatively spliced
mRNA through genome-wide microarrays
Dr. Marco Blanchette
Department of Molecular and Cell
Biology, UC Berkeley
Higher eukaryotes exploit alternative pre-mRNA splicing to diversify
their proteome, and to regulate gene expression with developmental
stage- and tissue-specificity. Alternative splicing is prevalent: for
example, the human genome contains 22,000 to 25,000 genes (less than
twice the number of genes found in Drosophila melanogaster),
yet more than 60% of these genes are alternatively spliced. A striking
example of the coding potential generated by pre-mRNA alternative
splicing is found in the Drosophila Dscam gene which, through
alternative splicing, has the potential to produce more than 33,000
different proteins. In order to get a better understanding of how
alternative splicing regulates gene expression in Drosophila, we have
developed a microarray platform aimed at monitoring changes in
the level of alternatively spliced mRNAs. In addition to providing a
measurement of variation in gene expression, this platform enables us
to identify changes in the abundance of alternatively spliced transcripts. A
description of the platform used, our computational approach, as well
as the different experiments performed using this platform will be
presented.
Thursday, September 15th
A Computational Framework for Conditional
Inference with an Application to Unbiased Recursive Partitioning
Professor Torsten Hothorn
Institut fuer Medizininformatik,
Biometrie und Epidemiologie
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
The pioneering work
of R. A. Fisher, E. J. G. Pitman and B. L. Welch on randomization tests
published in the 1930s did not find its way into statistical practice
for a long time. The conceptually simple principle of conditioning on
all permutations of the data is helpful to address a huge class of
independence problems. Given that powerful and flexible software
implementations are available, we argue that a fresh look at permutation
tests is fruitful.
Based on the theoretical framework of permutation tests published by
Strasser & Weber (1999), we propose a unified computational
framework for conditional inference. Applications include tests on
independence between two variables measured at arbitrary scales as well
as multiple testing procedures. Based on this framework it is easy to
implement conditional versions of well-known procedures like linear
rank tests, Cochran-Mantel-Haenszel tests or linear association tests
and less well-known methods like maximally selected two-sample
statistics. Much more interesting is the fact that new strategies for
assessing independence can be implemented and evaluated on the fly.
To illustrate the flexibility of both the theoretical and
computational components, permutation tests are applied to remove the
variable selection bias from recursive partitioning procedures. We show
how tree-structured regression models can be embedded into a
statistical framework, i.e., with control of well-defined errors.
Moreover, we suggest an internal stopping criterion for trees based on
multiple testing procedures applied to the observations in each node of
a tree. Benchmark experiments show that statistical internal stopping
performs at least as well as the conventional post-pruning approach.
Joint work with Kurt Hornik and Achim Zeileis, Wirtschaftsuniversitaet
Wien.
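The principle of conditioning on all permutations of the data can be conveyed with a small Monte Carlo sketch. This is an illustration only, not the speakers' implementation (their framework is available as R software); the data, sample sizes, effect size, and random seed below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-sample data, simulated for illustration
x = rng.normal(0.0, 1.0, size=20)
y = rng.normal(0.8, 1.0, size=20)

# Test statistic: difference in sample means
observed = x.mean() - y.mean()

# Condition on the pooled data: under the null of independence between
# response and group label, every relabeling of the 40 observations
# into two groups of 20 is equally likely.
pooled = np.concatenate([x, y])
n = len(x)
n_perm = 10_000
exceed = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:n].mean() - perm[n:].mean()
    if abs(diff) >= abs(observed):
        exceed += 1

# Monte Carlo p-value (add-one correction keeps it strictly positive)
p_value = (exceed + 1) / (n_perm + 1)
```

The same scheme works for any test statistic: only the line computing `diff` changes, which is the flexibility the abstract alludes to.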
Thursday, September 22nd
Detecting Cis-Regulatory Modules by Modeling
Correlated Structures in Genomic Sequences
Qing Zhou
Department of Statistics, Stanford
University
Cis-regulatory modules composed of multiple transcription factor
binding sites control gene expression in eukaryotic genomes. We propose
a hierarchical mixture approach to model the cis-regulatory module
structure. Based on the model, a new de novo motif-module discovery
algorithm, CisModule, is developed for the Bayesian inference of module
locations and within-module binding sites. We illustrate the use of
CisModule by its application to the discovery of a novel
tissue-specific regulatory module in Ciona savignyi. In addition,
comparative genomic studies show that regulatory elements are more
conserved across species due to evolutionary constraints. Thus we
further extend our approach to combine both module structures and
cross-species orthology in motif discovery. We use a hidden Markov
model (HMM) to capture the module structure in each species and couple
these HMMs through multiple-species alignment. Our new method has been
tested on both simulated and biological data sets, where significant
improvement over other module discovery and phylogenetic motif
discovery methods was observed.
Joint work with
Wing Hung Wong.
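As a rough sketch of the kind of machinery involved (not CisModule itself, whose model is richer), a toy two-state HMM -- background versus module -- can score a DNA sequence with the scaled forward algorithm. All transition, start, and emission probabilities below are invented for illustration:

```python
import numpy as np

# Toy two-state HMM: state 0 = background, state 1 = inside a module
trans = np.array([[0.95, 0.05],
                  [0.10, 0.90]])
start = np.array([0.9, 0.1])
# Hypothetical emission probabilities over A, C, G, T for each state
emit = np.array([[0.25, 0.25, 0.25, 0.25],   # background: uniform
                 [0.40, 0.10, 0.10, 0.40]])  # module: AT-biased

def forward_loglik(seq):
    """Log-likelihood of a DNA string under the toy HMM (forward
    algorithm with per-step rescaling to avoid underflow)."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    obs = [idx[c] for c in seq]
    alpha = start * emit[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        s = alpha.sum()
        loglik += np.log(s)
        alpha /= s
    return loglik
```

Because state 1 favors A and T while state 0 is uniform, an AT-rich string scores higher than a GC-rich one of the same length; module discovery methods exploit exactly this kind of likelihood contrast.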
Thursday, September 29th
Multiple Testing Procedures for Control of Tail Probability
of Proportion of False Positives
Professor Mark J. van der Laan
Division of Biostatistics, UC Berkeley
A fundamental
tool in the analysis of genomic data is a valid multiple testing
procedure controlling a specified Type-I error rate. In past work (with
Pollard and Dudoit) we have provided, for general hypotheses and
test statistics, resampling-based multiple testing procedures
asymptotically controlling specified Type-I error rates at a specified
level alpha. The proposed procedures differ from the resampling-based
multiple testing methodology presented in the book by Westfall and
Young in the choice of null distribution, and as a consequence
they could be shown to provide asymptotic control of the Type-I error
in general (no need for a so-called subset pivotality condition). In
this talk, we present a new multiple testing procedure
(asymptotically) controlling the tail probability of the proportion of
false positives (TPPFP) at a user-supplied proportion q at level alpha,
which we call the TPPFP empirical Bayes resampling-based multiple
testing procedure. This method combines our proposed null distribution
for the test-statistics with the empirical Bayes model which has been
previously used to control the FDR in work of John Storey and Brad
Efron. We also highlight some ongoing work on pathway testing and
variable importance testing in prediction.
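The procedure in the talk is an empirical Bayes resampling method, but the flavor of TPPFP control can be conveyed by the simpler augmentation idea from the same group: start from any FWER-controlling set of rejections and enlarge it so that the added hypotheses make up at most a proportion q of the total. A minimal sketch, with invented p-values and Bonferroni as the initial FWER step:

```python
import math

def tppfp_augmentation(pvals, alpha=0.05, q=0.5):
    """Augmentation sketch: take an FWER-controlled rejection set of
    size r0, then add the next floor(q*r0/(1-q)) smallest p-values;
    the added (possibly false) rejections are then at most a
    proportion q of the enlarged set."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Step 1: any FWER-controlling procedure works; Bonferroni here
    r0 = sum(1 for p in pvals if p <= alpha / m)
    # Step 2: augment, since TPPFP(q) control tolerates a fraction q
    extra = math.floor(q * r0 / (1 - q))
    return order[: min(m, r0 + extra)]

# Hypothetical p-values for ten tests
pvals = [0.0001, 0.0003, 0.004, 0.03, 0.08, 0.2, 0.4, 0.6, 0.8, 0.9]
rejected = tppfp_augmentation(pvals, alpha=0.05, q=0.5)
```

With these invented p-values, Bonferroni at alpha = 0.05 rejects the three smallest (threshold 0.005), and augmentation with q = 0.5 adds three more, for six rejections in total.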
Thursday, October 6th
New methods for detecting lineage-specific evolution of DNA
Dr. Katherine S. Pollard
Department of Biomolecular Engineering, UC
Santa Cruz
Most DNA evolves
neutrally, but selection (both positive and negative), mutation rate
variation, and biased gene conversion can all alter the rate at which
DNA substitutions, insertions, and deletions occur. Many methods have
been proposed for using multiple species alignments to find sequences
that are not evolving neutrally. There has been particular interest,
for example, in DNA elements conserved across many species, because
these are likely to have been under negative selection, suggesting a
functional role. Most comparative genomics methods have assumed
evolutionary pressures are the same across all branches of a phylogeny
and therefore have little power to detect elements that have come under
selection or begun to drift on a single lineage. I will present two new
methods motivated by this problem of lineage-specific evolution. Both
methods are particularly useful for identifying noncoding sequences and
have been efficiently implemented so that they can be used to screen
entire genomes. The first is based on a phylogenetic hidden Markov
model (phylo-HMM), and does not require the lineage of interest or
element boundaries to be determined a priori. Because we do not assume
that substitutions follow a Poisson process, this method can be applied
with a wide range of molecular evolutionary models. Insertions and
deletions (indels) are incorporated into the phylo-HMM by a simple
strategy that uses a separately reconstructed indel history. The second
method begins with a set of elements that are conserved in a particular
phylogeny and then screens these for the subset whose substitution rate
is significantly accelerated in an additional lineage. This approach
has allowed us to find the fastest evolving sequences in
the human genome. I will outline the methods and discuss results
obtained by applying these to both simulated and real data sets.
Finally, I will illustrate how population genetic methods can help us
understand the evolutionary forces behind non-neutrally evolving DNA.
Thursday, October 13th
Generation and Analysis of Spatial Patterns of Gene
Expression in Imaginal Discs
Cyrus Harmon
Department of Molecular and Cell Biology, UC Berkeley
We are
generating images of the spatial extent of gene expression in
developing Drosophila melanogaster larvae. By capturing images of
Drosophila imaginal discs that have been stained via in situ
hybridization to labeled probes for specific genes we can determine the
spatial patterns of expression of these genes. We have used gene
expression microarrays to identify a large number of candidate genes
which are then put into a high-throughput pipeline for image
generation. We have applied techniques from computer vision to perform
automated analysis of these images and are working on methods for
analyzing and comparing spatial patterns. I will present an overview of
the project, the methods used for gene selection, the methods used to
perform automated learning of imaginal disc shape and alignment of the
images of the stained discs, and some initial results.
Thursday, October 20th
Locating transcription factor binding sites using ChIP on
chip
Kasper D. Hansen
Division of Biostatistics, UC Berkeley
Recently, high-resolution genomic tiling arrays have become available.
By hybridizing samples exposed to chromatin immunoprecipitation on
such an array and comparing them with control samples, it is possible
to verify transcription factor binding sites experimentally, rather
than relying on purely computational tools. A suggestion for analyzing data from
such an experiment is presented.
Thursday, October 27th
Multiply Conserved Non-Coding Elements: In search of
functional classifications
Ben Brown
Graduate Group in Applied Science and Technology, UC Berkeley
We have conducted an examination of 2054 DNA sequences located
throughout the human genome, conserved to at least 70% between human
and fugu, and 98% between human and mouse. These sequences were
selected to avoid all known coding regions, and we therefore expect
them to compose a set of Conserved Non-coding Elements (CNEs). The
extraordinary conservation of these elements across hundreds of
millions of years of divergent evolution seems to imply substantial
functional importance. It has been proposed that some of these CNEs
serve as integration nodes in regulatory networks. We explore the
evidence for this hypothesis, and, utilizing existing and novel
computational methods, attempt to recapitulate regulatory interactions
encoded by the CNEs from analysis of primary sequence data and the
tissue expression data of nearby genes. We present both methods and
results.
Thursday, November 3rd
Localization of transcription factor binding sites via
chromatin immunoprecipitation (ChIP) and high-density tiling arrays: an
assay model, and its implications for analysis and interpretation
Richard Bourgon
Department of Statistics, UC Berkeley
High-density,
short-oligonucleotide tiling microarrays have recently become
available. When used in conjunction with traditional chromatin
immunoprecipitation techniques, such arrays permit in vivo, genome-wide
localization of transcription factor binding sites (or RNA polymerase,
histone modifications and histone-modifying proteins, etc.).
In this talk I
will introduce a statistical/physical model for the assay which makes
two important predictions: (i) we can expect peak-like signal in the
neighborhood of the phenomena under study, and (ii) there should be
appreciable spatial correlation, even in "noise" regions far away from
the loci of interest. Both of these predictions are borne out by actual
data, and both have implications for increasing statistical power and
avoiding false positives.
To date, several
authors have proposed methods for the analysis of high-density
"ChIP-chip" data. A few have taken advantage of (i), but none have
acknowledged (ii). As a consequence, no existing method yields
traditional p-values or a statistically grounded means of selecting a
cutoff for the test statistics it produces. I will present one simple,
non-parametric approach -- still a work in progress -- which
accommodates (ii) and produces FDR-corrected p-values from ChIP-chip
data.
Thursday, November 10th
Regulatory network dependencies from genetic
variation and quantitative expression profiling
Professor David C. Kulp
Department of Computer Science, University of Massachusetts
The combination
of whole genome expression profiling and polymorphic marker screening
has emerged as an ideal genetic perturbation model to detect causal
relationships among genes. By treating expression as a quantitative
phenotype, linkage analysis can reveal associated regulatory loci. We
developed an epistatic-like linkage model to jointly account for gene
expression and genotype and precisely map regulator genes. A
consideration of complete and reduced forms of the model provides the
means to dissect regulator-target relationships as causal or merely
dependent. In simulations we find that the model is robust with respect
to multiple independent regulators and we show that, in yeast,
regulator genes are accurately predicted and that regulatory modules
derived from pairwise linkage have biological significance.
Thursday, November 17th
History-Adjusted Marginal Structural Models and Time-Dependent Causal Effect Modification
Maya Petersen
Division of Biostatistics, UC Berkeley
Marginal structural models (MSM) provide a powerful tool for
estimating the causal effect of a treatment, particularly in the
context of longitudinal data structures. These models, introduced by
Robins, model the marginal distributions of treatment-specific
counterfactual outcomes, possibly conditional on a subset of the
baseline covariates. However, standard MSM cannot incorporate
modification of treatment effects by time-varying covariates. In the
context of clinical decision making such time-varying effect modifiers
are often of considerable interest, as they are used in practice to
guide treatment decisions for an individual. In this talk, I will
introduce a generalization of marginal structural models, which we
call history-adjusted marginal structural models (HA-MSM). These
models allow estimation of adjusted causal effects of treatment, given
the observed past, and are therefore more suitable for making
treatment decisions at the individual level and for identification of
time-dependent effect modifiers. In addition, HA-MSM identify a
particular optimal decision rule for assigning treatment at each time
point, based on a subject's measured covariates up to that time
point. I will provide a practical introduction to HA-MSM relying on an
example drawn from the treatment of HIV, and discuss parameters
estimated, assumptions, and implementation using standard software.
Thursday, December 1st
Analysis issues of oligonucleotide tiling array data
Professor Ru-Fang Yeh
Division of Biostatistics, University of California, San Francisco
The recent development of DNA tiling arrays has made it possible to experimentally annotate the genome and various protein-DNA interactions through unbiased interrogation of large genomic regions. Depending on the application, data from tiling array experiments pose unique analytic challenges that are very different from traditional expression array analysis. In this talk, I will discuss the preprocessing and analysis issues of tiling array-based experiments for DNA copy-number alterations (array CGH) and histone modifications (ChIP-chip) using NimbleGen custom arrays, and offer our preliminary solutions.
Thursday, December 8th
Analysis of Ecological Data: Use of Phylogenetic Trees with
Diversity Measurements
Elizabeth Purdom
Department of Statistics, Stanford University
One type of dataset from ecological studies comes from counting the number of species observed at various locations. Usually these data take the form of an L x S contingency table, where each entry gives the number of times species s was observed in location l. A common goal for this kind of dataset is to measure the diversity of the ecological communities, as well as to meaningfully compare the composition of species in different locations. However, phylogenetic relationships among species significantly affect notions of diversity and comparisons among locations, yet they are often not incorporated into the analysis. A recent method, Double Principal Coordinates Analysis (DPCoA) (Pavoine et al., Journal of Theoretical Biology, 2004), incorporates phylogenetic distance between species into the comparison of locations. We show that DPCoA can be cast as PCA using a particular inner product. With this framework we can compare DPCoA to traditional methods of PCA and Correspondence Analysis, as well as to traditional phylogenetic comparative methods, for example, Felsenstein's Independent Contrasts. Furthermore, we briefly highlight how this approach is a special case of more general methods of incorporating graphical information in a data analysis. We demonstrate these results on a genomic analysis of microbial communities found within the human intestinal tract (Eckburg et al., Science, 2005).
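The "PCA with a particular inner product" viewpoint can be sketched as follows: if the inner product on species space is u'Mv for a positive-definite matrix M (in DPCoA, M would be derived from the phylogenetic distances), then mapping each row of the table through a square-root factor of M reduces the analysis to ordinary Euclidean PCA. The counts and the metric below are invented, not taken from the cited studies:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy L x S table: 6 locations, 4 species (invented counts)
X = rng.poisson(5.0, size=(6, 4)).astype(float)

# Hypothetical positive-definite metric M on species space; in DPCoA
# it would be built from phylogenetic distances between the species.
A = rng.normal(size=(4, 4))
M = A @ A.T + 4.0 * np.eye(4)

# Factor M = R.T @ R and map rows so that Euclidean geometry in the
# transformed space equals M-geometry in the original space
R = np.linalg.cholesky(M).T
Z = (X - X.mean(axis=0)) @ R.T

# Ordinary PCA on the transformed data via the SVD
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s  # principal coordinates of the locations

# Distances between locations in score space equal M-distances
d_M = np.sqrt((X[0] - X[1]) @ M @ (X[0] - X[1]))
d_pc = np.linalg.norm(scores[0] - scores[1])
```

Because the SVD's right factor is orthogonal, distances among the rows of `scores` exactly reproduce the M-distances among locations, which is the sense in which the method is ordinary PCA under a different inner product.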