PH 296

Statistics and Genomics Seminar

Spring 2003



Thursday, January 23rd

Detecting DNA regulatory motifs by incorporating positional trends in information content
Katherina Kechris
Department of Statistics, UC Berkeley

Model-based motif discovery methods, such as MEME and the Gibbs Motif Sampler, are important tools for screening DNA regulatory regions to predict binding sites of transcription factors. Although these methods are very successful, they can be too general, having been developed for both DNA and protein sequences, and they can be distracted by noisy signals in the data that are not characteristic of true transcription factor binding sites. We propose a simple extension to the underlying model of these methods to improve the prediction of real sites. Our method is based on the observation that examples of real sites show positional trends in information content (or base conservation). We assign prior distributions to the frequency parameters of the model, penalizing deviations from a specified conservation profile. Examples with both simulated and real data show that these changes improve the algorithm's ability to discover motifs with information content patterns typical of real binding sites.
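
As a point of reference for the "information content" the abstract refers to, here is a minimal sketch of the per-position computation for a hypothetical position weight matrix; the PWM values are illustrative, and the prior itself is not reproduced from the talk.

```python
import numpy as np

# Hypothetical position weight matrix for a 6-bp motif
# (rows: positions; columns: frequencies of A, C, G, T).
pwm = np.array([[0.91, 0.03, 0.03, 0.03],
                [0.70, 0.10, 0.10, 0.10],
                [0.40, 0.20, 0.20, 0.20],
                [0.25, 0.25, 0.25, 0.25],
                [0.40, 0.20, 0.20, 0.20],
                [0.91, 0.03, 0.03, 0.03]])

def information_content(pwm):
    """Per-position information content in bits, relative to a uniform
    background: IC_j = 2 + sum_b f_jb * log2(f_jb)."""
    return 2.0 + np.sum(pwm * np.log2(pwm), axis=1)

# The prior described in the talk would penalize PWMs whose profile
# deviates from a specified shape; here we just display the profile.
print(np.round(information_content(pwm), 2))
```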


Thursday, January 30th

Functional Genomics and Pesticide Exposure in Children and Pregnant Women
Professor Nina T. Holland
Division of Environmental Health Sciences, UC Berkeley

Studies of the effects of widespread pesticide exposures on children's health have been limited but have suggested an association with birth defects, childhood cancer and neurodevelopmental problems. Susceptible subpopulations may differ in their genetic makeup and/or in expression of genes encoding key metabolic enzymes. For example, the human enzyme paraoxonase (PON1) detoxifies various organophosphate (OP) pesticides with different efficiencies and rates, depending on PON1 status, i.e. the polymorphism at position 192 and the level of PON1 enzyme activity. We measured PON1 status in mothers and their newborns from a longitudinal cohort of Latinos residing in an agricultural community in the Salinas Valley, California (www.chamacos.org). PON1 status plots for the CHAMACOS mothers and children separate them into three groups by position-192 genotype: Q/Q, Q/R and R/R. Because of the low levels of enzyme activity, approximately half of the cord blood samples were also genotyped by PCR analysis. Gene frequencies were 0.5 for both the Q and R variants. The level of PON1 activity was lowest in newborns with the QQ genotype, making them particularly vulnerable to OP exposure. The correlations of cord PON1 phenotype with newborns' birth weight and body length were significant (p = 0.05). PON1 activities for pregnant women were significantly higher than those of the newborns for all genotypes. In summary, functional genomics can help to determine whether there are subpopulations that are more susceptible to OP poisoning, and can thus help to inform future policy decisions regarding allowable pesticide exposure of children and pregnant women.
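
For orientation, the reported allele frequencies imply the following expected genotype proportions if the cohort were in Hardy-Weinberg equilibrium; the equilibrium assumption is purely illustrative and is not a claim made in the abstract.

```python
# Expected genotype proportions implied by the reported allele
# frequencies of 0.5, under the (illustrative) assumption of
# Hardy-Weinberg equilibrium.
p_Q = p_R = 0.5
expected = {"QQ": p_Q ** 2,        # 0.25
            "QR": 2 * p_Q * p_R,   # 0.50
            "RR": p_R ** 2}        # 0.25
print(expected)
```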

Joint work with Clement Furlong (2), Rebecca Richter (2), Maria Bastaki (1), Erin Weltzien (1), Asa Bradman (1), Rachel Jampsa (2), Kelly Birch (3), Brenda Eskenazi (1).

(1) Children's Environmental Health Center, University of California, Berkeley, CA 94720-7360, USA;
(2) Children's Environmental Health Center, University of Washington, Seattle, WA 98105, USA;
(3) Children's Hospital Oakland Research Institute, Oakland, CA 94609, USA.

Supported by: NIEHS and EPA.


Thursday, February 6th

Binding site detection by incorporating structural information
and
Model selection for regression on censored outcomes

Sunduz Keles
Division of Biostatistics, UC Berkeley

In the first half of the talk, I will describe a new binding site detection method that incorporates structural transcription factor information; in the second half, I will present an asymptotically optimal model selection method for regression on censored outcomes.

Rapid developments in sequencing technology, in combination with microarray technology, enable scientists to study gene expression and gene regulation at a genomic level. The identification of transcription factor binding sites using statistical modeling is an important component of such large scale studies. Among the methods developed for this purpose are those that utilize finite multinomial mixture models, with specific implementations such as MEME and the Gibbs Motif Sampler. Although there are continuing attempts to enhance these models, there has been little or no effort to incorporate transcription factor specific information into them. In this talk, I will present an extension of such mixture models that takes into account the characteristics of true DNA binding sites. In particular, the parameter space of binding sites (which are represented as position weight matrices) is constrained. These constraints could be as simple as specific base conservation at specific locations, or more global constraints on the information profile. I will present simulation studies and data analyses that illustrate the utility of such a model in identifying real binding sites.
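
To make the simplest kind of constraint concrete, here is a toy sketch that enforces a minimum frequency for one base in one PWM column by projection and renormalization; the actual constrained estimation procedure of the talk is not specified here, and all values are hypothetical.

```python
import numpy as np

def constrain_column(freqs, base_idx, min_freq):
    """Project one PWM column onto the constraint that base `base_idx`
    has frequency at least `min_freq`, renormalizing the other bases.
    A toy stand-in for a constrained estimation step."""
    freqs = np.asarray(freqs, dtype=float)
    if freqs[base_idx] >= min_freq:
        return freqs
    out = np.empty_like(freqs)
    out[base_idx] = min_freq
    others = [i for i in range(len(freqs)) if i != base_idx]
    rest = freqs[others]
    out[others] = (1.0 - min_freq) * rest / rest.sum()
    return out

# Require at least 80% 'A' (index 0) at a hypothetical position:
print(constrain_column([0.4, 0.3, 0.2, 0.1], base_idx=0, min_freq=0.8))
```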

In the second half, I will talk about model selection for regression problems with right censored outcomes, focusing especially on survival outcomes. Model selection is a critically important component of high dimensional statistical modeling. The recent use of microarray technology in cancer research is generating data sets with patient survival information and gene expression for thousands of genes, and hence is bringing increasing attention to this statistical topic. Often these data sets are subject to right censoring, which elevates the challenge of selecting a good predictive model. In this talk, I will present a new cross-validation based model selection method for selecting among a set of given predictors of the right censored survival outcome. The central idea of the method lies in treating the risk of a given predictor as a full data parameter and estimating it based on the observed data. We have shown that, under appropriate conditions, the proposed method performs as well as the oracle procedure in model selection; that is, the method is asymptotically equivalent to selecting the best predictor using the true data generating distribution of the full data. I will also illustrate this result with simulations in the context of histogram regression with right censored data.
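
One standard way to "treat the risk of a given predictor as a full data parameter and estimate it based on the observed data" is inverse probability of censoring weighting (IPCW). The following is a minimal sketch under that assumption; it is not necessarily the exact estimator of the talk, and ties and the left limit G(T-) are glossed over.

```python
import numpy as np

def ipcw_risk(pred, time, event):
    """IPCW empirical risk for a predictor of log survival time:
    uncensored subjects are weighted by 1/G(T), where G is the
    Kaplan-Meier estimate of the censoring survival function.
    Squared error loss on log time is used purely for illustration."""
    pred, time = np.asarray(pred, float), np.asarray(time, float)
    event = np.asarray(event, bool)
    n = len(time)
    # Kaplan-Meier for censoring: the 'events' of this KM are censorings.
    order = np.argsort(time)
    censored_sorted = ~event[order]
    at_risk = n - np.arange(n)
    G_sorted = np.cumprod(np.where(censored_sorted, 1 - 1 / at_risk, 1.0))
    G = np.empty(n)
    G[order] = G_sorted
    weights = np.where(event, 1.0 / np.clip(G, 1e-12, None), 0.0)
    return np.mean(weights * (np.log(time) - pred) ** 2)

# Toy usage with a hypothetical constant predictor of log survival time:
rng = np.random.default_rng(0)
T = rng.exponential(1.0, 100)            # true survival times
C = rng.exponential(2.0, 100)            # censoring times
time, event = np.minimum(T, C), T <= C   # observed data
print(ipcw_risk(np.full(100, 0.0), time, event))
```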


Thursday, February 13th

Finding haplotype block boundaries using the minimum description length (MDL) principle
Dr. Eric C. Anderson
Department of Integrative Biology, UC Berkeley

In the last two years, a series of studies has documented linkage disequilibrium (LD) extending across long distances in the human genome. This long-range LD is not confined to occasional, distantly separated marker pairs in LD; rather, these studies report a block-like LD structure. Concomitant with the blocks of LD, these studies also find low haplotype diversity within blocks, with as few as 4 to 6 haplotypes within a block accounting for upwards of 90% of all observed haplotypes.

The existence of haplotype block structure has serious implications for association studies. Accordingly, the last year has seen several proposals for finding haplotype blocks using SNP data from sampled chromosomes. Using Rissanen's two-stage minimum description length criterion, we have developed a new method for detecting haplotype blocks that simultaneously uses information about linkage disequilibrium decay between blocks, and the diversity of haplotypes within the blocks. We define a family of descriptive probability models indexed by the number and position of haplotype block boundaries and present a dynamic programming algorithm for finding the set of block boundaries that minimizes the two-stage description length. This method performs favorably on data simulated under a coalescent model with recombinational hotspots, as well as on previously published data.
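
Here is a generic sketch of the dynamic program for minimizing a total description length over block partitions. The block cost function would encapsulate the two-stage description length from the talk (within-block haplotype diversity plus the cost of encoding the boundaries), which is not reproduced here; the toy cost below is only for illustration.

```python
def best_partition(desc_len, n):
    """Dynamic program over block boundaries: desc_len(i, j) is the
    description length of one block covering markers i..j-1. Returns
    the minimal total description length and the boundary positions."""
    INF = float("inf")
    best = [0.0] + [INF] * n     # best[j]: optimal cost for markers 0..j-1
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + desc_len(i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i
    bounds, j = [], n
    while j > 0:
        bounds.append(j)
        j = back[j]
    return best[n], sorted(bounds)

# Toy cost: a per-block overhead plus a superlinear within-block term,
# so the optimum trades off block count against block length.
toy = lambda i, j: 1.0 + 0.1 * (j - i) ** 1.5
print(best_partition(toy, 10))
```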

Joint work with John Novembre, Department of Integrative Biology, UC Berkeley


Thursday, February 20th

Statistical Methods for Identifying Transcription Factor Binding Sites
Dr. Haiyan Huang
Biostatistical Science, Harvard School of Public Health
Department of Statistics, Harvard University

The completion of the genomes of model organisms represents just the beginning of a long march toward in-depth understanding of biological systems. One challenge in post-genomic research is the detection of functional patterns from full-length genomic sequences. This talk focuses on statistical methods in finding patterns with functional or structural importance in biological sequences, in particular the identification of transcription factor binding sites (TFBSs). Some of the underlying mathematical theories will be discussed as well.

TFBSs are often short and degenerate in sequence. Therefore they are often described by position-specific score matrices (PSSMs), which are used to score candidate TFBSs for their similarities to known binding sites. The similarity scores generated by PSSMs are essential to the computational prediction of single TFBSs or regulatory modules. We develop the Local Markov Method (LMM), which provides local p-values as a more reliable and rigorous alternative. Applying LMM to large-scale known human binding site sequences in situ, we show that compared to current popular methods, LMM can reduce false positive errors by more than 50% without compromising sensitivity.
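
For readers unfamiliar with PSSM scoring, here is a minimal sketch of the standard log-odds score that such methods assign to each candidate window; the matrix values are hypothetical, and the local p-value machinery of LMM is not reproduced.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pssm_scores(seq, pssm, background=(0.25, 0.25, 0.25, 0.25)):
    """Slide a PSSM along `seq`, returning the log-odds similarity score
    of every window: sum_j log2(f_j(b_j) / p(b_j)). This is the standard
    score to which methods such as LMM attach p-values."""
    pssm = np.asarray(pssm, float)
    bg = np.log2(np.asarray(background, float))
    logf = np.log2(pssm)
    w = len(pssm)
    return [sum(logf[j, BASES[c]] - bg[BASES[c]]
                for j, c in enumerate(seq[s : s + w]))
            for s in range(len(seq) - w + 1)]

# Hypothetical 4-position matrix (rows: positions; columns: A, C, G, T).
pssm = [[0.80, 0.10, 0.05, 0.05],
        [0.10, 0.10, 0.10, 0.70],
        [0.05, 0.80, 0.10, 0.05],
        [0.70, 0.10, 0.10, 0.10]]
print(np.round(pssm_scores("GATCAATCA", pssm), 2))
```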


Thursday, February 27th

Sequence Pattern Discovery under a Stochastic Dictionary Framework
Mayetri Gupta
Department of Statistics, Harvard University

Accurate identification of transcription factor binding sites (motifs), short conserved patterns of 6-20 nucleotides, in DNA sequences is the first step in acquiring a detailed understanding of gene regulation. Motif discovery is complicated by the fact that patterns are of unknown length, may have insertions or deletions, the total number of patterns that are present is unknown and varies widely, and the sequences contain noise in the form of low-complexity repeats that have no biological significance. We develop a novel framework for sequence pattern identification, extending the idea of a dictionary model (Bussemaker et al., 2000). In our initial model, patterns and single letters are assumed to be independently generated from an unknown dictionary of stochastic words and concatenated to form the sequence. Data augmentation algorithms are developed that iteratively update the pattern composition and locate pattern sites on the sequences. The methodology is extended to find patterns of unknown widths, and patterns with insertions and deletions. In eukaryotic genomes, motif detection is a more challenging problem as the patterns tend to be shorter, less well-conserved, and occur in multi-pattern clusters (regulatory modules). We extend the stochastic dictionary model to a module framework under which an evolutionary Monte Carlo-based state-space model selection algorithm is constructed. This enables us to identify an optimal set of patterns and obtain improved parameter estimates. The performance of these new methods will be demonstrated by both simulation studies and applications to bacterial and human genomes.
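
To give a feel for the dictionary model, here is a toy forward recursion for the probability of a sequence as a concatenation of dictionary words. It deliberately simplifies the model in the talk: words are deterministic rather than stochastic, and the data augmentation and model selection machinery are omitted; the dictionary and its usage probabilities are hypothetical.

```python
import numpy as np

def sequence_likelihood(seq, dictionary):
    """Forward recursion for the probability of `seq` being generated by
    concatenating words drawn independently from a dictionary, where
    `dictionary` maps each word to its usage probability. The sampler
    described in the talk alternates between segmenting the sequence
    and updating the dictionary itself."""
    n = len(seq)
    L = np.zeros(n + 1)
    L[0] = 1.0
    for j in range(1, n + 1):
        for word, rho in dictionary.items():
            k = len(word)
            if k <= j and seq[j - k : j] == word:
                L[j] += L[j - k] * rho
    return L[n]

# Single letters plus one 'motif word', with usage probabilities summing to 1.
dic = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "TATA": 0.2}
print(sequence_likelihood("CTATAG", dic))
```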


Thursday, March 6th

Folding of Random Sequences with Application to RNA-gene Finding
Professor Niels Richard Hansen
Department of Applied Mathematics and Statistics, University of Copenhagen, Denmark
Currently visiting Department of Statistics, Stanford University.

Small functional RNA molecules have recently been found in great numbers. A class of these, called microRNAs, is believed to regulate the expression of protein coding genes, and in a pre-functional state they have a typical fold-back structure. It is of great interest to be able to search entire genomes by computational methods to identify potential microRNAs, and one approach is to look for small segments of the genome with a fold-back structure comparable to that of known microRNAs.

In the talk, I will present some theoretical work on the formation of certain fold-back structures in random (i.i.d. and Markov) sequences. For the size and structure of typical microRNAs, the results suggest that the fold-back structure alone does not significantly distinguish small microRNAs from random fold structures. To improve results, the construction of a better scoring system for the structure, and the use of evolutionary conservation, will be discussed.
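
To illustrate why fold-back structure alone carries limited signal, here is a brute-force sketch that measures the longest perfect fold-back stem in a random i.i.d. sequence. It uses Watson-Crick pairs only (no G-U wobble) and no energy model, so it is far cruder than real folding algorithms.

```python
import random

PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def longest_stem(seq, min_loop=3):
    """Length of the longest perfect fold-back stem: base i pairs with
    base j, i-1 with j+1, and so on, with at least `min_loop` unpaired
    bases left in the hairpin loop."""
    n, best = len(seq), 0
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            k = 0
            while i - k >= 0 and j + k < n and PAIR[seq[i - k]] == seq[j + k]:
                k += 1
            best = max(best, k)
    return best

random.seed(1)
rna = "".join(random.choice("ACGU") for _ in range(200))
print(longest_stem(rna))   # even pure noise yields sizeable stems
```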


Thursday, March 13th

Alternative splicing and gene regulation
Professor Steven E. Brenner
Department of Plant and Microbial Biology, UC Berkeley

Alternative splicing produces multiple mRNAs from a single gene, while nonsense-mediated mRNA decay (NMD) is an endogenous surveillance system that causes degradation of mRNAs with premature termination codons. We recently discovered that one-third of reliable human alternative mRNA isoforms are candidate targets of NMD. We similarly found that NMD may have profound unrecognized roles in human disease. Consequently, the coupling of alternative splicing and nonsense-mediated mRNA decay governs the expression of numerous proteins. We show that this functional coupling can yield regulated unproductive splicing and translation (RUST), and we reason that the contribution of alternative splicing to proteome diversity may be balanced by an as-yet unappreciated regulatory role in gene expression.

This talk will review the biological mechanisms of alternative splicing and nonsense-mediated mRNA decay, and the computational methods used to infer their action. I will describe our computational and experimental studies of natural human genes targeted for NMD. The talk will conclude with hypotheses regarding the coupling of alternative splicing and NMD for RUST.
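
As an illustration of the kind of computational inference involved, a widely used heuristic for flagging NMD candidates is the "50-nucleotide rule": a termination codon lying more than about 50 nt upstream of the last exon-exon junction is predicted to trigger NMD. The sketch below applies that rule; the coordinates are hypothetical, and this is not necessarily the exact procedure of the talk.

```python
def is_nmd_candidate(stop_codon_pos, junction_positions, rule_nt=50):
    """Apply the '50-nucleotide rule': an mRNA is flagged as an NMD
    candidate if its stop codon lies more than `rule_nt` nucleotides
    upstream of the last exon-exon junction. Positions are transcript
    coordinates in nt."""
    if not junction_positions:
        return False  # intronless transcripts are not flagged
    return max(junction_positions) - stop_codon_pos > rule_nt

# A hypothetical isoform: stop at nt 900, junctions at nt 400, 700, 1000.
print(is_nmd_candidate(900, [400, 700, 1000]))  # True: 1000 - 900 > 50
```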


Thursday, March 20th

Multiple testing procedures for gene expression data
Katherine Pollard
Division of Biostatistics, UC Berkeley

Gene expression studies produce data with which inferences can be made for thousands of genes simultaneously. Standard practice in multiple testing with gene expression data is to use t-statistics as test statistics and to control the error rate under the permutation distribution. In this talk, I will revisit the rationale behind such choices and suggest situations in which alternatives are more sensible. I suggest an optimal null distribution for testing hypotheses about gene-specific parameters, namely the Kullback-Leibler projection of the true data generating distribution onto the submodel defined by the overall null hypothesis, and I propose to estimate this null distribution by maximum likelihood. In many cases, one can control error rates (at least asymptotically) by resampling from this estimated null distribution. I contrast the optimal null distribution with the null distribution estimated by permutation methods and illustrate that the two are very different in the two sample problem whenever the sample sizes are not equal. With real and simulated gene expression data, we have evaluated the finite sample performance of different choices of test statistics, estimated null distributions and multiplicity adjustments. I report results indicating that parametric bootstrap methods perform best when the model for the data is known; since this is rarely true in practice, non-parametric bootstrap methods are recommended. I also demonstrate that the null distribution of the standardized t-statistic is harder to estimate than that of the difference in means.
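
The unequal-sample-size phenomenon is easy to see numerically. Here is a minimal sketch for the difference-in-means statistic, contrasting the permutation null with a group-wise, null-centered nonparametric bootstrap under unequal variances and unequal sample sizes; this illustrates the issue, and is not the talk's maximum likelihood estimator of the projected null.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unequal sizes AND unequal variances: the case where the permutation
# null diverges most visibly from the true null distribution.
n_x, n_y = 5, 30
x = rng.normal(0.0, 1.0, n_x)   # both groups have mean 0, so the
y = rng.normal(0.0, 3.0, n_y)   # overall null hypothesis is true

stat = lambda a, b: a.mean() - b.mean()   # difference in means

B = 20_000
perm, boot = np.empty(B), np.empty(B)
pooled = np.concatenate([x, y])
for b in range(B):
    # Permutation: relabelling assumes the two groups have identical
    # distributions, not merely equal means.
    p = rng.permutation(pooled)
    perm[b] = stat(p[:n_x], p[n_x:])
    # Null-centered bootstrap: each group is resampled from itself,
    # so the unequal variances are preserved.
    xb = rng.choice(x, n_x, replace=True) - x.mean()
    yb = rng.choice(y, n_y, replace=True) - y.mean()
    boot[b] = stat(xb, yb)

# True null sd is sqrt(1/5 + 9/30) ~ 0.71; the permutation spread is
# much larger because it mixes the two variances across unequal groups.
print(perm.std(), boot.std())
```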


Thursday, April 10th

Unified Cross-validation Methodology for Estimator Selection and Applications to Genomics
Professor Sandrine Dudoit
Division of Biostatistics, UC Berkeley

We describe a unified cross-validation methodology for selection among estimators. This general framework covers in particular a number of selection problems which have traditionally been treated separately in the statistical literature: predictor selection based on censored outcomes, predictor selection based on multivariate outcomes, density estimator selection, survival function estimator selection, and counterfactual predictor selection in causal inference. Finite sample and asymptotic optimality results are derived for the cross-validation selector for general data generating distributions, loss functions, and estimators. The asymptotic optimality result states that the cross-validation selector performs asymptotically as well as an optimal benchmark selector based on the true unknown data generating distribution. A broad definition of cross-validation is used in order to cover leave-one-out cross-validation, V-fold cross-validation, and Monte Carlo cross-validation. We will discuss applications of this general methodology to estimator selection problems in genomic data analysis, including the prediction of biological and clinical outcomes using microarray gene expression measures and the identification of regulatory motifs in DNA sequences.
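
A generic sketch of the V-fold cross-validation selector in this framework is given below; the estimators, loss, and data are hypothetical placeholders, and the general theory (arbitrary full-data loss functions, censored and counterfactual settings) is of course not captured by this toy.

```python
import numpy as np

def cv_select(estimators, X, y, loss, V=5, seed=0):
    """V-fold cross-validation selector: fit each candidate estimator on
    every training split, average `loss` (a mean loss over a held-out
    set) across splits, and return the index of the candidate with the
    smallest cross-validated risk. Each estimator maps
    (X_train, y_train) to a predict function."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % V
    risks = np.zeros(len(estimators))
    for v in range(V):
        train, test = folds != v, folds == v
        for k, fit in enumerate(estimators):
            predict = fit(X[train], y[train])
            risks[k] += test.sum() * loss(y[test], predict(X[test]))
    return int(np.argmin(risks / len(y)))

# Toy usage: choose between an intercept-only and a least squares fit.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.5 * rng.normal(size=200)
mean_only = lambda Xt, yt: (lambda Xn: np.full(len(Xn), yt.mean()))
ols = lambda Xt, yt: (lambda Xn, b=np.linalg.lstsq(Xt, yt, rcond=None)[0]: Xn @ b)
mse = lambda a, b: float(np.mean((a - b) ** 2))
print(cv_select([mean_only, ols], X, y, mse))  # expect 1 (least squares)
```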

Joint work with Mark van der Laan and Sunduz Keles.

Technical reports #124, 125, 126, and 130
Division of Biostatistics, UC Berkeley


Thursday, April 17th

Multiple Output Analysis
Professor Leo Breiman
Department of Statistics, UC Berkeley

The norm for prediction algorithms is univariate prediction, that is, predicting a single output (dependent) variable. Predicting and understanding multiple outputs is more difficult because of the dependence between the outputs, and it is exactly this dependence that must be exploited to make the predictions, and the understanding of the mechanism, more accurate. This has applications to curve analysis, repeated measures analysis, and related problems. I report on recent work resulting in an algorithm, RF/mo, for analyzing multiple outputs.
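
Details of RF/mo are not given in the abstract. As a hedged illustration of the joint-versus-separate modeling idea only, here is a sketch using scikit-learn's multi-output random forest, in which a single forest's splits must serve all outputs simultaneously; this is a later analogue, not Breiman's algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Two strongly dependent outputs driven by the same features plus noise.
signal = X[:, 0] + 2 * X[:, 1]
Y = np.column_stack([signal + 0.1 * rng.normal(size=500),
                     -signal + 0.1 * rng.normal(size=500)])

# A single forest fit to both outputs grows trees whose splits must be
# good for the outputs jointly, in the spirit of exploiting dependence;
# the alternative is one independent forest per output column.
joint = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, Y)
print(joint.predict(X[:3]))   # one (2-output) prediction per row
```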


Thursday, April 24th

To pool or not to pool: an experience with GeneChips
Professor Terence P. Speed
Department of Statistics, UC Berkeley

A question many people ask is whether pooling mRNA from different samples and then running technical replicates of the pool can lead to savings in chips, with no downside. This talk presents the design and analysis of an experiment seeking an answer to this question. Unfortunately, the design was not as rigorous as it should have been, and anomalies arose which compromise the study's conclusions. However, it does seem possible to say that there are risks with pooling, and that the likely advantages may not be great enough to warrant those risks.
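
The basic variance arithmetic behind the question is sketched below under very idealized assumptions (independent samples, additive biological and technical variance components, and a pool that exactly averages the mRNA contributions); none of these numbers come from the experiment discussed in the talk.

```python
# Idealized variance accounting for the pooling question.
s_b2, s_t2 = 1.0, 0.5   # hypothetical biological / technical variances
n, r = 8, 8             # n biological samples; r chips

# Design 1: one chip per biological sample (n chips).
var_individual = s_b2 / n + s_t2 / n

# Design 2: pool the n samples, run r technical replicates of the pool.
# The biological contribution is averaged once and shared by every chip.
var_pooled = s_b2 / n + s_t2 / r

print(var_individual, var_pooled)
# With r = n the two designs agree here; running fewer chips (r < n)
# inflates only the technical term, which is the hoped-for saving. But
# any pool-specific artifact, or a single bad sample, contaminates
# every chip, which is one of the risks referred to above.
```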


Thursday, May 1st

Gene Mapping and Model Selection
Professor David O. Siegmund
Department of Statistics, Stanford University

The goal of gene mapping is to identify genes associated with specific traits, e.g., human diseases, quantitative traits in animal models of human diseases, or quantitative traits in agriculturally important species. An initial step often involves testing markers throughout a genome for linkage to genes contributing to the trait. This involves selection of a model for the trait as a function of genotype and environment; the model may involve multiple, possibly interacting, genes.

To map quantitative traits in experimental genetics, Broman and Speed (2002, JRSSB) have suggested use of the Bayes Information Criterion (BIC) for model selection. An issue they must face is that the conditions imposed by Schwarz (1978, Ann. Statist.) to justify the BIC are not met in gene mapping problems, so the BIC penalty of (k/2)log n for choosing a model with k parameters (when the sample size is n) may not be appropriate. In this talk I will re-interpret the BIC argument in terms of p-values and the logic of Bahadur efficiency. I will then show, using hypothetical numerical examples from the literature, how this interpretation would function in selecting models for gene mapping.
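
For concreteness, here is the penalized log-likelihood form of the criterion in a toy comparison; the log-likelihoods are hypothetical, and the sketch does not reflect the p-value reinterpretation developed in the talk.

```python
import numpy as np

def bic(loglik, k, n):
    """Schwarz's criterion in the form quoted above: penalized
    log-likelihood loglik - (k/2) * log(n) for a model with k parameters
    and sample size n (larger is better in this sign convention)."""
    return loglik - 0.5 * k * np.log(n)

# Toy comparison: a 1-QTL vs a 2-QTL model for a trait on n = 200
# progeny (hypothetical log-likelihoods). In gene mapping the number of
# candidate models grows with the genome scan, which is why the plain
# penalty can be too lenient, as the talk discusses.
n = 200
print(bic(-310.0, k=2, n=n))   # 1-QTL model
print(bic(-306.0, k=4, n=n))   # 2-QTL model: better fit, bigger penalty
```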


Thursday, May 8th

Least Angle Regression, Forward Stagewise and the Lasso
Professor Trevor Hastie
Department of Statistics, Stanford University

Least Angle Regression (LARS) is a new model selection algorithm. It is a useful and less greedy version of traditional forward selection methods. Three main properties of LARS are derived:

  1. A simple modification of the LARS algorithm implements the Lasso, an attractive alternative to OLS that constrains the sum of the absolute regression coefficients. The LARS modification calculates all possible Lasso estimates for a given problem in an order of magnitude less computer time than previous methods.
  2. A different LARS modification efficiently implements epsilon Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm.
  3. A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates.
LARS and its variants are computationally efficient. We provide R and Splus software that enables one to fit the entire coefficient path for LAR, Lasso or Forward Stagewise at the cost of a single least squares fit.

There are strong connections between the epsilon forward stagewise regression and the boosting technique popular in machine learning. These connections offer new explanations for the success of boosting.
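
The software mentioned above was written in R and Splus. As a hedged illustration of the same path computation, here is the equivalent call in scikit-learn, a later implementation rather than the software of the talk; the simulated data are purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import lars_path
from sklearn.datasets import make_regression

# Compute the full Lasso coefficient path via the LARS modification
# described in point 1 of the list above.
X, y = make_regression(n_samples=100, n_features=10, noise=5.0,
                       random_state=0)
alphas, active, coefs = lars_path(X, y, method="lasso")

# coefs has one column per step of the path; the entire path costs on
# the order of a single least squares fit.
print(coefs.shape)   # (10 features, number of path steps)
print(active)        # order in which variables entered the model
```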

Joint work with Brad Efron, Iain Johnstone, and Rob Tibshirani.