PH 296

Statistics and Genomics Seminar

Spring 2004



Thursday, January 22nd

Parametric inference for biological sequence analysis
Professor Lior Pachter
Department of Mathematics, UC Berkeley

One of the key developments in biological sequence analysis during the past decade has been the emergence of graphical models as a unifying formalism for developing algorithms. Examples of graphical models that have been used include hidden Markov models for annotation and pair hidden Markov models for alignments, as well as tree models for phylogenetics. A single algorithm, the sum-product algorithm, solves many of the inference problems associated with different models. We will present a geometric version of the sum-product algorithm, called the polytope propagation algorithm, that can be used to analyze the parametric behavior of maximum a posteriori inference calculations. For example, in the case of hidden Markov models, the polytope propagation algorithm identifies all the parameters that result in the same Viterbi path for a fixed observed sequence. We also describe efficient algorithms for parametric sequence alignment with multiple parameters.
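
As background for the maximum a posteriori inference discussed above, the following is a minimal sketch of the Viterbi computation for a discrete hidden Markov model; the function and parameter names are illustrative, not taken from the talk. Polytope propagation asks how the resulting optimal path changes as these (log) parameters vary.

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """MAP (Viterbi) state path for a discrete HMM, computed in log space.

    obs       : sequence of observed symbol indices
    log_init  : (K,) log initial-state probabilities
    log_trans : (K, K) log transition probabilities
    log_emit  : (K, M) log emission probabilities
    """
    K, n = log_init.shape[0], len(obs)
    score = np.empty((n, K))
    back = np.zeros((n, K), dtype=int)

    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n):
        # best score of any path ending in each state at time t
        cand = score[t - 1][:, None] + log_trans + log_emit[:, obs[t]][None, :]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)

    # trace back the optimal state sequence
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```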

This is joint work with Bernd Sturmfels.


Thursday, January 29th

A general description of the Deletion/Substitution/Addition algorithm with an example in polynomial regression
Sandra E. Sinisi
Division of Biostatistics, UC Berkeley

Current statistical inference problems in genomic data analyses involve parameter estimation for high-dimensional multivariate distributions, such that statistical models for the data-generating distribution correspond to large and complex parameter spaces. A general estimation framework has been developed to address such problems, which includes three main steps: (1) definition of the parameter of interest in terms of a loss function, (2) construction of candidate estimators based on that loss function, and (3) cross-validation for estimator selection and performance assessment. After defining the parameter of interest as the risk minimizer for a particular loss function, one must generate a sequence of candidate estimators by minimizing the empirical risk over subspaces of increasing dimension that approximate the complete parameter space. In this talk, I'll propose an aggressive and flexible algorithm for minimizing risk over index sets according to three types of moves for the elements of a given index set: deletions, substitutions, and additions. I refer to this general algorithm as the Deletion/Substitution/Addition algorithm. The main features of this algorithm will be summarized for tensor product basis functions (e.g., tensor products of univariate polynomial basis functions in polynomial regression).
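
One simplified reading of such a move-based search is sketched below in Python; the empirical_risk function and the flat index-set representation are hypothetical placeholders, and the actual Deletion/Substitution/Addition algorithm is considerably more elaborate than this greedy loop.

```python
def dsa_search(initial_set, all_indices, empirical_risk, max_size, max_iter=100):
    """Greedy search over index sets using deletion/substitution/addition moves.

    empirical_risk : function mapping an index set (a Python set) to its
                     empirical risk (hypothetical placeholder)
    all_indices    : iterable of all candidate indices
    max_size       : upper bound on the size of the index set
    """
    current = set(initial_set)
    best_risk = empirical_risk(current)
    for _ in range(max_iter):
        moves = []
        # deletions: drop one element from the current set
        moves += [current - {i} for i in current]
        # substitutions: swap one element for one currently outside the set
        moves += [(current - {i}) | {j}
                  for i in current for j in all_indices if j not in current]
        # additions: add one new element, respecting the size constraint
        if len(current) < max_size:
            moves += [current | {j} for j in all_indices if j not in current]

        scored = [(empirical_risk(m), m) for m in moves if m]
        if not scored:
            break
        risk, candidate = min(scored, key=lambda pair: pair[0])
        if risk >= best_risk:
            break  # no single move improves the empirical risk; stop
        current, best_risk = candidate, risk
    return current, best_risk
```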


Thursday, February 5th

Piecewise Constant Estimation in Prediction of Survival Outcomes: Applications in Genomics
Annette Molinaro
Division of Biostatistics, UC Berkeley

Clinicians and researchers collect a tremendous amount of data on cancer patients in the hopes of finding significant prognostic factors. Medical studies commonly involve thousands of clinical, epidemiological, and genomic measurements collected on each patient, along with a time to the clinical event of interest, such as disease recurrence or death. Over the past several decades there have been numerous attempts to use nonparametric methods with this type of data to find an estimator of outcome. A common approach is to modify classification and regression trees (CART) (Breiman, et al., 1984) specifically for right censored data. This presentation includes a generalization of CART based on a unified strategy for estimator construction, selection, and performance assessment in the presence of censoring. In this approach, the parameter of interest is defined as the risk minimizer for a suitable loss function and candidate estimators are generated with CART using this loss function. Cross-validation is applied to select an optimal estimator among the candidates and to assess the overall performance of the resulting estimator.

As this strategy is not limited to CART, a new, more aggressive algorithm for generating possible predictors will be introduced. This algorithm finds the best partition of the feature space by optimizing a cross-validated risk estimate over possible moves including deletion, addition, and substitution. To illustrate this approach, both CART and the new algorithm have been applied to simulation studies as well as example data from Comparative Genomic Hybridization array analysis. The proposed approach is applicable to numerous settings, including univariate and multivariate prediction and density estimation. Thus, this method provides a powerful predictive tool for linking complex data sets with censored (or non-censored) outcomes.
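
As an illustration of the estimator-selection step, the sketch below performs generic V-fold cross-validation over a list of candidate fitting procedures; candidate_fits and loss are hypothetical placeholders (for censored outcomes the loss itself must account for censoring), and this is not the authors' implementation.

```python
import numpy as np

def cv_select(data, candidate_fits, loss, V=5, seed=0):
    """Select among candidate estimators by V-fold cross-validated risk.

    data           : list of observations
    candidate_fits : list of functions, each mapping a training list to a
                     fitted predictor (hypothetical placeholders)
    loss           : function loss(predictor, validation_list) -> mean loss
    """
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, V, size=len(data))
    risks = np.zeros(len(candidate_fits))
    for v in range(V):
        train = [d for d, f in zip(data, folds) if f != v]
        valid = [d for d, f in zip(data, folds) if f == v]
        for k, fit in enumerate(candidate_fits):
            predictor = fit(train)
            # accumulate the validation risk, averaged over the V folds
            risks[k] += loss(predictor, valid) / V
    best = int(np.argmin(risks))
    return best, risks
```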


Thursday, February 12th

The SERA gene family in Plasmodium: Multiple criteria for assessing and improving inferred gene trees
Richard Bourgon
Department of Statistics, UC Berkeley

Plasmodium falciparum is the parasite responsible for the most acute form of malaria in humans. Recently, the serine repeat antigen (SERA) in P. falciparum has attracted attention as a potential vaccine target. In this talk, we will examine attempts to infer the evolutionary history of a family of SERA genes in seven different species of Plasmodium, and two particular problems that the gene family presents for inference: significant variation in GC content between species, and the need to reconcile any inferred gene tree with one of two conflicting Plasmodium species trees that have appeared in the literature. We will see, first, that GC content can indeed lead to incorrect inference in a naive analysis, and discuss several approaches for overcoming this problem. Next, we will examine a simple method for using a known or hypothesized species tree to distinguish between speciation and duplication events in an inferred gene tree. This information is of interest in its own right, and can also be used to infer root location for unrooted gene trees, to select a species tree among a set of candidate trees, and to assess and improve on gene tree phylogenies obtained from standard distance methods, Maximum Likelihood, or Maximum Parsimony. In the present case, reconciliation suggested two minor improvements to a poorly resolved region of the inferred gene tree, and produced an inferred set of 8 missing or deleted SERA family genes. A review of new Plasmodium genome data made available after the original analysis was complete turned up sequences consistent with 7 of these 8 missing genes.
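
For readers unfamiliar with reconciliation, the sketch below implements the standard LCA-mapping rule for labeling gene-tree nodes as speciation or duplication events; the data structures are illustrative and the procedure used in the talk may differ in detail.

```python
def reconcile(gene_root, gene_children, gene_species, species_parent):
    """Label internal gene-tree nodes as speciation or duplication events by
    mapping each node to the least common ancestor (LCA), in a rooted species
    tree, of the species below it (the standard LCA-mapping rule).

    gene_root      : root node of the (rooted) gene tree
    gene_children  : dict mapping each gene-tree node to a list of children
                     (empty list for leaves)
    gene_species   : dict mapping each gene-tree leaf to its species
    species_parent : dict mapping each species-tree node to its parent
                     (the species-tree root maps to None)
    """
    def depth(s):
        d = 0
        while species_parent[s] is not None:
            s, d = species_parent[s], d + 1
        return d

    def species_lca(a, b):
        da, db = depth(a), depth(b)
        while da > db:
            a, da = species_parent[a], da - 1
        while db > da:
            b, db = species_parent[b], db - 1
        while a != b:
            a, b = species_parent[a], species_parent[b]
        return a

    mapping, labels = {}, {}

    def visit(node):
        kids = gene_children[node]
        if not kids:                      # leaf: maps to its own species
            mapping[node] = gene_species[node]
            return
        for child in kids:
            visit(child)
        m = mapping[kids[0]]
        for child in kids[1:]:
            m = species_lca(m, mapping[child])
        mapping[node] = m
        # duplication if the node maps to the same species-tree node as a child
        labels[node] = ("duplication"
                        if any(mapping[c] == m for c in kids) else "speciation")

    visit(gene_root)
    return mapping, labels
```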


Thursday, February 19th

Multiple Testing Procedures: Applications to Genomics
Professor Sandrine Dudoit
Division of Biostatistics, UC Berkeley

We propose general single-step and step-down multiple testing procedures for controlling Type I error rates defined as arbitrary parameters of the distribution of the number of Type I errors (e.g., generalized family-wise error rate). The approach is based on a null distribution for the test statistics, rather than a data generating null distribution, and provides asymptotic control of the Type I error rate without the need for conditions such as subset pivotality. For general null hypotheses, corresponding to submodels for the data generating distribution, the proposed null distribution is the asymptotic distribution of the vector of null-value shifted and scaled test statistics. Resampling procedures (e.g., based on the non-parametric or model-based bootstrap) are provided to conveniently obtain consistent estimators of the null distribution. In the special case of family-wise error rate (FWER) control, this general approach reduces to the minP and maxT procedures based on minima of p-values and maxima of test statistics, respectively. We propose simple augmentations of FWER-controlling multiple testing procedures to control the proportion of false positives among the rejected hypotheses and the generalized family-wise error rate. Finally, we establish the equivalence between single-step multiple testing and error rate specific confidence regions.
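
As a concrete illustration of the maxT construction mentioned above, the following computes single-step maxT adjusted p-values from a matrix of resampled null test statistics; the array layout is an assumption for illustration, and the step-down refinement is omitted.

```python
import numpy as np

def maxT_adjusted_pvalues(observed, null_stats):
    """Single-step maxT adjusted p-values.

    observed   : (m,) observed test statistics
    null_stats : (B, m) test statistics drawn from the estimated (null-value
                 shifted and scaled) null distribution, e.g. via the bootstrap
    """
    max_null = np.abs(null_stats).max(axis=1)   # (B,) maxima under the null
    obs = np.abs(np.asarray(observed))
    # adjusted p-value: fraction of null maxima at or above each observed statistic
    return np.array([(max_null >= t).mean() for t in obs])
```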

The following applications to multiple testing problems in genomics will be discussed: identification of differentially expressed genes in microarray experiments, genetic mapping of complex traits using single nucleotide polymorphisms (SNPs), and testing for associations between microarray measures of gene expression and Gene Ontology (GO) annotations.

Joint work with Mark van der Laan, Katie Pollard, Merrill Birkner, and Melanie Courtine.


Thursday, February 26th

Conflicting principles in hypothesis testing
Professor Erich Lehmann
Department of Statistics, UC Berkeley

The talk considers statistical principles such as unbiasedness, minimaxity, monotonicity, Bayes, and likelihood ratio. I discuss a number of examples in which these principles lead to strongly conflicting solutions, and examine the difficult choices which then have to be faced.
Please note the change in location for this talk: Room 4 Haviland.


Thursday, March 11th

MDL and Its Application to Gene Clustering and Selection
Professor Bin Yu
Department of Statistics, UC Berkeley

Minimum Description Length (MDL) is a general principle for statistical modeling. It states that one should choose the model that gives the shortest description of the data. MDL is a fruitful cross-fertilization of ideas between statistics and information theory: on one hand it encompasses different approaches in statistics, and on the other it relates to basic concepts in coding and computing.

In this talk, we will give an introduction to MDL and its basic forms for model selection. Then we will concentrate on a particular form, gMDL, for regression models, which bridges AIC and BIC. Finally, we will apply gMDL to simultaneously cluster and select subsets of genes based on microarray data.
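
For orientation, the display below records the generic two-part description length in its familiar large-sample (BIC-like) form; gMDL itself uses a refined, data-dependent code length developed by the authors, so this is only a schematic illustration of the principle.

```latex
% Two-part MDL code length for a model with k free parameters fitted to n
% observations (large-sample approximation; not the gMDL criterion itself):
\mathrm{DL}(\text{data}\mid\text{model}) \;\approx\; -\log \hat{L} \;+\; \tfrac{k}{2}\log n,
\qquad \hat{L} \;=\; \max_{\theta} P_{\theta}(\text{data}).
```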

(This is based on joint work with Mark Hansen and Rebecka Jornsten.)


Thursday, March 18th

Statistical Methods for Analyzing ChIP-High Density Oligonucleotide Array Data
Dr. Sunduz Keles
Division of Biostatistics, UC Berkeley

Affymetrix has recently mapped the binding sites of three transcription factors along human chromosomes 21 and 22 (Cawley et al., 2004). The technology used consists of chromatin immunoprecipitation (IP) followed by high density oligonucleotide hybridization of IP-enriched DNA, and it generates a new type of genomic data for the identification of transcription factor binding sites. We investigate this data structure and propose methods for analyzing it. The proposed methods are based on multiple testing procedures that control different types of false positive rate using a suitable test statistic. The test statistic used is in the form of a scan statistic and takes into account the spatial structure in the data. Exploring the dependence structure of the test statistics provides a Bonferroni-type adjustment for controlling the family-wise error rate. Simulation studies show that taking into account the spatial structure of the data provides substantial improvement in the performance of the multiple testing procedures.
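
One simple way to picture such a scan statistic is as a windowed average of probe-level statistics, as in the hedged sketch below; the window definition and the inputs are assumptions made for illustration, not the procedure described in the talk.

```python
import numpy as np

def window_scan(statistics, positions, width):
    """Windowed scan statistic: for each probe, average the probe-level
    statistics of all probes within `width` base pairs of it.

    statistics : (n,) probe-level test statistics, ordered by position
    positions  : (n,) genomic coordinates of the probes
    width      : window width in base pairs
    """
    stats = np.asarray(statistics, dtype=float)
    pos = np.asarray(positions, dtype=float)
    scan = np.empty_like(stats)
    for i, p in enumerate(pos):
        in_window = np.abs(pos - p) <= width / 2.0
        scan[i] = stats[in_window].mean()
    return scan
```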

Application of the proposed methods to protein localization data for the transcription factor p53 identified many potential target regions along human chromosomes 21 and 22. Among these identified regions, 18% fall within 3 kb of the 5'UTR of a known ORF or a CpG island, and 31% fall between the start and stop codons of a known ORF but not inside an exon. More than half of these targets show enrichment of the p53 consensus binding site or very close matches to it. Moreover, these targets include 13 experimentally verified p53 binding regions of Cawley et al. (2004) and also include 49 additional regions that show more binding activity signal than these experimentally verified regions.
                                                                                                                                                    
Joint work with Mark J. van der Laan, Siew L. Teng, Sandrine Dudoit and Simon Cawley.


Thursday, April 1st

Evolutionary mixture models for phylogenetic motif finding
Alan Moses
Department of Biophysics, UC Berkeley

                            
The preferential conservation of transcription factor binding sites implies that non-coding sequence data from related species will prove a powerful asset to motif discovery. We present a unified probabilistic framework for motif discovery that incorporates evolutionary information. We treat aligned DNA sequence as a mixture of evolutionary models, for motif and background, and, following the example of the MEME program, provide an algorithm to estimate the parameters by Expectation-Maximization. We examine a variety of evolutionary models and show that our approach can take advantage of phylogenetic information to avoid false positives and discover motifs upstream of groups of characterized target genes.
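
To illustrate the mixture aspect, here is a hedged EM sketch that estimates only the motif/background mixing weight from per-site log-likelihoods supplied by the two evolutionary models; in a MEME-style algorithm the motif parameters would also be re-estimated in each M-step, which is omitted here.

```python
import numpy as np

def em_mixing_weight(loglik_motif, loglik_background, n_iter=50):
    """EM for the mixing weight of a two-component mixture, given per-site
    log-likelihoods under each component (placeholders standing in for the
    motif and background evolutionary models evaluated on aligned columns).

    Returns the estimated motif proportion and the posterior motif
    responsibilities for each site.
    """
    lm = np.asarray(loglik_motif, dtype=float)
    lb = np.asarray(loglik_background, dtype=float)
    pi = 0.5                                   # initial motif proportion
    for _ in range(n_iter):
        # E-step: posterior probability that each site comes from the motif model
        log_num = np.log(pi) + lm
        log_den = np.logaddexp(log_num, np.log(1.0 - pi) + lb)
        resp = np.exp(log_num - log_den)
        # M-step: update the mixing proportion (clamped away from 0 and 1)
        pi = float(np.clip(resp.mean(), 1e-6, 1.0 - 1e-6))
    return pi, resp
```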


Thursday, April 15th

Iterative Conditional Fitting for Gaussian Ancestral Graph Models
Mathias Drton
Department of Statistics, University of Washington

Ancestral graph models, introduced by Richardson and Spirtes (2002), are a new class of graphical models that generalize Bayesian networks, which have, for example, been used to recover gene interactions from microarray data (Friedman, Linial, Nachman and Pe'er, 2000). A key feature of ancestral graphs is that they can encode all conditional independence structures that may arise from a Bayesian network with selection and unobserved variables. Thus, ancestral graph models provide a potentially very useful framework for exploratory model selection when the existence of unobserved variables cannot be excluded.

In this talk, we consider Gaussian ancestral graph models and present a new algorithm for maximum likelihood estimation. We call this algorithm iterative conditional fitting (ICF) since, in each step of the procedure, a conditional distribution is estimated, subject to constraints, while a marginal distribution is held fixed. We show that in the Gaussian case considered here, ICF may be implemented by regressions on "pseudo-variables". The ICF algorithm is dual to the well-known iterative proportional fitting algorithm, in which a marginal distribution is fitted while a conditional distribution is held fixed.

Joint work with Thomas Richardson, Department of Statistics, University of Washington.


Thursday, April 22nd

Optimizing Neural Network Architecture for Prediction of Protein Secondary Structure
Dr. Blythe Durbin
Division of Biostatistics, UC Berkeley

Neural networks are a popular method for predicting the secondary structure of a protein from its amino acid sequence (Rost and Sander, 1993; Jones, 1999). However, overfitting poses a serious obstacle to the effective use of neural networks for this and other problems. Due to the huge number of parameters in a typical neural network, one may obtain a network fit which perfectly predicts the training data yet fails to generalize to other data sets. Overfitting may be avoided by altering the network topology so that some connections are removed, thus reducing the total number of parameters. In the area of secondary structure prediction, work has focused on optimizing the network architecture by hand based on subject-matter knowledge (Riis and Krogh, 1996). We propose instead a method for selecting an optimal network architecture in a data-adaptive fashion using the Deletion/Substitution/Addition algorithm introduced in Sinisi and van der Laan (2003) and Molinaro and van der Laan (2003), and demonstrate the application of this approach to protein secondary structure prediction.

Jones, D. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195--202.

Molinaro, A. and van der Laan, M. (2003) A Deletion/Substitution/Addition algorithm for partitioning the covariate space in prediction. Technical report, Division of Biostatistics, UC Berkeley. (In preparation).

Riis, S.K. and Krogh, A. (1996) Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J. Comp. Biol., 3, 163--183.

Rost, B. and Sander, C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584--599.

Sinisi, S. and van der Laan, M. (2003) A general Deletion/Substitution/Addition algorithm in prediction. Technical report, Division of Biostatistics, UC Berkeley. (In preparation).



Thursday, April 29th

Statistical complications of using molecular genotyping to improve estimates of antimalarial treatment efficacy: right-censored data with missing indicators of failure
Professor Alan Hubbard
Divisions of Biostatistics and Environmental Health, UC Berkeley

In sub-Saharan Africa, malaria remains a leading cause of morbidity and mortality, and its control depends on the provision of prompt and effective treatment. However, the rapid spread of antimalarial drug resistance has threatened this approach and forced countries to re-evaluate their treatment policies. These government policies are guided by clinical trials in which patients with symptomatic disease are treated with drugs of interest and followed for 28 days to assess the proportion of patients who fail therapy. However, because of the high incidence of malaria in many parts of Africa, it is difficult to determine whether recurrent disease during the 28-day follow-up period is due to resistant parasites present at the time therapy was initiated (treatment failure) or to new infections. Molecular "fingerprinting" techniques have been developed to help make this distinction by comparing parasite strains present on the day recurrent disease occurred with strains present on the day treatment was initiated. When all strains are the same, outcomes are classified as treatment failures (failure time of interest), and when all strains are different, outcomes are classified as new infections (censoring time). Controversy exists over how to classify outcomes when recurrent disease is associated with both parasite strains present at the time of initial therapy and new parasite strains (termed indeterminate outcomes) - that is, when the indicator of failure is missing (it could be a treatment failure or a new infection). We explore inverse probability of censoring weighting (IPCW) and other techniques for estimating survival when failure indicators are missing.
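
As a hedged sketch of the inverse-weighting idea (not necessarily the estimator developed in the talk), one can upweight the outcomes whose genotyping classification is determinate so that they stand in for the indeterminate ones:

```latex
% Let T_i be the time of recurrent disease, \Delta_i the (possibly unobserved)
% indicator that the recurrence is a true treatment failure, R_i the indicator
% that the genotyping classification is determinate, and
% \pi(X_i) = P(R_i = 1 \mid X_i). A generic inverse-probability-weighted
% estimating equation for a survival parameter \beta, with U a chosen
% full-data estimating function, reweights the complete cases:
\sum_{i=1}^{n} \frac{R_i}{\hat{\pi}(X_i)}\, U(T_i, \Delta_i, X_i; \beta) \;=\; 0.
```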



Thursday, May 6th

Predictive identification of exonic splicing enhancers in human genes
Professor Ru-Fang Yeh
Department of Epidemiology & Biostatistics
University of California, San Francisco


Pre-messenger RNA splicing, which is responsible for the precise removal of introns from primary transcripts, is an essential step in the expression of most eukaryotic genes, especially in humans. It is estimated that at least 15% of human disease-causing point mutations involve splicing defects, and that more than half of human genes undergo alternative splicing, underscoring the importance of splicing regulation.
The conserved basal signals marking splice junctions are essential but not sufficient for precise exon recognition, suggesting substantial contributions from additional, unknown features. We have developed a computational method, RESCUE, which predicts which sequences have functional activity by statistical analysis of exon-intron and splice site composition. Using large datasets of human gene sequences, this method identified many known, as well as novel, enhancer and suppressor motifs in exons and introns. Representatives of all ten RESCUE-predicted exonic splicing enhancer (ESE) motifs were shown to possess significant enhancer activity in vivo, while point mutants of these sequences exhibited sharply reduced activity. Analysis of single nucleotide polymorphism data also showed strong selection against mutations predicted to disrupt ESE function, supporting the functional roles of these predicted ESEs. The motifs identified enable prediction of the splicing phenotypes of sequence mutations in human genes and will aid in computational gene finding.
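
A generic way to quantify such compositional signals is a word-enrichment log-odds score of the following form; this is an illustration only and is not claimed to be the statistic used by RESCUE:

```latex
% Enrichment of a short word w (e.g. a hexamer) in exonic versus intronic
% sequence; large positive scores flag candidate exonic splicing enhancers.
S(w) \;=\; \log_{2} \frac{f_{\mathrm{exon}}(w)}{f_{\mathrm{intron}}(w)}.
```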
                                                                               


Thursday, May 13th

Phylogenetic closure operations
Professor Michael Steel
Biomathematics Research Centre
 University of Canterbury
 Christchurch, New Zealand


Multi-state characters (functions from a set X of species into a state space S) can be highly informative about evolutionary relationships, particularly when the state space S is large (as occurs with certain types of genomic data, such as gene order). For example, if we want to reconstruct a binary tree with (say) 1 million leaves and apply compatibility techniques, we will need at least 999,997 binary characters; yet it can be shown that only 4 non-binary ones can suffice. However, this optimistically small number applies only if we are very lucky, or get to sit in judgement and decide which characters nature will provide for us as data - what if the characters evolve 'randomly' according to some Markov process? And what if we use other approaches to tree reconstruction rather than compatibility? In this talk I will survey some of the recent combinatorial and stochastic aspects of these questions for which phylogenetic closure operations have proved a very useful analytical tool. I will describe these operations in detail along with their combinatorial properties, and show how they can also be applied to other problems, such as the construction of phylogenetic 'super-networks'.
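
The figure of 999,997 quoted above presumably reflects the edge count of a fully resolved tree: an unrooted binary tree on n leaves has n - 3 internal edges, and each compatible binary character can resolve at most one of them, so

```latex
n - 3 \;=\; 10^{6} - 3 \;=\; 999{,}997.
```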
                                                              

Thursday, May 27th

Phylogenetic Invariants
Seth Sullivant
Department of Mathematics, UC Berkeley
 

Statistical models of evolution are algebraic varieties in the space of joint distributions on the leaf colorations of a phylogenetic tree.  The phylogenetic invariants of a model are the polynomials which vanish on the variety.  I'll discuss recent progress on identifying these phylogenetic invariants and explain how they might be useful in practice. No knowledge of polynomial algebra will be assumed. 
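
Schematically, writing p_sigma for the joint probability of leaf coloration sigma, the model is the (closure of the) image of a polynomial parametrization, and its phylogenetic invariants are the polynomials in the ideal of that image; the notation below is illustrative:

```latex
V \;=\; \overline{\{\, \phi(\theta) \;:\; \theta \in \Theta \,\}} \subseteq \Delta,
\qquad
I(V) \;=\; \{\, f \in \mathbb{R}[p_{\sigma}] \;:\; f(p) = 0 \ \text{for all } p \in V \,\}.
```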

This is joint work with Nicholas Eriksson and Bernd Sturmfels.