PH 296 Spring 2002 Index
Seminar - Spring 2002
Maximum likelihood substitution rates using the EM algorithm
The usual model for amino acid substitution in molecular evolution and sequence analysis is a continuous-time finite-state Markov chain. For example, the PAM series of substitution matrices use such a model. In fact, continuous-time Markov models have wide application throughout biology (e.g. as models for ligand-gated ion channels, whose conductivity state can be measured by patch-clamp methods) and indeed in physics, chemistry, operations research and economics. Previously, the parameters for such models --- i.e. the substitution rates --- have been estimated in molecular evolution either by considering very short pairwise branches in isolation (discarding a lot of useful information), by computationally intensive numerical optimisation, or by parametric simplifications that constrain the richness of the models.
In this talk I will describe how to estimate these rates directly, by
combining the Expectation Maximisation algorithm with Pearl's belief
propagation algorithm. I will also derive the Gamma-Dirichlet conjugate
prior for rate models. As an example application, I'll describe our
(GPL-licensed) software to train rate matrices with hidden "site class"
variables (representing biochemical context) from multiple alignments, and
show how the introduction of such hidden variables gives a steady
improvement in alignment accuracy when a probabilistic indel model is
used. Finally, I will discuss some further applications of this work in
bioinformatics (particularly in sequence evolution) and biophysics.
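To make the underlying model concrete, here is a minimal sketch (not the speaker's implementation) of the continuous-time Markov chain machinery: a rate matrix Q whose rows sum to zero is exponentiated to give the substitution probabilities P(t) = exp(Qt) along a branch of length t, and EM alternates between computing expected transition counts and waiting times under the current Q and re-estimating the rates from them. All numerical values below are hypothetical.

```python
import numpy as np
from scipy.linalg import expm

# Toy 4-state (nucleotide) rate matrix; the off-diagonal rates are invented.
Q = np.array([[0.0, 0.3, 0.1, 0.1],
              [0.3, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 0.3],
              [0.1, 0.1, 0.3, 0.0]])
np.fill_diagonal(Q, -Q.sum(axis=1))  # rows of a rate matrix sum to zero

def log_likelihood(counts, t):
    """Log-likelihood (up to a constant) of pair counts at branch length t."""
    P = expm(Q * t)                  # substitution probabilities P(t)
    return float(np.sum(counts * np.log(P)))

# Hypothetical counts of aligned ancestor/descendant base pairs.
counts = np.array([[90, 4, 3, 3],
                   [4, 90, 3, 3],
                   [3, 3, 90, 4],
                   [3, 3, 4, 90]])
print(log_likelihood(counts, t=0.1))

# The EM M-step would set each rate q_ij to the expected number of i -> j
# transitions divided by the expected time spent in state i, with the
# expectations computed in the E-step given the current Q.
```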
Poisson model for the coverage problem with
a genomic application
Suppose a population has infinitely many
individuals and is partitioned into an unknown number
$N$ of disjoint classes.
The sample coverage of a random sample from the population
is the total proportion of the classes
observed in the sample. This paper uses a
non-parametric Poisson mixture model to
give new understanding and results for
inference on the sample coverage.
The Poisson mixture model provides
a simplified framework to infer
any general abundance-$\mathcal{K}$ coverage,
the sum of the proportions of those classes
that contribute exactly $k$ individuals in the sample
for some $k$ in $\mathcal{K}$,
with $\mathcal{K}$ being a set of nonnegative integers.
A new moment-based derivation of the well-known Turing
estimators is presented, thereby providing new insight into them. As an application, a gene categorization
problem in genomic research is addressed.
Since Turing's non-parametric approach is a moment-based method,
maximum likelihood estimation and minimum distance
estimation are indicated as alternatives for the coverage problem.
Finally, it will be shown that any Turing estimator
is asymptotically fully efficient in the mixture model.
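For concreteness, the classical Turing coverage estimator discussed above has a one-line form: the estimated coverage is 1 - f_1/n, where f_1 is the number of classes observed exactly once (singletons) and n is the sample size. A minimal illustration:

```python
from collections import Counter

def turing_coverage(sample):
    """Turing estimate of sample coverage: 1 - (number of singletons) / n."""
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return 1.0 - f1 / len(sample)

sample = ["a", "a", "b", "c", "c", "c", "d"]  # "b" and "d" are singletons
print(turing_coverage(sample))                # 1 - 2/7, about 0.714
```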
Computational modeling of protein superfamily evolution
In this talk, I will present a method used to construct phylogenetic trees, identify subfamilies, and predict critical positions in protein molecules. This method employs agglomerative clustering to create the tree structure, and combines Dirichlet mixture priors and relative entropy to estimate the evolutionary relatedness of sequences and subgroups in the input multiple sequence alignment. Minimum description length principles are then employed to obtain a cut of the tree into subtrees to define the subfamilies. This method, Bayesian Evolutionary Tree Estimation (BETE), has been used at Celera Genomics to annotate the human genome with molecular function. BETE can also be used to predict binding pocket positions, with results shown on the SH2 domain family.
Reference: ismb98.ps.Z
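As a sketch of one ingredient of the method, relative entropy (Kullback-Leibler divergence) between two estimated amino-acid distributions can serve as the measure of divergence between profiles during agglomerative clustering. The distributions below are hypothetical, and the full method additionally regularizes them with Dirichlet mixture priors before comparison:

```python
import numpy as np

def relative_entropy(p, q):
    """D(p || q) in nats, for probability vectors over the 20 amino acids."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.full(20, 0.05)                  # a uniform profile column
q = np.array([0.24] + [0.04] * 19)     # a column enriched for one residue
print(relative_entropy(p, q))
```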
Dr. Sjölander completed her Ph.D. in 1997 at U.C. Santa Cruz, where she worked with Professor David Haussler on computational tools for problems in molecular biology. While the Santa Cruz group became best known for the application of hidden Markov models to protein modeling, Dr. Sjölander's work at UCSC also included the development of Dirichlet mixture priors, stochastic context-free grammars for RNA structure prediction, methods for protein fold prediction, and phylogenetic tree construction and subfamily classification (BETE). Following completion of her Ph.D., she joined Molecular Applications Group, a bioinformatics startup company founded by Stanford professor Michael Levitt, and continued work on algorithm development for protein superfamily analysis. As Chief Scientist of MAG, she oversaw the development of the Panther technology, a software suite for large-scale protein classification. In 1999, Celera Genomics acquired the MAG Panther
technology and personnel, and used this technology to classify genes, as described in the Science issue devoted to the human genome.
Modeling Molecular Substitution
Molecular substitution refers to the replacement of one DNA base
by another as chromosomes are passed down. Such events are inferred from
aligned homologous DNA and protein sequences. Substitution models have
been used to estimate mutation rates and to test hypotheses which arise
from population genetics. More recently, such models have been used to derive
substitution scores for sequence alignment. In this talk, a detailed description of modeling substitution by Markov chains will be given.
Maximum likelihood estimation of the model parameters is applied to some
data sets. I will also indicate some future directions that arise as more
sequence data are accumulated.
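As a worked example of maximum-likelihood estimation in the simplest such Markov model (the Jukes-Cantor model, used here purely for illustration), the ML distance between two aligned sequences with a fraction p of differing sites has the closed form d = -(3/4) ln(1 - 4p/3):

```python
import math

def jukes_cantor_distance(seq1, seq2):
    """ML substitution distance under the Jukes-Cantor model."""
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    p = diffs / len(seq1)              # observed fraction of differing sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGAAC"))  # 1 mismatch in 10
```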
A new method to identify significant clusters in gene expression data.
Clustering algorithms have been widely applied to gene expression data.
For both hierarchical and partitioning clustering algorithms, selecting
the number of significant clusters is an important problem and many
methods have been proposed. Existing methods tend to find only the global
patterns in the data (e.g. two clusters consisting of the over- and
under-expressed genes). We have noted the need for a better method in the gene
expression context, where small, biologically meaningful clusters can be
difficult to identify. In this talk, I will define a new criterion, Mean
Split Silhouette (MSS), which is a measure of cluster heterogeneity. I
propose to choose the number of clusters as that which minimizes MSS. In
this way, the number of significant clusters is defined as that which
produces the most homogeneous clusters. The power of this method compared
to existing methods is demonstrated on simulated microarray data. The
minimum MSS method can be applied to any clustering routine and has a wide
range of applications.
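As a rough sketch of the machinery: the standard silhouette width compares each element's average dissimilarity to its own cluster with that to the nearest other cluster. MSS (not reproduced in full here) instead re-clusters each cluster and averages the resulting split silhouettes, choosing the number of clusters that minimizes this heterogeneity measure. The toy data and the choice of k-means below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)),    # toy "expression" data with
               rng.normal(4, 1, (30, 5))])   # two well-separated groups

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Average silhouette width (higher = better separated clusters); MSS
    # would instead split each cluster and average the split silhouettes.
    print(k, silhouette_samples(X, labels).mean())
```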
Recognition of Core Promoters in the Drosophila Genome
The first step in the regulation of gene expression is the transcription of genes by RNA polymerase. The polymerase recognizes the transcription start site by interaction with additional factors that bind to the DNA, often in the so-called promoter regions around the transcription start site. In higher eukaryotes, the promoter region can be far upstream of the actual coding part of a gene, which makes it necessary to come up with specific models of promoters. I will talk about our McPromoter system, developed at the University of Erlangen and at the Berkeley Drosophila Genome Project. McPromoter uses a probabilistic approach similar to gene finding systems such as GenScan, aiming at the exact localization of transcription start sites. We also studied models that combine promoter sequence features with features derived from DNA structural properties.
The system is currently used to annotate promoters in the complete
Drosophila genome, and I will point out some differences
that can be seen between vertebrate (human) and invertebrate (Drosophila) promoter recognition.
Life in the cDNA and the Affymetrix worlds - two studies on gene
expression measurement
Work on two microarray projects will be presented. The first is
a study of cell death in the mouse brain, performed on twelve cDNA dye-swap
pairs. The main issues are chip quality, design, and normalization. We
compare the performance of two different image analysis programs (Spot and
Incyte). Normalization based on print-tip groups is performed, and ranked
gene lists are reported. The second part of the talk is on work in
progress on Affymetrix chips. The chips measure gene expression in a
number of different Drosophila mutants. The oligonucleotide chip
technology will be reviewed, and several recent analysis methods for this
kind of chip will be explained.
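A hedged sketch of the print-tip normalization mentioned above: within each print-tip group, a loess curve of the log-ratio M = log2(R/G) against the average log-intensity A is fitted and subtracted, removing intensity-dependent dye bias separately for each tip. The data are simulated and the loess span is an arbitrary choice:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
n = 400
A = rng.uniform(6, 14, n)                    # average log2 intensity per spot
tip = rng.integers(0, 4, n)                  # print-tip group of each spot
M = 0.1 * (A - 10) + rng.normal(0, 0.3, n)   # simulated dye bias plus noise

M_norm = np.empty_like(M)
for g in range(4):
    idx = tip == g
    fit = lowess(M[idx], A[idx], frac=0.4, return_sorted=False)
    M_norm[idx] = M[idx] - fit               # subtract the per-tip trend

print(M.std(), M_norm.std())                 # spread shrinks after normalization
```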
DNA hybridization modeling and its application to the design of microarrays
In this presentation we discuss a mathematical model characterizing some of
the effects of the molecular interactions involved in the hybridization process
of sample to oligonucleotide probes on microarrays. The model is derived from published nearest neighbor model parameters for DNA:DNA, DNA:RNA, and RNA:RNA hybridization and characterizes potential adverse effects of molecular interactions
such as oligo folding, RNA sample folding, and nonspecific RNA:RNA or RNA:DNA interactions on the hybridization of sample to its target probe. The model predictions are applied to determine optimal procedures for sample preparation and probe design necessary to maximize sensitivity and specificity while minimizing variation in probe affinity. Experimental results confirming the predictions of the model are then presented.
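The flavor of the nearest-neighbor calculation can be sketched as follows: duplex stability is approximated by summing a free-energy term for each adjacent pair of base pairs (a "stack"). The numbers below are illustrative values roughly in the range of published DNA:DNA parameters (kcal/mol at 37 C), not the model's actual tables, which also include initiation and mismatch terms and separate DNA:RNA and RNA:RNA parameter sets:

```python
# Stacking free energies keyed by the top-strand dinucleotide; by symmetry
# the ten entries cover all sixteen stacks via reverse complementation.
NN_DG = {"AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
         "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84}
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def stack_dg(strand):
    """Approximate Delta-G of a perfect duplex from its top-strand sequence."""
    total = 0.0
    for i in range(len(strand) - 1):
        pair = strand[i:i + 2]
        if pair not in NN_DG:            # look up the reverse complement
            pair = COMP[strand[i + 1]] + COMP[strand[i]]
        total += NN_DG[pair]
    return total

print(stack_dg("ACGTGCCA"))              # more negative = more stable duplex
```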
Comparative Microbial Genomics: What Are We Learning?
An introduction to stochastic context-free
grammars
Hidden Markov models (HMMs) have been successfully applied to a
variety of problems in molecular biology. However, HMMs can't deal with
long-distance pairwise correlations among bases. The language of RNA is
dominated by nested pairwise correlations corresponding to base-pairing
(secondary structure). Stochastic context-free grammars (SCFGs) have been
used to create probabilistic models that describe this RNA secondary
structure. In this talk, I'll give an introduction to SCFGs.
A formal definition will be given and the corresponding parsing push-down
automata will be introduced. I will also talk about the SCFG algorithms
for sequence modelling. Some basic applications of SCFGs to RNA
secondary structure prediction will be introduced at the end.
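To give the flavor of these algorithms, here is a toy version of the inside computation: the probability that the grammar derives a subsequence is built up from shorter subsequences. The grammar, its rule probabilities, and the pair emission probabilities are invented for illustration, and the bifurcation rule S -> S S needed for real multi-branched RNA structures is omitted:

```python
from functools import lru_cache

# Toy grammar (probabilities hypothetical):
#   S -> x S y   prob 0.6 * e_pair(x, y)   (emit a base pair)
#   S -> x S     prob 0.3 * 0.25           (emit an unpaired base)
#   S -> empty   prob 0.1
PAIR = {("A", "U"): 0.25, ("U", "A"): 0.25,
        ("C", "G"): 0.25, ("G", "C"): 0.25}   # canonical pairs only

def inside(seq):
    @lru_cache(maxsize=None)
    def p(i, j):                              # prob. that S derives seq[i:j]
        if i == j:
            return 0.1                        # S -> empty
        total = 0.3 * 0.25 * p(i + 1, j)      # leftmost base unpaired
        if j - i >= 2:                        # leftmost and rightmost paired
            total += 0.6 * PAIR.get((seq[i], seq[j - 1]), 0.0) * p(i + 1, j - 1)
        return total
    return p(0, len(seq))

print(inside("GCAUGC"))   # total probability of the sequence under the grammar
```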
Predicting reliable regions in protein sequence alignments
Human crossover interference
Statistical analyses of human genetic data are generally performed with the assumption that the locations of crossovers in meiosis follow a Poisson process. Data on experimental organisms suggest that meiosis exhibits positive crossover interference: crossovers tend not to occur too close together. Using data on more than 8,000 genetic markers typed on eight large families, we have demonstrated the presence of positive crossover interference in human meiosis and further characterized its extent. We fit a gamma renewal process, which had previously been found to serve as a good model for meiosis in experimental organisms. We will briefly describe several surprising findings that came out of this work, emphasizing the importance of pursuing aberrations in data. This is joint work with James L. Weber, Marshfield Medical Research Foundation.
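A small simulation conveys the contrast between the two models: under the Poisson assumption, inter-crossover distances are exponential (a gamma renewal process with shape 1), while positive interference corresponds to a shape parameter greater than 1, which makes spacings more regular and discourages nearby crossovers. All parameters below are illustrative:

```python
import numpy as np

def simulate_crossovers(length_morgans, shape, rng):
    """Crossover positions under a gamma renewal process with mean spacing 1."""
    positions, x = [], 0.0
    while True:
        x += rng.gamma(shape, 1.0 / shape)   # mean inter-crossover distance 1
        if x > length_morgans:
            return positions
        positions.append(x)

rng = np.random.default_rng(2)
for shape in (1.0, 4.0):   # 1.0 = Poisson (no interference); 4.0 = interference
    gaps = []
    for _ in range(2000):
        gaps.extend(np.diff(simulate_crossovers(3.0, shape, rng)))
    print(shape, np.std(gaps))   # interference reduces spacing variability
```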
Predicting homologous gene structures and exonic
splicing enhancers in the human genome
The sequence of the human genome provides the foundation for new approaches to study the organization and functions of human genes. In this talk, I will demonstrate the use of sequence analysis methods to address two different but closely related problems - identification of genes and of exonic splicing enhancers. A major challenge following the completion of the human genome project is to identify the locations and encoded protein sequences of all human genes. We have developed GenomeScan, a new gene identification program which combines the power of an ab initio gene-finding algorithm, as in Genscan, with database search results (such as blastX) in an integrated model. Accuracy from extensive testing, and results of the application of GenomeScan to 2.7 billion bases of publicly available human genomic DNA, will be discussed. The vast amount of sequence data also allows us to study the association of sequence content with various biological processes. Our PROFILER method uses a statistical analysis of exon-intron and splice site composition to screen for short oligonucleotide sequence motifs in exons that enhance pre-mRNA splicing (termed exonic splicing enhancers). Representatives of the predicted motifs were found to possess significant enhancer activity when tested in vivo, while point mutants exhibited sharply reduced activity as predicted. The experimental results verified the ability of PROFILER to predict the splicing phenotypes of exonic mutations in human genes.
A Variance-Stabilizing Transformation for Microarray Data
Many traditional statistical methodologies rely heavily on the assumptions that the data are normally (or at least symmetrically) distributed, with constant variance. Gene-expression microarray data have a complicated distributional structure, which makes transformation necessary prior to analysis with standard techniques. At low expression levels (near the expression background) the data appear to be normally distributed with constant variance, and at high expression levels the data appear to have a lognormal distribution with constant coefficient of variation. Rocke and Durbin (2001) introduced a model for measurement error in microarray data which incorporates the error distribution at both low and high expression levels. I will introduce a transformation for microarray data which exactly stabilizes the delta-method variance of data distributed according to this model, as well as discussing a likelihood-based approach to estimation of the transformation parameter in the spirit of Box and Cox (1964).
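One member of the generalized-log ("glog") family of transformations illustrates the idea: glog(y) = ln((y - alpha) + sqrt((y - alpha)^2 + c)) behaves like log(y) at high intensities and is roughly linear near the background level alpha, so it interpolates between the two regimes of the error model. The values of alpha and c below are illustrative rather than fitted; in practice c is a function of the estimated model variances:

```python
import numpy as np

def glog(y, alpha=24.0, c=1200.0):
    """Generalized-log transform; alpha and c are illustrative, not fitted."""
    z = y - alpha
    return np.log(z + np.sqrt(z ** 2 + c))

y = np.array([25.0, 50.0, 500.0, 5000.0, 50000.0])
print(glog(y))   # compresses high intensities, stays near-linear at background
```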
Combinatorial Approaches to Haplotyping
The next high-priority phase of human genomics will involve the development of a full Haplotype Map of the human genome. It will be used in large-scale screens of populations to associate specific haplotypes with specific complex, genetically influenced diseases. However, most studies will collect genotype data rather than haplotype data, requiring the deduction of haplotype information from genotype data. Input to the haplotyping problem is a set of N genotypes, and output is an equal-size set of haplotype pairs that "explain" the genotypes. Several computational approaches to this problem have been developed and used. Most of these use a statistical framework, such as MLE, or statistical computing methods such as EM, Gibbs sampling, etc. We have developed several distinct combinatorial approaches to the haplotyping problem, and I will talk about a few of the most recent of these.

One approach, the "pure parsimony" approach, is to find N pairs of haplotypes, one for each genotype, that explain the N genotypes and MINIMIZE the number of distinct haplotypes used. Solving this problem is NP-hard; however, for reasonably sized data (larger than in general use today), the pure-parsimony solution can be efficiently found in practice. I will also talk about an approach that mixes pure parsimony with Clark's subtraction method for haplotyping. Simulations show that the efficiency of both methods depends positively on the level of recombination - the more recombination, the more efficiency - but the accuracy depends inversely on the level of recombination.

I will also talk about my most recent approach, based on viewing the haplotyping problem in the context of perfect phylogeny. The biological motivation for this is the surprising fact that genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected. This, along with the infinite-sites assumption, implies that any "permitted" solution to the haplotyping problem should fit a perfect phylogeny. This is a severe combinatorial constraint on the permitted solutions, and it leads to an efficient deterministic algorithm to deduce all features of the permitted haplotype solution(s) that can be known with certainty. We obtain a) an efficient algorithm to find (if one exists) one permitted solution to the haplotype problem; b) a simple test to determine if it is the unique permitted solution; c) an efficient way to count the number of permitted solutions; and d) an efficient way to implicitly represent the set of all permitted solutions so that each can be efficiently created. As a by-product, we prove that the number of permitted solutions is bounded by 2^k, where k is less than or equal to the number of sites. This is a huge reduction from the number of solutions if we do not require that a solution fit a perfect phylogeny.
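To make the pure-parsimony objective concrete, here is a toy exhaustive solver (illustrative only; real instances need the machinery described above, since this enumeration is exponential). Genotypes are coded per site as 0/1 for the two homozygotes and 2 for heterozygous; a haplotype pair explains a genotype if it matches at homozygous sites and differs at heterozygous sites:

```python
from itertools import product

def consistent_pairs(genotype):
    """All haplotype pairs (tuples of 0/1) that explain one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    base = [g if g != 2 else 0 for g in genotype]
    pairs = []
    for bits in product([0, 1], repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b          # heterozygous sites must differ
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def pure_parsimony(genotypes):
    """Smallest set of distinct haplotypes resolving all genotypes."""
    options = [consistent_pairs(g) for g in genotypes]
    best = None
    for choice in product(*options):         # brute force over resolutions
        used = {h for pair in choice for h in pair}
        if best is None or len(used) < len(best):
            best = used
    return best

genotypes = [(2, 2, 0), (0, 2, 0), (2, 0, 0)]
print(sorted(pure_parsimony(genotypes)))     # 3 haplotypes suffice here
```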
Assessing genetic differences in gene expression among natural isolates of Saccharomyces cerevisiae
Changes in gene expression may play an important role in evolution. If so, there must be genetic variation in gene expression in natural populations. To quantify this variation, I have compared the expression profiles of six natural woodland isolates of Saccharomyces cerevisiae using spotted microarrays. Pseudo-replication reveals a substantial amount of error in the measurement of gene expression differences. This error can be attributed to labeling, hybridization, and scanning. Despite this error, a number of expression differences were found among the 4894 genes surveyed. The number of expression differences between any two strains varied from none to nearly 60. Biological implications will be discussed.
last updated April 30, 2002