PH 296 Spring 2002 Index
Seminar - Spring 2002
Maximum likelihood substitution rates using the EM algorithm
The usual model for amino acid substitution in molecular evolution and sequence analysis is a continuous-time finite-state Markov chain. For example, the PAM series of substitution matrices use such a model. In fact, continuous-time Markov models have wide application throughout biology (e.g. as models for ligand-gated ion channels, whose conductivity state can be measured by patch-clamp methods) and indeed in physics, chemistry, operations research and economics. Previously, the parameters for such models --- i.e. the substitution rates --- have been estimated in molecular evolution either by considering very short pairwise branches in isolation (discarding a lot of useful information), by computationally intensive numerical optimisation, or by parametric simplifications that constrain the richness of the models.
In this talk I will describe how to estimate these rates directly, by
combining the Expectation Maximisation algorithm with Pearl's belief
propagation algorithm. I will also derive the Gamma-Dirichlet conjugate
prior for rate models. As an example application, I'll describe our
(GPL-licensed) software to train rate matrices with hidden "site class"
variables (representing biochemical context) from multiple alignments, and
show how the introduction of such hidden variables gives a steady
improvement in alignment accuracy when a probabilistic indel model is
used. Finally, I will discuss some further applications of this work in
bioinformatics (particularly in sequence evolution) and biophysics.
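To make the underlying model concrete, here is a minimal sketch (not the speaker's implementation) of the continuous-time Markov chain machinery: a rate matrix Q whose rows sum to zero is exponentiated to give the substitution probabilities P(t) = exp(Qt) along a branch of length t, and EM alternates between computing expected transition counts and waiting times under the current Q and re-estimating the rates from them. All numerical values below are hypothetical.

```python
import numpy as np
from scipy.linalg import expm

# Toy 4-state (nucleotide) rate matrix; the off-diagonal rates are invented.
Q = np.array([[0.0, 0.3, 0.1, 0.1],
              [0.3, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 0.3],
              [0.1, 0.1, 0.3, 0.0]])
np.fill_diagonal(Q, -Q.sum(axis=1))  # rows of a rate matrix sum to zero

def log_likelihood(counts, t):
    """Log-likelihood (up to a constant) of pair counts at branch length t."""
    P = expm(Q * t)                  # substitution probabilities P(t)
    return float(np.sum(counts * np.log(P)))

# Hypothetical counts of aligned ancestor/descendant base pairs.
counts = np.array([[90, 4, 3, 3],
                   [4, 90, 3, 3],
                   [3, 3, 90, 4],
                   [3, 3, 4, 90]])
print(log_likelihood(counts, t=0.1))

# The EM M-step would set each rate q_ij to the expected number of i -> j
# transitions divided by the expected time spent in state i, with the
# expectations computed in the E-step given the current Q.
```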
Poisson model for the coverage problem with
a genomic application
Suppose a population has infinitely many
individuals and is partitioned into an unknown number
$N$ of disjoint classes.
The sample coverage of a random sample from the population
is the total proportion of the classes
observed in the sample. This paper uses a
non-parametric Poisson mixture model to
give new understanding and results for
inference on the sample coverage.
The Poisson mixture model provides
a simplified framework to infer
any general abundance-$\mathcal{K}$ coverage,
the sum of the proportions of those classes
that contribute exactly $k$ individuals in the sample
for some $k$ in $\mathcal{K}$,
with $\mathcal{K}$ being a set of nonnegative integers.
A new moment-based derivation of the well-known Turing
estimators is presented, thereby providing new insight into them. As an application, a gene categorization
problem in genomic research is addressed.
Since Turing's non-parametric approach is a moment-based method,
maximum likelihood estimation and minimum distance
estimation are indicated as alternatives for the coverage problem.
Finally, it will be shown that any Turing estimator
is asymptotically fully efficient in the mixture model.
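For concreteness, the classical Turing coverage estimator discussed above has a one-line form: the estimated coverage is 1 - f_1/n, where f_1 is the number of classes observed exactly once (singletons) and n is the sample size. A minimal illustration:

```python
from collections import Counter

def turing_coverage(sample):
    """Turing estimate of sample coverage: 1 - (number of singletons) / n."""
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return 1.0 - f1 / len(sample)

sample = ["a", "a", "b", "c", "c", "c", "d"]  # "b" and "d" are singletons
print(turing_coverage(sample))                # 1 - 2/7, about 0.714
```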
Computational modeling of protein superfamily evolution
In this talk, I will present a method used to construct phylogenetic trees, identify subfamilies, and predict critical positions in protein molecules. This method employs agglomerative clustering to create the tree structure, and combines Dirichlet mixture priors and relative entropy to estimate the evolutionary relatedness of sequences and subgroups in the input multiple sequence alignment. Minimum description length principles are then employed to obtain a cut of the tree into subtrees to define the subfamilies. This method, Bayesian Evolutionary Tree Estimation (BETE), has been used at Celera Genomics to annotate the human genome with molecular function. BETE can also be used to predict binding pocket positions, with results shown on the SH2 domain family.
Reference: ismb98.ps.Z
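As a sketch of one ingredient of the method, relative entropy (Kullback-Leibler divergence) between two estimated amino-acid distributions can serve as the measure of divergence between profiles during agglomerative clustering. The distributions below are hypothetical, and the full method additionally regularizes them with Dirichlet mixture priors before comparison:

```python
import numpy as np

def relative_entropy(p, q):
    """D(p || q) in nats, for probability vectors over the 20 amino acids."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.full(20, 0.05)                  # a uniform profile column
q = np.array([0.24] + [0.04] * 19)     # a column enriched for one residue
print(relative_entropy(p, q))
```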
Dr. Sjölander completed her Ph.D. in 1997 at U.C. Santa Cruz, where she worked with Professor David Haussler on computational tools for problems in molecular biology. While the Santa Cruz group became best known for the application of hidden Markov models to protein modeling, Dr. Sjölander's work at UCSC also included the development of Dirichlet mixture priors, stochastic context-free grammars for RNA structure prediction, methods for protein fold prediction, and phylogenetic tree construction and subfamily classification (BETE). Following completion of her Ph.D., she joined Molecular Applications Group, a bioinformatics startup company founded by Stanford professor Michael Levitt, and continued work on algorithm development for protein superfamily analysis. As Chief Scientist of MAG, she oversaw the development of the Panther technology, a software suite for large-scale protein classification. In 1999, Celera Genomics acquired the MAG Panther
technology and personnel, and used this technology to classify genes, as described in the Science issue devoted to the human genome.
Modeling Molecular Substitution
Molecular substitution refers to the replacement of one DNA base
by another as chromosomes are passed down. Such events are inferred from
aligned homologous DNA and protein sequences. Substitution models have
been used to estimate mutation rates and to test hypotheses which arise
from population genetics. More recently, such models have been used to derive
substitution scores for sequence alignment. In this talk, a detailed description of modeling substitution by Markov chains will be given.
Maximum likelihood estimation of the model parameters is applied to some
data sets. I will also indicate some future directions that arise as more
sequence data are accumulated.
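As a worked example of maximum-likelihood estimation in the simplest such Markov model (the Jukes-Cantor model, used here purely for illustration), the ML distance between two aligned sequences with a fraction p of differing sites has the closed form d = -(3/4) ln(1 - 4p/3):

```python
import math

def jukes_cantor_distance(seq1, seq2):
    """ML substitution distance under the Jukes-Cantor model."""
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    p = diffs / len(seq1)              # observed fraction of differing sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGAAC"))  # 1 mismatch in 10
```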
A new method to identify significant clusters in gene expression data.
Clustering algorithms have been widely applied to gene expression data.
For both hierarchical and partitioning clustering algorithms, selecting
the number of significant clusters is an important problem and many
methods have been proposed. Existing methods tend to find only the global
patterns in the data (e.g. two clusters consisting of the over- and
under-expressed genes). We have noted the need for a better method in the gene
expression context, where small, biologically meaningful clusters can be
difficult to identify. In this talk, I will define a new criterion, Mean
Split Silhouette (MSS), which is a measure of cluster heterogeneity. I
propose to choose the number of clusters as that which minimizes MSS. In
this way, the number of significant clusters is defined as that which
produces the most homogeneous clusters. The power of this method compared
to existing methods is demonstrated on simulated microarray data. The
minimum MSS method can be applied to any clustering routine and has a wide
range of applications.
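As a rough sketch of the machinery: the standard silhouette width compares each element's average dissimilarity to its own cluster with that to the nearest other cluster. MSS (not reproduced in full here) instead re-clusters each cluster and averages the resulting split silhouettes, choosing the number of clusters that minimizes this heterogeneity measure. The toy data and the choice of k-means below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)),    # toy "expression" data with
               rng.normal(4, 1, (30, 5))])   # two well-separated groups

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Average silhouette width (higher = better separated clusters); MSS
    # would instead split each cluster and average the split silhouettes.
    print(k, silhouette_samples(X, labels).mean())
```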
Recognition of Core Promoters in the Drosophila Genome
The first step in the regulation of gene expression is the transcription of genes by RNA polymerase. The polymerase recognizes the transcription start site by interaction with additional factors that bind to the DNA, often in the so-called promoter regions around the transcription start site. In higher eukaryotes, the promoter region can be far upstream of the actual coding part of a gene, which makes it necessary to come up with specific models of promoters. I will talk about our McPromoter system, developed at the University of Erlangen and at the Berkeley Drosophila Genome Project. McPromoter uses a probabilistic approach similar to gene finding systems such as GenScan, aiming at the exact localization of transcription start sites. We also studied models that combine promoter sequence features with features derived from DNA structural properties.
The system is currently used to annotate promoters in the complete
Drosophila genome, and I will point out some differences
that can be seen between vertebrate (human) and invertebrate (Drosophila) promoter recognition.
Life in the cDNA and the Affymetrix worlds - two studies on gene
expression measurement
Work on two microarray projects will be presented. The first is
a study of cell death in the mouse brain, performed on twelve cDNA dye-swap
pairs. The main issues are chip quality, design, and normalization. We
compare the performance of two different image analysis programs (Spot and
Incyte). Normalization based on print-tip groups is performed, and ranked
gene lists are reported. The second part of the talk is on work in
progress on Affymetrix chips. The chips measure gene expression in a
number of different Drosophila mutants. The oligonucleotide chip
technology will be reviewed, and several recent analysis methods for this
kind of chip will be explained.
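A hedged sketch of the print-tip normalization mentioned above: within each print-tip group, a loess curve of the log-ratio M = log2(R/G) against the average log-intensity A is fitted and subtracted, removing intensity-dependent dye bias separately for each tip. The data are simulated and the loess span is an arbitrary choice:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
n = 400
A = rng.uniform(6, 14, n)                    # average log2 intensity per spot
tip = rng.integers(0, 4, n)                  # print-tip group of each spot
M = 0.1 * (A - 10) + rng.normal(0, 0.3, n)   # simulated dye bias plus noise

M_norm = np.empty_like(M)
for g in range(4):
    idx = tip == g
    fit = lowess(M[idx], A[idx], frac=0.4, return_sorted=False)
    M_norm[idx] = M[idx] - fit               # subtract the per-tip trend

print(M.std(), M_norm.std())                 # spread shrinks after normalization
```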
DNA hybridization modeling and its application to the design of microarrays
In this presentation we discuss a mathematical model characterizing some of
the effects of the molecular interactions involved in the hybridization process
of sample to oligonucleotide probes on microarrays. The model is derived from published nearest neighbor model parameters for DNA:DNA, DNA:RNA, and RNA:RNA hybridization and characterizes potential adverse effects of molecular interactions
such as oligo folding, RNA sample folding, and nonspecific RNA:RNA or RNA:DNA interactions on the hybridization of sample to its target probe. The model predictions are applied to determine optimal procedures for sample preparation and probe design necessary to maximize sensitivity and specificity while minimizing variation in probe affinity. Experimental results confirming the predictions of the model are then presented.
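The flavor of the nearest-neighbor calculation can be sketched as follows: duplex stability is approximated by summing a free-energy term for each adjacent pair of base pairs (a "stack"). The numbers below are illustrative values roughly in the range of published DNA:DNA parameters (kcal/mol at 37 C), not the model's actual tables, which also include initiation and mismatch terms and separate DNA:RNA and RNA:RNA parameter sets:

```python
# Stacking free energies keyed by the top-strand dinucleotide; by symmetry
# the ten entries cover all sixteen stacks via reverse complementation.
NN_DG = {"AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
         "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84}
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def stack_dg(strand):
    """Approximate Delta-G of a perfect duplex from its top-strand sequence."""
    total = 0.0
    for i in range(len(strand) - 1):
        pair = strand[i:i + 2]
        if pair not in NN_DG:            # look up the reverse complement
            pair = COMP[strand[i + 1]] + COMP[strand[i]]
        total += NN_DG[pair]
    return total

print(stack_dg("ACGTGCCA"))              # more negative = more stable duplex
```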
Comparative Microbial Genomics: What Are We Learning?
An introduction to stochastic context-free
grammars
Hidden Markov models (HMMs) have been successfully applied to a
variety of problems in molecular biology. However, HMMs can't deal with
long-distance pairwise correlations among bases. The language of RNA is
dominated by nested pairwise correlations corresponding to base-pairing
(secondary structure). Stochastic context-free grammars (SCFGs) have been
used to create probabilistic models that describe this RNA secondary
structure. In this talk, I'll give an introduction to SCFGs.
A formal definition will be given and the corresponding parsing push-down
automata will be introduced. I will also talk about the SCFG algorithms
for sequence modelling. Some basic applications of SCFGs to RNA
secondary structure prediction will be introduced at the end.
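To give the flavor of these algorithms, here is a toy version of the inside computation: the probability that the grammar derives a subsequence is built up from shorter subsequences. The grammar, its rule probabilities, and the pair emission probabilities are invented for illustration, and the bifurcation rule S -> S S needed for real multi-branched RNA structures is omitted:

```python
from functools import lru_cache

# Toy grammar (probabilities hypothetical):
#   S -> x S y   prob 0.6 * e_pair(x, y)   (emit a base pair)
#   S -> x S     prob 0.3 * 0.25           (emit an unpaired base)
#   S -> empty   prob 0.1
PAIR = {("A", "U"): 0.25, ("U", "A"): 0.25,
        ("C", "G"): 0.25, ("G", "C"): 0.25}   # canonical pairs only

def inside(seq):
    @lru_cache(maxsize=None)
    def p(i, j):                              # prob. that S derives seq[i:j]
        if i == j:
            return 0.1                        # S -> empty
        total = 0.3 * 0.25 * p(i + 1, j)      # leftmost base unpaired
        if j - i >= 2:                        # leftmost and rightmost paired
            total += 0.6 * PAIR.get((seq[i], seq[j - 1]), 0.0) * p(i + 1, j - 1)
        return total
    return p(0, len(seq))

print(inside("GCAUGC"))   # total probability of the sequence under the grammar
```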
Predicting reliable regions in protein sequence alignments
Human crossover interference
Statistical analyses of human genetic data are generally performed with the assumption that the locations of crossovers in meiosis follow a Poisson process. Data on experimental organisms suggest that meiosis exhibits positive crossover interference: crossovers tend not to occur too close together. Using data on more than 8,000 genetic markers typed on eight large families, we have demonstrated the presence of positive crossover interference in human meiosis and further characterized its extent. We fit a gamma renewal process, which had previously been found to serve as a good model for meiosis in experimental organisms. We will briefly describe several surprising findings that came out of this work, emphasizing the importance of pursuing aberrations in data. This is joint work with James L. Weber, Marshfield Medical Research Foundation.
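A small simulation conveys the contrast between the two models: under the Poisson assumption, inter-crossover distances are exponential (a gamma renewal process with shape 1), while positive interference corresponds to a shape parameter greater than 1, which makes spacings more regular and discourages nearby crossovers. All parameters below are illustrative:

```python
import numpy as np

def simulate_crossovers(length_morgans, shape, rng):
    """Crossover positions under a gamma renewal process with mean spacing 1."""
    positions, x = [], 0.0
    while True:
        x += rng.gamma(shape, 1.0 / shape)   # mean inter-crossover distance 1
        if x > length_morgans:
            return positions
        positions.append(x)

rng = np.random.default_rng(2)
for shape in (1.0, 4.0):   # 1.0 = Poisson (no interference); 4.0 = interference
    gaps = []
    for _ in range(2000):
        gaps.extend(np.diff(simulate_crossovers(3.0, shape, rng)))
    print(shape, np.std(gaps))   # interference reduces spacing variability
```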
Predicting homologous gene structures and exonic
splicing enhancers in the human genome
The sequence of the human genome provides the foundation for new approaches to study the organization and functions of human genes. In this talk, I will demonstrate the use of sequence analysis methods to address two different but closely related problems - identification of genes and of exonic splicing enhancers. A major challenge following the completion of the human genome project is to identify the locations and encoded protein sequences of all human genes. We have developed GenomeScan, a new gene identification program which combines the power of an ab initio gene-finding algorithm, as in Genscan, with database search results (such as blastX) in an integrated model. Accuracy from extensive testing, and results of the application of GenomeScan to 2.7 billion bases of publicly available human genomic DNA, will be discussed. The vast amount of sequence data also allows us to study the association of sequence content with various biological processes. Our PROFILER method uses a statistical analysis of exon-intron and splice site composition to screen for short oligonucleotide sequence motifs in exons that enhance pre-mRNA splicing (termed exonic splicing enhancers). Representatives of the predicted motifs were found to possess significant enhancer activity when tested in vivo, while point mutants exhibited sharply reduced activity as predicted. The experimental results verified the ability of PROFILER to predict the splicing phenotypes of exonic mutations in human genes.
A Variance-Stabilizing Transformation for Microarray Data
Many traditional statistical methodologies rely heavily on the assumptions that the data are normally (or at least symmetrically) distributed, with constant variance. Gene-expression microarray data have a complicated distributional structure, which makes transformation necessary prior to analysis with standard techniques. At low expression levels (near the expression background) the data appear to be normally distributed with constant variance, and at high expression levels the data appear to have a lognormal distribution with constant coefficient of variation. Rocke and Durbin (2001) introduced a model for measurement error in microarray data which incorporates the error distribution at both low and high expression levels. I will introduce a transformation for microarray data which exactly stabilizes the delta-method variance of data distributed according to this model, as well as discussing a likelihood-based approach to estimation of the transformation parameter in the spirit of Box and Cox (1964).
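One member of the generalized-log ("glog") family of transformations illustrates the idea: glog(y) = ln((y - alpha) + sqrt((y - alpha)^2 + c)) behaves like log(y) at high intensities and is roughly linear near the background level alpha, so it interpolates between the two regimes of the error model. The values of alpha and c below are illustrative rather than fitted; in practice c is a function of the estimated model variances:

```python
import numpy as np

def glog(y, alpha=24.0, c=1200.0):
    """Generalized-log transform; alpha and c are illustrative, not fitted."""
    z = y - alpha
    return np.log(z + np.sqrt(z ** 2 + c))

y = np.array([25.0, 50.0, 500.0, 5000.0, 50000.0])
print(glog(y))   # compresses high intensities, stays near-linear at background
```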
Combinatorial Approaches to Haplotyping
The next high-priority phase of human genomics will involve the development of a full Haplotype Map of the human genome. It will be used in large-scale screens of populations to associate specific haplotypes with specific complex, genetically influenced diseases. However, most studies will collect genotype data rather than haplotype data, requiring the deduction of haplotype information from genotype data. Input to the haplotyping problem is a set of N genotypes, and output is an equal-size set of haplotype pairs that "explain" the genotypes. Several computational approaches to this problem have been developed and used. Most of these use a statistical framework, such as MLE, or statistical computing methods such as EM, Gibbs sampling, etc. We have developed several distinct combinatorial approaches to the haplotyping problem, and I will talk about a few of the most recent of these.

One approach, the "pure parsimony" approach, is to find N pairs of haplotypes, one for each genotype, that explain the N genotypes and MINIMIZE the number of distinct haplotypes used. Solving this problem is NP-hard; however, for reasonably sized data (larger than in general use today), the pure-parsimony solution can be efficiently found in practice. I will also talk about an approach that mixes pure parsimony with Clark's subtraction method for haplotyping. Simulations show that the efficiency of both methods depends positively on the level of recombination - the more recombination, the more efficiency - but the accuracy depends inversely on the level of recombination.

I will also talk about my most recent approach, based on viewing the haplotyping problem in the context of perfect phylogeny. The biological motivation for this is the surprising fact that genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected. This, along with the infinite-sites assumption, implies that any "permitted" solution to the haplotyping problem should fit a perfect phylogeny. This is a severe combinatorial constraint on the permitted solutions, and it leads to an efficient deterministic algorithm to deduce all features of the permitted haplotype solution(s) that can be known with certainty. We obtain a) an efficient algorithm to find (if one exists) one permitted solution to the haplotype problem; b) a simple test to determine if it is the unique permitted solution; c) an efficient way to count the number of permitted solutions; and d) an efficient way to implicitly represent the set of all permitted solutions so that each can be efficiently created. As a by-product, we prove that the number of permitted solutions is bounded by 2^k, where k is less than or equal to the number of sites. This is a huge reduction from the number of solutions if we do not require that a solution fit a perfect phylogeny.
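To make the pure-parsimony objective concrete, here is a toy exhaustive solver (illustrative only; real instances need the machinery described above, since this enumeration is exponential). Genotypes are coded per site as 0/1 for the two homozygotes and 2 for heterozygous; a haplotype pair explains a genotype if it matches at homozygous sites and differs at heterozygous sites:

```python
from itertools import product

def consistent_pairs(genotype):
    """All haplotype pairs (tuples of 0/1) that explain one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    base = [g if g != 2 else 0 for g in genotype]
    pairs = []
    for bits in product([0, 1], repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b          # heterozygous sites must differ
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def pure_parsimony(genotypes):
    """Smallest set of distinct haplotypes resolving all genotypes."""
    options = [consistent_pairs(g) for g in genotypes]
    best = None
    for choice in product(*options):         # brute force over resolutions
        used = {h for pair in choice for h in pair}
        if best is None or len(used) < len(best):
            best = used
    return best

genotypes = [(2, 2, 0), (0, 2, 0), (2, 0, 0)]
print(sorted(pure_parsimony(genotypes)))     # 3 haplotypes suffice here
```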
Assessing genetic differences in gene expression among natural isolates of Saccharomyces cerevisiae
Changes in gene expression may play an important role in evolution. If so, there must be genetic variation in gene expression in natural populations. To quantify this variation, I have compared the expression profiles of six natural woodland isolates of Saccharomyces cerevisiae using spotted microarrays. Pseudo-replication reveals a substantial amount of error in the measurement of gene expression differences. This error can be attributed to labeling, hybridization, and scanning. Despite this error, a number of expression differences were found among the 4894 genes surveyed. The number of expression differences between any two strains varied from none to nearly 60. Biological implications will be discussed.
last updated April 30, 2002