Statistics and Genomics Seminar - Fall 2002
N. B. Some of the speakers have kindly provided the slides for their
presentation. Click on the talk title to download a pdf version of the slides.
Prediction of gene expression in yeast using conserved sequence templates
Regulation of gene expression in response to environmental cues drastically impacts cellular metabolism and development. A fundamental problem is to understand how information underlying transcriptional changes is encoded in genome sequences. Transcriptional control regions for individual genes contain multiple short regulatory sequences that enable gene regulation. Whereas previous approaches incorporate the presence or absence of multiple regulatory sequences in predicting gene expression outcome, most of them ignore positional arrangements of these sequences. This work investigates the utility of positional sequence information as predictors of gene expression in the yeast, Saccharomyces cerevisiae.
I will describe two innovations for analyzing genome sequences. First, sequence conservation among closely related yeast species provides prior information regarding regulatory sequences. Second, a template approach that considers joint distributions of sequence pairs enables increased prediction specificity. The following combined method systematically evaluates the statistical significance of sequence pairs. Enriched joint conservation of sequence pairs is assessed using a chi-square test for independence. Next, sequence pair templates are constructed if conserved word pairs are closely positioned, as tested with a nonparametric bootstrap. Finally, subsets of genes matching these sequence templates are evaluated for coherent gene expression using a Kolmogorov-Smirnov test to compare gene expression distributions.
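As a rough illustration of two of the tests named above, the following sketch computes a Pearson chi-square test for independence on a 2x2 table and a two-sample Kolmogorov-Smirnov statistic. It is not the speaker's implementation. The function names and the layout of the table (rows: first word conserved or not; columns: second word conserved or not) are assumptions made for this example.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 d.f.) for the table [[a, b], [c, d]].

    Assumed layout (hypothetical): a = both words conserved, b = only the first,
    c = only the second, d = neither.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column total must be positive")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    gap = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past every observation equal to v.
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap
```

In this reading, a small chi-square p-value suggests that the two words are conserved together more often than independence would predict, and a large Kolmogorov-Smirnov statistic suggests that genes matching a template have a different expression distribution from the remaining genes.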
Keywords: gene expression, yeast, classification, nonparametric methods,
The integrated use of a novel SNP-based genotyping method and gene
expression analysis using high-density oligonucleotide microarrays has
markedly increased the rate at which complex biological processes can be
analyzed. Application of these technologies to murine experimental models
of human inflammatory disease has enabled genetic susceptibility loci to be
rapidly identified. The identified murine susceptibility genes provide
insight into pathways regulating human inflammatory disease susceptibility.
The analysis of complex traits was also accelerated by a new computational
method that predicts chromosomal regions regulating complex traits in mice.
Many genes that share function must be similarly
expressed. Prokaryotes have solved this problem by physically linking
many genes with common functions and transcribing them as a single
message, guaranteeing proper regulation. Eukaryotes, for the most
part, have not used this mechanism, and until recently there has been relatively
little evidence that the immediate context of a gene plays a
substantial role in gene regulation. A number of studies have now
indicated that neighboring genes frequently have similar expression profiles. I will review the recent literature and describe our results.
Using False Discovery Rates in DNA Microarrays
DNA microarrays allow the simultaneous measurement of the expression
levels of thousands of genes from a single sample of cells. A
common experiment performed with this technology is to obtain a
number of microarrays from two or more types of cells. A basic yet
important question one can try to answer from these data is which
genes show a statistically significant change in gene expression
between the cell types. This task falls under the statistical
heading of multiple hypothesis testing. One must (1) form a
statistic for each gene, (2) calculate the null distribution, (3)
form a set of significance regions, and (4) assess the false positives in some fashion. We briefly review these four issues, but concentrate on the last one. We argue that false discovery rates are particularly appropriate for assessing false positives in these experiments. We present some current developments in false discovery rate methodology that are well suited for these data.
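For readers unfamiliar with false discovery rate control, the sketch below shows the classical Benjamini-Hochberg step-up procedure applied to per-gene p-values. It is a standard stand-in chosen for illustration, not necessarily the methodology developed in this talk. The function name and the default level are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha.

    Classical Benjamini-Hochberg step-up rule: find the largest rank k such
    that the k-th smallest p-value is <= alpha * k / m, then reject the k
    hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank  # keep the largest rank that passes
    return sorted(order[:k])
```

Note the step-up behavior: a gene whose p-value misses its own threshold can still be called significant if a gene with a larger p-value passes a later threshold.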
Detecting structured motifs in DNA sequences
Identification of transcription factor binding sites (motifs) is of major interest in contemporary biology. Recent developments in comparative genomics and DNA microarrays have increased attention to this challenging problem. Many proposed methods aim, in general, to find the strongest, best-represented signal(s) in the data. However, these signals may not always represent the most biologically interesting motifs. It has also been noticed that many DNA-binding proteins, especially in bacteria, bind to motifs with specific entropy structures.
We develop a method for detecting these (possibly) weakly
represented structured motifs. Specifically, we use a multinomial
mixture model for the regulatory region, and apply specific entropy
constraints to the motifs. In this talk, I will explain the methodology
and present simulations that compare this method with a commonly used
Correlated Amino Acid Substitutions and Sequence Alignment
Standard methods for protein sequence alignment and homology detection
assume that the probability of an amino acid undergoing substitution
during the course of evolution is uncorrelated with the identity of
neighboring amino acids in the protein sequence. In this talk I will
describe our novel pairwise sequence alignment algorithm that
explicitly incorporates these nontrivial substitution correlations, and
our extensive evaluation of the ability of our method to detect remote
Comparative Genomics Tools for Biological Discovery
The deluge of genomic sequence that is rapidly appearing in databases is leading to the need for faster and more robust programs for analyzing the data. For example, in addition to aligning single genes there is a necessity to align hundreds of kilobases of BACs or even entire genomes. The algorithmic challenges posed by these large datasets have been accompanied by user interface challenges, such as how to visualize information related to enormous datasets and how to enable users to interact with the data and the processing programs.
We have developed an integrated set of tools that serves as a
platform for comparative analysis of genomic sequences on a whole
genome scale. The tools have proved to be efficient in finding genes and
conserved non-coding elements that potentially play a role in gene
regulation. Examples of using our tools for biological discovery will be presented.
Gene Expression Profiling in the Developing Central Nervous System
DNA microarrays allow the monitoring of expression levels in
cells for thousands of genes simultaneously. When looking for differentially
expressed genes, two questions come to mind:
I will introduce our recently developed step-down procedure which
computes false discovery rates to answer those questions.
This new procedure does not assume independence among the data and tries to
incorporate the dependence information by resampling from the data to
improve power. Our procedure can clearly identify the number of
differentially expressed genes, which partly answers question a), and the
computed false discovery rates partly answer question b). The
step-down procedure has been applied to microarray datasets.
I will discuss some aspects of the mathematical analysis of time series
experiments, involving data from cDNA arrays. Infection of E. coli by phage
lambda provides a very useful model, due in part to the wealth of biological
understanding of this system.
We present a simple method for producing a noise distribution for the data,
by extracting a multitude of "noise" expression profiles. This
distribution can then be used to assign probability noise scores to individual
genes, as well as similarity scores to collections of genes.
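One simple way to turn a collection of noise profiles into a per-gene score is an empirical tail probability, sketched below. This is only a guess at the general idea, not the method presented in the talk. The summary statistic (the range of a profile) and the function names are assumptions made for this example.

```python
def profile_range(profile):
    """Hypothetical summary statistic: spread of a time-series expression profile."""
    return max(profile) - min(profile)


def noise_score(profile, noise_profiles, summary=profile_range):
    """Empirical probability that a noise profile looks at least as extreme.

    Counts the noise profiles whose summary statistic is >= that of the gene,
    with the usual +1 correction so the score is never exactly zero.
    """
    observed = summary(profile)
    exceed = sum(1 for p in noise_profiles if summary(p) >= observed)
    return (exceed + 1) / (len(noise_profiles) + 1)
```

A score near 1 means the gene's profile is indistinguishable from noise under the chosen statistic, while a small score flags a profile that the noise distribution rarely produces.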
Classification of Gene Microarrays by Penalized Logistic Regression
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this talk, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE performs consistently well, in both cross-validation and test samples. A fast algorithm for solving PLR also exists.
Joint work with Professor Trevor Hastie.
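To make the contrast with a bare class label concrete, here is a minimal sketch of logistic regression with a ridge (L2) penalty, fitted by plain gradient descent. The abstract does not specify the penalty or the fast algorithm, so both choices here are assumptions, as are the function names and default parameters.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_plr(X, y, lam=0.1, lr=0.1, n_iter=2000):
    """Minimize mean log-loss + (lam / 2) * ||w||^2 by gradient descent.

    X is a list of feature rows (e.g. expression values), y a list of 0/1
    labels. The intercept b is not penalized.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [lam * wj for wj in w]  # gradient of the ridge penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(p):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that sample x belongs to class 1."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
```

Calling `predict_proba` on a new sample returns a number between 0 and 1 rather than only a label, which is the advantage over the SVM that the abstract highlights.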
The topology of DNA plays important roles in many cellular processes. Certain enzymatic reactions, such as site-specific recombination or topoisomerase action, can alter both topology and geometry of circular DNA molecules. I will here present my work on the analysis of Xer site-specific recombination. Analysis of the experimental data uses mathematical and computational knot theory.
All products of Xer recombination at directly repeated psi sites from circular unknotted DNA plasmids share the same topology (right-handed 4-crossing links). The tangle model is a mathematical tool which uses the enzyme-mediated changes in the topology of circular DNA substrates to compute enzyme binding and mechanism. Tangle analysis is here used to study Xer recombination. Under appropriate mathematical and biological assumptions, all possible topological mechanisms consistent with given experimental data are computed. A unique 3-dimensional topological mechanism is proposed to account for the enzymatic action.
The genotypes of affected siblings contain information about both allele sharing and allelic association, either of which can point to the presence of a disease-related gene. Allele sharing tests, also known as linkage or identity-by-descent tests, are designed to detect whether siblings who share the same disease also tend to inherit the same alleles at a genetic locus. Allelic association tests, such as the transmission-disequilibrium test, are designed to detect the association of the disease and a particular allele in the population at large. I will describe a new test based on a general model formulated in terms of family-specific relative risks. The test combines both types of information in order to obtain good power in the presence of heterogeneity, gene-environment interactions, multilocus effects, Hardy-Weinberg disequilibrium, linkage disequilibrium, and other unpredictable factors that might affect the disease. I will also show how observed levels of allele sharing and allelic association yield interesting clues about which genetic and population models are most plausible in light of the data.
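As background for the allelic association tests mentioned above, the sketch below computes the classical transmission-disequilibrium test statistic from counts of heterozygous parents who did or did not transmit a given allele to an affected child. It illustrates the existing test only, not the new combined test described in the talk. The function and argument names are assumptions.

```python
import math


def tdt(transmitted, untransmitted):
    """Classical TDT: (b - c)^2 / (b + c), compared to chi-square with 1 d.f.

    transmitted   = heterozygous parents who passed the allele to an affected child
    untransmitted = heterozygous parents who did not
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("need at least one informative (heterozygous) parent")
    stat = (transmitted - untransmitted) ** 2 / total
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For example, 60 transmissions against 40 non-transmissions gives a statistic of 4.0, just past the conventional 0.05 threshold for one degree of freedom.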
last updated September 10, 2002