PH 296
Fall 2002

 


Statistics and Genomics Seminar - Fall 2002

N. B. Some of the speakers have kindly provided the slides for their presentation. Click on the talk title to download a pdf version of the slides.

Thursday, August 29th

Prediction of gene expression in yeast using conserved sequence templates
Derek Chiang
UC Berkeley

Regulation of gene expression in response to environmental cues drastically impacts cellular metabolism and development. A fundamental problem is to understand how information underlying transcriptional changes is encoded in genome sequences. Transcriptional control regions for individual genes contain multiple short regulatory sequences that enable gene regulation. Whereas previous approaches incorporate the presence or absence of multiple regulatory sequences in predicting gene expression outcome, most of them ignore positional arrangements of these sequences. This work investigates the utility of positional sequence information as a predictor of gene expression in the yeast Saccharomyces cerevisiae.

I will describe two innovations for analyzing genome sequences. First, sequence conservation among closely related yeast species provides prior information regarding regulatory sequences. Second, a template approach that considers joint distributions of sequence pairs enables increased prediction specificity. The following combined method systematically evaluates the statistical significance of sequence pairs. Enriched joint conservation of sequence pairs is assessed using a chi-square test for independence. Next, sequence pair templates are constructed if conserved word pairs are closely positioned, as tested with a nonparametric bootstrap. Finally, subsets of genes matching these sequence templates are evaluated for coherent gene expression using a Kolmogorov-Smirnov test to compare gene expression distributions.

Keywords: gene expression, yeast, classification, nonparametric methods, multiple testing.
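
As a rough illustration of the final step in the abstract above, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between two empirical distribution functions. The sketch below is not the speaker's code: the function name `ks_statistic` and the toy expression values are hypothetical, and only the statistic (not its p-value) is computed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values,
    # so it is enough to compare them at every data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical expression values: genes matching a template vs. all others.
template_genes = [1.2, 0.9, 1.5, 1.1]
other_genes = [0.1, -0.3, 0.2, 0.0, 1.0]
print(ks_statistic(template_genes, other_genes))
```

A large statistic suggests that genes matching a template have a shifted expression distribution; assessing significance would still require a null distribution and, across many templates, a multiple testing correction.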



Thursday, September 5th

From Mouse to Man: Understanding Complex Disease
Dr. Gary Peltz
Roche Bioscience

The integrated use of a novel SNP-based genotyping method and gene expression analysis using high-density oligonucleotide microarrays has markedly increased the rate at which complex biological processes can be analyzed. Application of these technologies to murine experimental models of human inflammatory disease has enabled genetic susceptibility loci to be rapidly identified. The identified murine susceptibility genes provide insight into pathways regulating human inflammatory disease susceptibility. The analysis of complex traits was also accelerated by a new computational method that predicts chromosomal regions regulating complex traits in mice.



Thursday, September 12th

Transcription Neighborhoods
Dr. Paul Spellman
Berkeley Drosophila Genome Project

Many genes that share function must be similarly expressed. Prokaryotes have solved this problem by physically linking many genes with common functions and transcribing them as a single message, guaranteeing proper regulation. Eukaryotes, for the most part, have not used this mechanism, and until recently there has been relatively little evidence that the immediate context of a gene plays a substantial role in gene regulation. A number of studies have now indicated that gene neighbors frequently show similar expression profiles. I will review the recent literature and describe our results.



Thursday, September 19th

Using False Discovery Rates in DNA Microarrays
Professor John D. Storey
Department of Statistics, UC Berkeley

DNA microarrays allow the simultaneous measurement of the expression levels of thousands of genes from a single sample of cells. A common experiment performed with this technology is to obtain a number of microarrays from two or more types of cells. A basic yet important question one can try to answer from these data is which genes show a statistically significant change in gene expression between the cell types. This task falls under the statistical heading of multiple hypothesis testing. One must (1) form a statistic for each gene, (2) calculate the null distribution, (3) form a set of significance regions, and (4) assess the false positives in some fashion. We briefly review these four issues, but concentrate on the last one. We argue that false discovery rates are particularly appropriate for assessing false positives in these experiments. We present some current developments in false discovery rate methodology that are well suited for these data.
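
To make step (4) concrete, here is a minimal sketch of the classical Benjamini-Hochberg step-up rule, one standard way to control the false discovery rate. It is offered only as background and is not necessarily the methodology presented in the talk; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return indices of hypotheses rejected by the Benjamini-Hochberg rule."""
    m = len(p_values)
    # Rank genes from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: remember the largest rank whose p-value meets its threshold.
        if p_values[idx] <= rank * fdr_level / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff_rank])


# Hypothetical per-gene p-values from a two-group comparison.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, fdr_level=0.05))
```

Because the rule is step-up, a gene can be rejected even when its own p-value exceeds its rank threshold, as long as some gene with a larger p-value meets the threshold for its rank.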



Thursday, September 26th

Detecting structured motifs in DNA sequences
Sunduz Keles
Division of Biostatistics, UC Berkeley

Identification of transcription factor binding sites (motifs) is a major interest in contemporary biology. Recently, developments in comparative genomics and DNA microarrays have increased attention to this challenging problem. Many proposed methods aim, in general, to find the strongest, well-represented signal(s) in the data. However, these signals may not always represent the biologically most interesting motifs. It has also been noticed that many DNA binding proteins, especially in bacteria, bind to motifs with specific entropy structures.

We develop a method for detecting these (possibly) weakly represented structured motifs. Specifically, we use a multinomial mixture model for the regulatory region, and apply specific entropy constraints to the motifs. In this talk, I will explain the methodology and present simulations that compare this method with a commonly used approach.
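
The abstract does not define its entropy constraints, so the following is only a sketch of the underlying quantity: the Shannon entropy of each position in a motif, where every position is a multinomial distribution over the four bases. The names `column_entropy` and `entropy_profile` and the example matrix are hypothetical.

```python
import math


def column_entropy(probs):
    """Shannon entropy, in bits, of one motif position's base distribution."""
    # Terms with zero probability contribute nothing to the entropy.
    return sum(-p * math.log2(p) for p in probs if p > 0)


def entropy_profile(motif):
    """Entropy of every position in a motif given as rows of (A, C, G, T) probabilities."""
    return [column_entropy(position) for position in motif]


# Hypothetical 4-position motif.
motif = [
    (1.0, 0.0, 0.0, 0.0),      # fully conserved position
    (0.25, 0.25, 0.25, 0.25),  # uninformative position
    (0.5, 0.5, 0.0, 0.0),      # two equally likely bases
    (0.0, 0.0, 0.0, 1.0),      # fully conserved position
]
print(entropy_profile(motif))
```

A structured motif in this sense would be one whose entropy profile follows a prescribed pattern, for example conserved ends with a variable middle.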



Thursday, October 3rd

Correlated Amino Acid Substitutions and Sequence Alignment
Dr. Gavin E. Crooks
Plant and Microbial Biology, UC Berkeley

Standard methods for protein sequence alignment and homology detection assume that the probability of an amino acid undergoing substitution during the course of evolution is uncorrelated with the identity of neighboring amino acids in the protein sequence. In this talk I will describe our novel pairwise sequence alignment algorithm that explicitly incorporates these nontrivial substitution correlations, and our extensive evaluation of the ability of our method to detect remote homologies.



Thursday, October 10th

Comparative Genomics Tools for Biological Discovery
Dr. Inna Dubchak
Genome Sciences Department, LBNL

The deluge of genomic sequence that is rapidly appearing in databases is leading to the need for faster and more robust programs for analyzing the data. For example, in addition to aligning single genes there is a necessity to align hundreds of kilobases of BACs or even entire genomes. The algorithmic challenges posed by these large datasets have been accompanied by user interface challenges, such as how to visualize information related to enormous datasets and how to enable users to interact with the data and the processing programs.

We have developed an integrated set of tools that serves as a platform for comparative analysis of genomic sequences on a whole-genome scale. The tools have proved efficient in finding genes and conserved non-coding elements that potentially play a role in gene regulation. Examples of using our tools for biological discovery will be presented.



Thursday, October 17th

Gene Expression Profiling in the Developing Central Nervous System
Professor John Ngai
Molecular & Cell Biology, UC Berkeley



Thursday, October 24th

A Global Optimization Approach to Protein Structure Prediction
Dr. Silvia Crivelli
Research Scientist, NERSC, QB3 Institute
Professor Teresa Head-Gordon
Department of Bioengineering, UC Berkeley


We describe our global optimization method, called Stochastic Perturbation with Soft Constraints (SPSC), which uses a neural network to make good predictions of certain aspects of protein structure, such as helices, sheets, and coil regions, and then applies those predictions both as restraints within a local optimization algorithm and as guidance within various global optimization frameworks. Our approach is also characterized by the use of an all-atom energy function that includes a novel hydrophobic solvation function derived from experiments, which shows promising ability for energy discrimination against misfolded structures. We present the results obtained using our SPSC method and energy function for blind prediction in the 4th Critical Assessment of Techniques for Protein Structure Prediction (CASP4) competition, and show that our approach is more effective on targets for which less information from known proteins is available. In fact, our SPSC method produced the best prediction for one of the most difficult targets of the competition, a new-fold protein of 240 amino acids.



Thursday, October 31st

The step-down procedure to compute False Discovery Rates for Microarray data
Yongchao Ge
Department of Statistics, UC Berkeley

DNA microarrays allow the monitoring of expression levels in cells for thousands of genes simultaneously. Finding differentially expressed genes raises two questions:
a) Which genes are differentially expressed?
b) How can we assign significance levels to the differentially expressed genes?

I will introduce our recently developed step-down procedure, which computes false discovery rates to answer these questions. This new procedure does not assume independence among the data and tries to incorporate the dependence information by resampling from the data to improve power. Our procedure can clearly identify the number of differentially expressed genes, which partly answers question a), and the computed false discovery rates partly answer question b). The step-down procedure has been applied to microarray datasets.
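
Resampling-based procedures build their null distributions from the data rather than from a parametric assumption. The sketch below shows only that building block, an exact permutation p-value for a single gene, and is not the step-down procedure described in the talk; the function name `permutation_p_value` and the toy values are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    count = 0
    extreme = 0
    # Enumerate every relabeling of the pooled values into groups of the same sizes.
    for idx in combinations(range(len(pooled)), n_a):
        count += 1
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed:
            extreme += 1
    return extreme / count


# Hypothetical expression values for one gene in two cell types.
print(permutation_p_value([5, 6, 7, 8], [1, 2, 3, 4]))
```

Exhaustive enumeration is only feasible for small samples; with more arrays one would draw random relabelings instead, and applying the same relabelings to all genes at once is what preserves the dependence between genes.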



Thursday, November 7th

Analysis of cDNA array data from time course experiments: E. coli infection by phage lambda
Dr. Monica Nicolau
Department of Statistics, UC Berkeley and Department of Genetics, Stanford University

I will discuss some aspects of the mathematical analysis of time series experiments involving data from cDNA arrays. Infection of E. coli by phage lambda provides a very useful model, due in part to the wealth of biological understanding of this system. We present a simple method for producing a noise distribution for the data, by extracting a multitude of "noise" expression profiles. This distribution can then be used to assign probability noise scores to individual genes, as well as similarity scores to collections of genes.



Thursday, November 14th

Classification of Gene Microarrays by Penalized Logistic Regression
Ji Zhu
Department of Statistics, Stanford University

Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this talk, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE performs consistently well, in both cross-validation and test samples. A fast algorithm for solving PLR also exists.
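
The point about probability estimates can be illustrated with a small sketch of logistic regression with a quadratic (ridge) penalty, fitted here by plain gradient descent. This is an assumption-laden toy, not the fast algorithm mentioned in the abstract: the names `fit_plr` and `predict_proba`, the penalty value, and the two-gene data set are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_plr(X, y, penalty=0.1, step=0.1, iterations=2000):
    """Minimize mean log-loss + (penalty / 2) * ||weights||^2 by gradient descent.

    X: list of samples (lists of expression values); y: 0/1 class labels.
    Returns (intercept, weights); the intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(iterations):
        grad_b, grad_w = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = sigmoid(intercept + sum(w * v for w, v in zip(weights, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        intercept -= step * grad_b
        weights = [w - step * (g + penalty * w) for w, g in zip(weights, grad_w)]
    return intercept, weights


def predict_proba(model, x):
    """Estimated probability that sample x belongs to class 1."""
    intercept, weights = model
    return sigmoid(intercept + sum(w * v for w, v in zip(weights, x)))


# Hypothetical data: two genes, three samples per class.
X = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.0], [-1.9, 0.1], [-2.1, -0.1], [-2.0, 0.2]]
y = [1, 1, 1, 0, 0, 0]
model = fit_plr(X, y)
print(predict_proba(model, [2.0, 0.0]), predict_proba(model, [-2.0, 0.0]))
```

Unlike a class label alone, the returned value is a probability estimate, and the penalty keeps the weights finite even when the classes are perfectly separable, as they are in this toy data set.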

Joint work with Professor Trevor Hastie.



Thursday, November 21st

Topological Analysis of Site-specific Recombination
Dr. Mariel Vazquez
Department of Mathematics, UC Berkeley

The topology of DNA plays important roles in many cellular processes. Certain enzymatic reactions, such as site-specific recombination or topoisomerase action, can alter both topology and geometry of circular DNA molecules. I will here present my work on the analysis of Xer site-specific recombination. Analysis of the experimental data uses mathematical and computational knot theory.

All products of Xer recombination at directly repeated psi sites from circular unknotted DNA plasmids share the same topology (right-handed 4-crossing links). The tangle model is a mathematical tool which uses the enzyme-mediated changes in the topology of circular DNA substrates to compute enzyme binding and mechanism. Tangle analysis is here used to study Xer recombination. Under appropriate mathematical and biological assumptions, all possible topological mechanisms consistent with given experimental data are computed. A unique 3-dimensional topological mechanism is proposed to account for the enzymatic action.



Thursday, December 5th

Allele Sharing and Allelic Association in Affected Sib Pairs
Professor Laura Lazzeroni
Department of Biostatistics, Stanford University

The genotypes of affected siblings contain information about both allele sharing and allelic association, either of which can point to the presence of a disease-related gene. Allele sharing tests, also known as linkage or identity-by-descent tests, are designed to detect whether siblings who share the same disease also tend to inherit the same alleles at a genetic locus. Allelic association tests, such as the transmission-disequilibrium test, are designed to detect the association of the disease and a particular allele in the population at large. I will describe a new test based on a general model formulated in terms of family-specific relative risks. The test combines both types of information in order to obtain good power in the presence of heterogeneity, gene-environment interactions, multilocus effects, Hardy-Weinberg disequilibrium, linkage disequilibrium, and other unpredictable factors that might affect the disease. I will also show how observed levels of allele sharing and allelic association yield interesting clues about which genetic and population models are most plausible in light of the data.
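
For reference, the transmission-disequilibrium test mentioned above reduces to a McNemar-type statistic on transmissions from heterozygous parents. The sketch below shows that standard statistic only, not the new combined test described in the talk; the function name `tdt_statistic` and the counts are hypothetical.

```python
def tdt_statistic(transmitted, untransmitted):
    """McNemar-type TDT statistic for one allele.

    transmitted: number of heterozygous parents who passed the allele
        to an affected child.
    untransmitted: number of heterozygous parents who passed the other allele.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total


# Hypothetical counts from 100 heterozygous parents.
print(tdt_statistic(60, 40))
```

Under the null hypothesis the statistic is approximately chi-square distributed with one degree of freedom, so values above about 3.84 are significant at the 5% level.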



last updated September 10, 2002
sandrine@stat.berkeley.edu