Seminar - Fall 2001
Identifying sequence motifs involved in the regulation of gene expression in yeast
From bacteria to humans, organisms simplify the regulation of genomic expression by coregulating genes that are involved in the same cellular processes. Much of the regulation of gene expression is mediated through specific sequence motifs in the genome; however, identifying the motifs involved in regulation remains a challenge in biology. This lecture will focus on two complementary approaches that are being developed to identify regulatory sequences in yeast. In the first method, complete sets of coregulated genes are identified through fuzzy clustering of yeast genomic expression data, and sequence motifs common to the promoters of the coregulated genes are identified. The second method searches through sequence space to identify motifs that are optimal predictors of gene expression. The advantages and challenges of both methods will be discussed.
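As an illustration of what fuzzy clustering means here: each gene receives a graded membership in every cluster rather than a single hard label, so a gene can belong to more than one coregulated group. The abstract does not say which algorithm is used; the sketch below assumes standard fuzzy c-means with Euclidean distance, and the function name, the `init_centers` argument, and the toy data are hypothetical.

```python
import math


def _memberships(points, centers, m):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point sits exactly on a center: give that center full weight.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            rows.append([v / sum(inv) for v in inv])
    return rows


def fuzzy_c_means(points, init_centers, m=2.0, n_iter=50):
    """Assumed textbook fuzzy c-means; m > 1 controls how soft the clusters are."""
    centers = [list(c) for c in init_centers]
    dim = len(points[0])
    for _ in range(n_iter):
        u = _memberships(points, centers, m)
        for k in range(len(centers)):
            # Each center is the mean of all points weighted by membership^m.
            w = [row[k] ** m for row in u]
            centers[k] = [
                sum(wi * p[j] for wi, p in zip(w, points)) / sum(w)
                for j in range(dim)
            ]
    return centers, _memberships(points, centers, m)


# Toy "expression profiles" (two conditions per gene), purely illustrative.
profiles = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, memberships = fuzzy_c_means(profiles, [(0.0, 0.0), (10.0, 10.0)])
```

In a real analysis the rows would be expression profiles across many conditions, and genes with high membership in the same cluster would be the candidates whose promoters are searched for shared motifs.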
Yeast Genetic Footprinting: Exploring the Effects of Gene
Inactivation on a Genome-Wide Scale
Using insertional mutagenesis on a very large population of yeast cells, we have followed the fate of cells carrying independent inactivations of each of the 6000 yeast genes after growth under a variety of nutrient conditions. With this technique we hope to shed light on the function of uncharacterized genes as well as to determine the minimum genetic requirements of a yeast cell for its full growth potential. The data are complex, and I'll describe some of the approaches we've taken to analyzing them (and hopefully get some ideas for further analysis from the audience!).
Applications of re-sampling methods to cluster analysis
A reliable and precise classification of tumors is essential for
successful diagnosis and treatment of cancer. cDNA microarrays and
high-density oligonucleotide chips are novel biotechnologies which are
being used increasingly in cancer research. By allowing the monitoring of
expression levels for thousands of genes simultaneously, such techniques
may lead to a more complete understanding of the molecular variations
among tumors and hence to a finer and more informative
Integrated Analysis of Data from Array-Based Experiments in Cancer
Biology is rapidly evolving into a quantitative molecular science. Many factors contribute to the acceleration of this evolution, including: completion of the human genome sequence, improvements in measurement technology for DNA, RNA, and proteins, and widespread efforts to characterize biological pathways in terms of specifically interacting molecular entities. The result of this trend is an increasing pool of quantitative biological data, which, for the first time, enables a systems-based approach to biological research. Given quantitative data, it is possible to induce constraints and interrelationships to construct predictive models of biological systems. Such models can provide context for the rapid interpretation of experimental observations. We believe that by integrating quantitative analytical methods and data visualization approaches with annotation information about biological entities, we can accelerate biological research by stimulating the generation of hypotheses that would otherwise be missed. The Jain Lab is focused on data relevant to the understanding of human cancer. We have a series of collaborations with UCSF Cancer Center investigators that are evolving extensive and growing sets of microarray-based expression data and/or high-resolution genomic copy number data in multiple malignancy types. This seminar will present methods for addressing quantitative questions in array-based data as well as examples of integrated analysis that combine experimental data with genomic and genetic annotations.
Some issues in low-level analysis of Affymetrix GeneChip data
The Affymetrix GeneChip technology provides a method of measuring gene expression. We will briefly introduce the GeneChip technology. We shall then consider issues involved in the analysis of the two basic types of probes on a GeneChip, the perfect match (PM) and mismatch (MM) probes, and how these can be used for measuring gene expression. Through the use of experimental data we will investigate various approaches to normalization, background correction, chip/probe quality assessment, and the calculation of expression measures.
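To make the PM/MM idea concrete, one simple expression summary for a probe set is the average of the PM minus MM differences over its probe pairs. The abstract does not say which measures the talk compares, so the sketch below is only an assumed baseline; the function name and the intensities are hypothetical.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one probe set (assumed summary)."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for a probe set with three probe pairs.
pm_intensities = [120.0, 200.0, 150.0]
mm_intensities = [100.0, 150.0, 110.0]
expression = average_difference(pm_intensities, mm_intensities)
```

A summary of this kind can turn negative when MM exceeds PM, which is one reason alternative background corrections and expression measures are worth investigating.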
Identification of regulatory motifs using gene expression data
Many methods have been developed to identify regulatory motifs from transcription control regions of genes that show similarities in gene expression across a variety of experimental conditions. We develop a methodology that is driven by the strengths of focusing on a single experimental condition. The method uses gene expression data to identify regulatory motifs that are involved in the regulation of genes in this particular experiment. After the extraction of appropriate features, which are essentially short DNA sequences from the transcription control regions of the genes, a linear model with two-way interactions is considered to explain gene expression by the extracted features. The most relevant features are selected by a feature selection algorithm based on forward selection with cross-validation. The method was applied to two publicly available yeast data sets and produced successful results.
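The selection step can be sketched as follows. This is a minimal illustration of forward selection scored by k-fold cross-validation on an ordinary least-squares linear model, not the speaker's implementation: the function names, the fold assignment, and the tiny ridge term added for numerical stability are all assumptions, and two-way interactions are left out (they could be appended as product columns).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, y, cols):
    """Least-squares coefficients for an intercept plus the chosen feature columns."""
    x = [[1.0] + [r[c] for c in cols] for r in rows]
    p = len(x[0])
    # Normal equations; the 1e-9 ridge is an assumption to avoid singular systems.
    xtx = [[sum(xi[a] * xi[b] for xi in x) + (1e-9 if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(p)]
    return solve(xtx, xty)


def cv_error(rows, y, cols, k=5):
    """Mean squared prediction error over k folds (observation i is in fold i % k)."""
    n = len(rows)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        beta = fit([rows[i] for i in train], [y[i] for i in train], cols)
        for i in range(fold, n, k):
            pred = beta[0] + sum(b * rows[i][c] for b, c in zip(beta[1:], cols))
            total += (y[i] - pred) ** 2
    return total / n


def forward_select(rows, y, k=5):
    """Add the feature that most lowers the CV error until no feature helps."""
    selected, remaining = [], list(range(len(rows[0])))
    best = cv_error(rows, y, selected, k)
    while remaining:
        score, c = min((cv_error(rows, y, selected + [c], k), c) for c in remaining)
        if score >= best:
            break
        selected.append(c)
        remaining.remove(c)
        best = score
    return selected, best
```

Here each row of `rows` would hold the feature values for one gene (for example, counts of candidate short sequences in its control region) and `y` the gene's expression in the single experiment under study.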
Multivariate survival models induced by genetic frailties, with
application to linkage analysis
Complex human diseases are often due to multiple disease genes and both genetic and environmental risk factors. These diseases often also show variable age of disease onset. In order to incorporate both covariate and age of onset information into genetic linkage analysis, we derive a multivariate survival model for age of onset data of a sibship from an additive genetic gamma frailty model constructed based on the inheritance vectors. Based on this model, we propose a retrospective likelihood approach for genetic linkage analysis using sibship data. This test is an allele-sharing-based test, and does not require specification of genetic models or the penetrance functions. This new approach can incorporate both affected and unaffected sibs, environmental covariates and age of onset or age at censoring information, and therefore provides a practical solution for mapping genes for complex diseases with variable age of onset. A small simulation study indicates that the proposed method has the correct type I error rate and performs better than the commonly used allele-sharing-based methods for linkage analysis, especially when the population disease rate is high. We demonstrate the methods using a type 1 diabetes sib pair data set and a data set of affected sib pairs of prostate cancer. If time permits, I will also briefly discuss extending the proposed model to linkage analysis of a two-locus disease model and to tests of association in the presence of linkage.
Detecting Differential Gene Expression in DNA Microarrays
DNA microarrays allow the simultaneous measurement of the expression levels of thousands of genes. A basic yet important question one may ask is which genes change across a set of microarrays arising from two or more biological conditions. In order to rigorously answer this question, one must perform a hypothesis test on each gene, forming an overall simultaneous error measure. We present a simple model to accomplish this task. The multiple comparisons are taken into account via the positive False Discovery Rate (pFDR), which allows both a frequentist and Bayesian interpretation of the model.
This is joint work with Robert Tibshirani and Bradley Efron.
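For readers unfamiliar with the pFDR, a plug-in estimate for a fixed p-value threshold can be sketched as below. The abstract gives no formulas, so this follows the commonly described form of the estimator as an assumption: the proportion of true nulls is estimated from the p-values above a tuning value `lam`, and the result is divided by the probability of at least one rejection. The function name and the example p-values are hypothetical.

```python
def pfdr_estimate(pvalues, threshold, lam=0.5):
    """Assumed plug-in pFDR estimate when rejecting all p-values <= threshold."""
    m = len(pvalues)
    # Estimated proportion of true null hypotheses, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    # Number of rejections (at least 1 to avoid division by zero).
    rejected = max(sum(p <= threshold for p in pvalues), 1)
    fdr = pi0 * m * threshold / rejected
    # Condition on at least one rejection occurring.
    return min(1.0, fdr / (1.0 - (1.0 - threshold) ** m))


# Hypothetical per-gene p-values.
pvals = [0.001, 0.002, 0.003, 0.004, 0.6, 0.7, 0.8, 0.9, 0.3, 0.2]
estimate = pfdr_estimate(pvals, threshold=0.01)
```

With thousands of genes the conditioning term is close to 1, so the estimate is then nearly the same as the corresponding FDR estimate.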
Assessing Variation in DNA Repair Capacity Using the Comet Assay
The DNA in every cell in our body is constantly under attack from mutagenic agents that induce various kinds of damage. As a consequence, evolution has developed sophisticated mechanisms for recognizing and repairing damage to DNA, as well as making sure that any damage is repaired before the DNA is replicated and the damage becomes a permanent somatic mutation. Biologists have organized genes identified as being responsible for DNA repair into various pathways corresponding to the type of damage induced.
The genes in these pathways have been shown to vary between individuals in ways that may result in quantitative differences in the ability of cells to repair damaged DNA. These inter-individual repair differences seem to be a common underlying mechanistic risk factor for various cancers.
At LLNL, we are developing an assay which will provide an integrated measure of an individual's DNA repair capacity for one of these pathways: Base Excision Repair. This assay is based on the "Comet assay", which uses image analysis to measure the electrophoretic migration of fluorescently-tagged DNA in single lysed cells after a mutagenic challenge.
In this talk I will discuss some of the challenges in developing this assay, focusing on statistical issues in designing the assay, choosing appropriate outcome measures, and evaluating the resulting sources of variation, all with an eye towards developing simple predictors of genotype-phenotype and genotype-risk relationships.
Picking alignments from (Steiner) trees
We will begin by reviewing the definitions associated with "alignments" of sequences and the associated dynamic programming algorithms for finding optimal alignments. We will then explain the connection between alignments and hidden Markov models, and finally how to reduce the search space for alignments using Steiner and Manhattan networks. These methods have many applications ranging from fast alignment methods for gene finding, to floor planning for kitchens.
This is joint work with Fumei Lam.
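As background for the dynamic programming part of the talk, the classic global alignment recurrence can be sketched as follows. This is the standard textbook algorithm (Needleman-Wunsch style) with a linear gap penalty, not the Steiner or Manhattan network method of the talk; the scoring values are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,                  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    return score[-1][-1]


# Aligning ACGT with AGT needs one gap: three matches and one gap give 2.
print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a) + 1) by (len(b) + 1) cells, and every cell is filled; methods that restrict the search space aim to avoid visiting most of them.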
Some aspects of the design and analysis of gene expression
The objective of experimental design is to make the analysis of data and the interpretation of the results as simple and powerful as possible, while keeping the purpose of the experiment and the constraints of the experimental material clearly in mind. In cDNA microarray experiments, a popular design choice is to use a common reference sample in every experiment. This approach provides an easy means of comparing multiple samples against one another, as well as permitting the combination of results from different experiments. We will show in theory and with experimental data that the common reference approach is inherently more variable than direct comparison. Furthermore, we demonstrate that in the absence of any common reference, combining different sets of experimental data can be done efficiently by performing A-optimal linked hybridizations between these sets of experiments. We will also describe different questions we have met and the experimental designs and analysis we have used to answer these questions. Experiments and data concerning the mouse olfactory system will be used to demonstrate our methods for analyzing multilevel factorial (including spatial and temporal) experiments.
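The variance claim about the common reference design can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the derivation given in the talk: every slide yields one log ratio with the same variance, slides are independent, and both designs use two slides to compare samples A and B. The symbols M, sigma, and delta are introduced here for illustration.

```latex
% M_{XY}: log ratio of sample X to sample Y measured on one slide,
% with Var(M_{XY}) = \sigma^2 and independence between slides (assumed).
\begin{align*}
\text{Direct (two A-versus-B slides):}\quad
  \hat{\delta}_{\mathrm{direct}} &= \tfrac{1}{2}\bigl(M_{AB}^{(1)} + M_{AB}^{(2)}\bigr),
  & \operatorname{Var}\bigl(\hat{\delta}_{\mathrm{direct}}\bigr) &= \frac{\sigma^2}{2}, \\
\text{Common reference } R \text{ (A vs. R, B vs. R):}\quad
  \hat{\delta}_{\mathrm{ref}} &= M_{AR} - M_{BR},
  & \operatorname{Var}\bigl(\hat{\delta}_{\mathrm{ref}}\bigr) &= 2\sigma^2 .
\end{align*}
```

Under these assumptions the reference-based estimate has four times the variance of the direct estimate for the same number of slides; correlation between slides or unequal variances would change the exact factor.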
last updated December 3, 2001