PH 292, Section 013
Statistics and Genomics
Seminar
Fall 2005
Thursday, September 8th
Monitoring the level of alternatively spliced
mRNA through genome-wide microarrays
Dr. Marco Blanchette
Department of Molecular and Cell
Biology, UC Berkeley
Higher eukaryotes exploit alternative pre-mRNA splicing to diversify
their proteome, and to regulate gene expression with developmental
stage- and tissue-specificity. Alternative splicing is prevalent: for
example, the human genome contains 22,000 to 25,000 genes (less than
twice the number of genes found in Drosophila melanogaster),
yet more than 60% of these genes are alternatively spliced. A striking
example of the coding potential generated by pre-mRNA alternative
splicing is found in the Drosophila Dscam gene which, through
alternative splicing, has the potential to produce more than 33,000
different proteins. In order to get a better understanding of how
alternative splicing regulates gene expression in Drosophila, we have
developed a microarray platform aimed at monitoring changes in
the level of alternatively spliced mRNAs. In addition to providing a
measurement of variation in gene expression, this platform enables us
to identify changes in the abundance of alternatively spliced transcripts. A
description of the platform used, our computational approach, as well
as the different experiments performed using this platform will be
presented.
Thursday, September 15th
A Computational Framework for Conditional
Inference with an Application to Unbiased Recursive Partitioning
Professor Torsten Hothorn
Institut fuer Medizininformatik,
Biometrie und Epidemiologie
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
The pioneering work
of R. A. Fisher, E. J. G. Pitman and B. L. Welch on randomization tests
published in the 1930s did not find its way into statistical practice
for a long time. The conceptually simple principle of conditioning on
all permutations of the data is helpful to address a huge class of
independence problems. Given that powerful and flexible software
implementations are available, we argue that a fresh look at permutation
tests is fruitful.
Based on the theoretical framework of permutation tests published by
Strasser & Weber (1999), we propose a unified computational
framework for conditional inference. Applications include tests on
independence between two variables measured at arbitrary scales as well
as multiple testing procedures. Based on this framework it is easy to
implement conditional versions of well-known procedures like linear
rank tests, Cochran-Mantel-Haenszel tests or linear association tests
and less well-known methods like maximally selected two-sample
statistics. Much more interesting is the fact that new strategies for
assessing independence can be implemented and evaluated on the fly.
To illustrate the flexibility of both the theoretical and
computational components, permutation tests are applied to remove the
variable selection bias from recursive partitioning procedures. We show
how tree-structured regression models can be embedded into a
statistical framework, i.e., with control of well-defined errors.
Moreover, we suggest an internal stopping criterion for trees based on
multiple testing procedures applied to the observations in each node of
a tree. Benchmark experiments show that statistical internal stopping
performs at least as well as the conventional post-pruning approach.
Joint work with Kurt Hornik and Achim Zeileis, Wirtschaftsuniversitaet
Wien.
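The principle of conditioning on all permutations of the data can be conveyed with a small Monte Carlo sketch. This is an illustration only, not the speakers' implementation (their framework is available as R software); the data, sample sizes, effect size, and random seed below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-sample data, simulated for illustration
x = rng.normal(0.0, 1.0, size=20)
y = rng.normal(0.8, 1.0, size=20)

# Test statistic: difference in sample means
observed = x.mean() - y.mean()

# Condition on the pooled data: under the null of independence between
# response and group label, every relabeling of the 40 observations
# into two groups of 20 is equally likely.
pooled = np.concatenate([x, y])
n = len(x)
n_perm = 10_000
exceed = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:n].mean() - perm[n:].mean()
    if abs(diff) >= abs(observed):
        exceed += 1

# Monte Carlo p-value (add-one correction keeps it strictly positive)
p_value = (exceed + 1) / (n_perm + 1)
```

The same scheme works for any test statistic: only the line computing `diff` changes, which is the flexibility the abstract alludes to.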
Thursday, September 22nd
Detecting Cis-Regulatory Modules by Modeling
Correlated Structures in Genomic Sequences
Qing Zhou
Department of Statistics, Stanford
University
Cis-regulatory modules composed of multiple transcription factor
binding sites control gene expression in eukaryotic genomes. We propose
a hierarchical mixture approach to model the cis-regulatory module
structure. Based on the model, a new de novo motif-module discovery
algorithm, CisModule, is developed for the Bayesian inference of module
locations and within-module binding sites. We illustrate the use of
CisModule by its application to the discovery of a novel
tissue-specific regulatory module in Ciona savignyi. In addition,
comparative genomic studies show that regulatory elements are more
conserved across species due to evolutionary constraints. Thus we
further extend our approach to combine both module structures and
cross-species orthology in motif discovery. We use a hidden Markov
model (HMM) to capture the module structure in each species and couple
these HMMs through multiple-species alignment. Our new method has been
tested on both simulated and biological data sets, where significant
improvement over other module discovery and phylogenetic motif
discovery methods was observed.
Joint work with
Wing Hung Wong.
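As a rough sketch of the kind of machinery involved (not CisModule itself, whose model is richer), a toy two-state HMM -- background versus module -- can score a DNA sequence with the scaled forward algorithm. All transition, start, and emission probabilities below are invented for illustration:

```python
import numpy as np

# Toy two-state HMM: state 0 = background, state 1 = inside a module
trans = np.array([[0.95, 0.05],
                  [0.10, 0.90]])
start = np.array([0.9, 0.1])
# Hypothetical emission probabilities over A, C, G, T for each state
emit = np.array([[0.25, 0.25, 0.25, 0.25],   # background: uniform
                 [0.40, 0.10, 0.10, 0.40]])  # module: AT-biased

def forward_loglik(seq):
    """Log-likelihood of a DNA string under the toy HMM (forward
    algorithm with per-step rescaling to avoid underflow)."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    obs = [idx[c] for c in seq]
    alpha = start * emit[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        s = alpha.sum()
        loglik += np.log(s)
        alpha /= s
    return loglik
```

Because state 1 favors A and T while state 0 is uniform, an AT-rich string scores higher than a GC-rich one of the same length; module discovery methods exploit exactly this kind of likelihood contrast.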
Thursday, September 29th
Multiple Testing Procedures for Control of Tail Probability
of Proportion of False Positives
Professor Mark J. van der Laan
Division of Biostatistics, UC Berkeley
A fundamental
tool in the analysis of genomic data is a valid multiple testing
procedure controlling a specified Type-I error rate. In past work (with
Pollard and Dudoit) we have provided, for general hypotheses and
test statistics, resampling-based multiple testing procedures
asymptotically controlling specified Type-I error rates at a specified
level alpha. The proposed procedures differ from the resampling-based
multiple testing methodology presented in the book by Westfall and
Young in the choice of null distribution, and as a consequence
they could be shown to provide asymptotic control of the Type-I error
in general (no need for a so-called subset pivotality condition). In
this talk, we present a new multiple testing procedure
(asymptotically) controlling the tail probability of the proportion of
false positives (TPPFP) at a user-supplied proportion q at level alpha,
which we call the TPPFP empirical Bayes resampling-based multiple
testing procedure. This method combines our proposed null distribution
for the test-statistics with the empirical Bayes model which has been
previously used to control the FDR in work of John Storey and Brad
Efron. We also highlight some ongoing work on pathway testing and
variable importance testing in prediction.
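The procedure in the talk is an empirical Bayes resampling method, but the flavor of TPPFP control can be conveyed by the simpler augmentation idea from the same group: start from any FWER-controlling set of rejections and enlarge it so that the added hypotheses make up at most a proportion q of the total. A minimal sketch, with invented p-values and Bonferroni as the initial FWER step:

```python
import math

def tppfp_augmentation(pvals, alpha=0.05, q=0.5):
    """Augmentation sketch: take an FWER-controlled rejection set of
    size r0, then add the next floor(q*r0/(1-q)) smallest p-values;
    the added (possibly false) rejections are then at most a
    proportion q of the enlarged set."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Step 1: any FWER-controlling procedure works; Bonferroni here
    r0 = sum(1 for p in pvals if p <= alpha / m)
    # Step 2: augment, since TPPFP(q) control tolerates a fraction q
    extra = math.floor(q * r0 / (1 - q))
    return order[: min(m, r0 + extra)]

# Hypothetical p-values for ten tests
pvals = [0.0001, 0.0003, 0.004, 0.03, 0.08, 0.2, 0.4, 0.6, 0.8, 0.9]
rejected = tppfp_augmentation(pvals, alpha=0.05, q=0.5)
```

With these invented p-values, Bonferroni at alpha = 0.05 rejects the three smallest (threshold 0.005), and augmentation with q = 0.5 adds three more, for six rejections in total.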
Thursday, October 6th
New methods for detecting lineage-specific evolution of DNA
Dr. Katherine S. Pollard
Department of Biomolecular Engineering, UC
Santa Cruz
Most DNA evolves
neutrally, but selection (both positive and negative), mutation rate
variation, and biased gene conversion can all alter the rate at which
DNA substitutions, insertions, and deletions occur. Many methods have
been proposed for using multiple species alignments to find sequences
that are not evolving neutrally. There has been particular interest,
for example, in DNA elements conserved across many species, because
these are likely to have been under negative selection, suggesting a
functional role. Most comparative genomics methods have assumed
evolutionary pressures are the same across all branches of a phylogeny
and therefore have little power to detect elements that have come under
selection or begun to drift on a single lineage. I will present two new
methods motivated by this problem of lineage-specific evolution. Both
methods are particularly useful for identifying noncoding sequences and
have been efficiently implemented so that they can be used to screen
entire genomes. The first is based on a phylogenetic hidden Markov
model (phylo-HMM), and does not require the lineage of interest or
element boundaries to be determined a priori. Because we do not assume
that substitutions follow a Poisson process, this method can be applied
with a wide range of molecular evolutionary models. Insertions and
deletions (indels) are incorporated into the phylo-HMM by a simple
strategy that uses a separately reconstructed indel history. The second
method begins with a set of elements that are conserved in a particular
phylogeny and then screens these for the subset whose substitution rate
is significantly accelerated in an additional lineage. This approach
has allowed us to find the fastest evolving sequences in
the human genome. I will outline the methods and discuss results
obtained by applying these to both simulated and real data sets.
Finally, I will illustrate how population genetic methods can help us
understand the evolutionary forces behind non-neutrally evolving DNA.
Thursday, October 13th
Generation and Analysis of Spatial Patterns of Gene
Expression in Imaginal Discs
Cyrus Harmon
Department of Molecular and Cell Biology, UC Berkeley
We are
generating images of the spatial extent of gene expression in
developing Drosophila melanogaster larvae. By capturing images of
Drosophila imaginal discs that have been stained via in situ
hybridization to labeled probes for specific genes we can determine the
spatial patterns of expression of these genes. We have used gene
expression microarrays to identify a large number of candidate genes
which are then put into a high-throughput pipeline for image
generation. We have applied techniques from computer vision to perform
automated analysis of these images and are working on methods for
analyzing and comparing spatial patterns. I will present an overview of
the project, the methods used for gene selection, the methods used to
perform automated learning of imaginal disc shape and alignment of the
images of the stained discs, and some initial results.
Thursday, October 20th
Locating transcription factor binding sites using ChIP on
chip
Kasper D. Hansen
Division of Biostatistics, UC Berkeley
Recently, high-resolution genomic tiling arrays have become available.
By hybridizing samples exposed to chromatin immunoprecipitation on
such an array and comparing them with control samples, it is possible
to verify transcription factor binding sites experimentally, rather
than relying on purely computational tools. A suggestion for analyzing data from
such an experiment is presented.
Thursday, October 27th
Multiply Conserved Non-Coding Elements: In search of
functional classifications
Ben Brown
Graduate Group in Applied Science and Technology, UC Berkeley
We have conducted an examination of 2054 DNA sequences located
throughout the human genome, conserved to at least 70% between human
and fugu, and 98% between human and mouse. These sequences were
selected to avoid all known coding regions, and we therefore expect
them to compose a set of Conserved Non-coding Elements (CNEs). The
extraordinary conservation of these elements across hundreds of
millions of years of divergent evolution seems to imply substantial
functional importance. It has been proposed that some of these CNEs
serve as integration nodes in regulatory networks. We explore the
evidence for this hypothesis, and, utilizing existing and novel
computational methods, attempt to recapitulate regulatory interactions
encoded by the CNEs from analysis of primary sequence data and the
tissue expression data of nearby genes. We present both methods and
results.
Thursday, November 3rd
Localization of transcription factor binding sites via
chromatin immunoprecipitation (ChIP) and high-density tiling arrays: an
assay model, and its implications for analysis and interpretation
Richard Bourgon
Department of Statistics, UC Berkeley
High-density,
short-oligonucleotide tiling microarrays have recently become
available. When used in conjunction with traditional chromatin
immunoprecipitation techniques, such arrays permit in vivo, genome-wide
localization of transcription factor binding sites (or RNA polymerase,
histone modifications and histone-modifying proteins, etc.).
In this talk I
will introduce a statistical/physical model for the assay which makes
two important predictions: (i) we can expect peak-like signal in the
neighborhood of the phenomena under study, and (ii) there should be
appreciable spatial correlation, even in "noise" regions far away from
the loci of interest. Both of these predictions are borne out by actual
data, and both have implications for increasing statistical power and
avoiding false positives.
To date, several
authors have proposed methods for the analysis of high-density
"ChIP-chip" data. A few have taken advantage of (i), but none have
acknowledged (ii). As a consequence, no existing method yields
traditional p-values or a statistically grounded means of selecting a
cutoff for the test statistics it produces. I will present one simple,
non-parametric approach -- still a work in progress -- which
accommodates (ii) and produces FDR-corrected p-values from ChIP-chip
data.
Thursday, November 10th
Regulatory network dependencies from genetic
variation and quantitative expression profiling
Professor David C. Kulp
Department of Computer Science, University of Massachusetts
The combination
of whole genome expression profiling and polymorphic marker screening
has emerged as an ideal genetic perturbation model to detect causal
relationships among genes. By treating expression as a quantitative
phenotype, linkage analysis can reveal associated regulatory loci. We
developed an epistatic-like linkage model to jointly account for gene
expression and genotype and precisely map regulator genes. A
consideration of complete and reduced forms of the model provides the
means to dissect regulator-target relationships as causal or merely
dependent. In simulations we find that the model is robust with respect
to multiple independent regulators and we show that, in yeast,
regulator genes are accurately predicted and that regulatory modules
derived from pairwise linkage have biological significance.
Thursday, November 17th
History-Adjusted Marginal Structural Models and Time-Dependent Causal Effect Modification
Maya Petersen
Division of Biostatistics, UC Berkeley
Marginal structural models (MSM) provide a powerful tool for
estimating the causal effect of a treatment, particularly in the
context of longitudinal data structures. These models, introduced by
Robins, model the marginal distributions of treatment-specific
counterfactual outcomes, possibly conditional on a subset of the
baseline covariates. However, standard MSM cannot incorporate
modification of treatment effects by time-varying covariates. In the
context of clinical decision making such time-varying effect modifiers
are often of considerable interest, as they are used in practice to
guide treatment decisions for an individual. In this talk, I will
introduce a generalization of marginal structural models, which we
call history-adjusted marginal structural models (HA-MSM). These
models allow estimation of adjusted causal effects of treatment, given
the observed past, and are therefore more suitable for making
treatment decisions at the individual level and for identification of
time-dependent effect modifiers. In addition, HA-MSM identify a
particular optimal decision rule for assigning treatment at each time
point, based on a subject's measured covariates up to that time
point. I will provide a practical introduction to HA-MSM relying on an
example drawn from the treatment of HIV, and discuss parameters
estimated, assumptions, and implementation using standard software.
Thursday, December 1st
Analysis issues of oligonucleotide tiling array data
Professor Ru-Fang Yeh
Division of Biostatistics, University of California, San Francisco
The recent development of DNA tiling arrays has made it possible to experimentally annotate the genome and various protein-DNA interactions through unbiased interrogation of large genomic regions. Depending on the application, data from tiling array experiments pose unique analytic challenges that are very different from traditional expression array analysis. In this talk, I will discuss the preprocessing and analysis issues of tiling array-based experiments for DNA copy-number alterations (array CGH) and histone modifications (ChIP-chip) using NimbleGen custom arrays, and offer our preliminary solutions.
Thursday, December 8th
Analysis of Ecological Data: Use of Phylogenetic Trees with
Diversity Measurements
Elizabeth Purdom
Department of Statistics, Stanford University
One type of dataset from ecological studies comes from counting the number of species observed at various locations. Usually these data take the form of an L x S contingency table, where each entry gives the number of times species s was observed in location l. A common goal for this kind of dataset is to measure the diversity of the ecological communities, as well as to meaningfully compare the composition of species in different locations. However, phylogenetic relationships among species significantly affect notions of diversity and comparisons among locations, yet they are often not incorporated into the analysis. A recent method, Double Principal Coordinates Analysis (DPCoA) (Pavoine et al., Journal of Theoretical Biology, 2004), incorporates phylogenetic distance between species into the comparison of locations. We show that DPCoA can be cast as PCA using a particular inner product. With this framework we can compare DPCoA to traditional methods of PCA and Correspondence Analysis, as well as to traditional phylogenetic comparative methods, for example, Felsenstein's Independent Contrasts. Furthermore, we briefly highlight how this approach is a special case of more general methods of incorporating graphical information in a data analysis. We demonstrate these results on a genomic analysis of microbial communities found within the human intestinal tract (Eckburg et al., Science, 2005).
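The "PCA with a particular inner product" viewpoint can be sketched as follows: if the inner product on species space is u'Mv for a positive-definite matrix M (in DPCoA, M would be derived from the phylogenetic distances), then mapping each row of the table through a square-root factor of M reduces the analysis to ordinary Euclidean PCA. The counts and the metric below are invented, not taken from the cited studies:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy L x S table: 6 locations, 4 species (invented counts)
X = rng.poisson(5.0, size=(6, 4)).astype(float)

# Hypothetical positive-definite metric M on species space; in DPCoA
# it would be built from phylogenetic distances between the species.
A = rng.normal(size=(4, 4))
M = A @ A.T + 4.0 * np.eye(4)

# Factor M = R.T @ R and map rows so that Euclidean geometry in the
# transformed space equals M-geometry in the original space
R = np.linalg.cholesky(M).T
Z = (X - X.mean(axis=0)) @ R.T

# Ordinary PCA on the transformed data via the SVD
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s  # principal coordinates of the locations

# Distances between locations in score space equal M-distances
d_M = np.sqrt((X[0] - X[1]) @ M @ (X[0] - X[1]))
d_pc = np.linalg.norm(scores[0] - scores[1])
```

Because the SVD's right factor is orthogonal, distances among the rows of `scores` exactly reproduce the M-distances among locations, which is the sense in which the method is ordinary PCA under a different inner product.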