Statistics and Genomics Seminar
Spring 2005
Estimating evolutionary pathways by mutagenetic
trees
Niko Beerenwinkel
Department of Mathematics, UC Berkeley
Mutagenetic trees constitute a class of graphical models to describe the dependency
structure of non-reversible genetic changes. We present efficient
methods for estimating mutagenetic trees and mixture models of these.
The techniques are applied to estimating evolutionary pathways of HIV
under pressure of antiviral therapy, and to chromosome alterations that
accumulate during tumor progression.
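
The abstract does not spell out the model, so the following is only a minimal sketch under stated assumptions: a mutagenetic tree is taken to be a rooted tree whose root is the wild type, each edge carries the conditional probability that the child event occurs given that its parent has occurred, and events whose parent is absent cannot occur. The event names, the probabilities, and the function name pattern_probability are hypothetical and chosen for illustration.

```python
ROOT = "wt"  # hypothetical label for the wild-type root


def pattern_probability(parent, prob, pattern):
    """Probability of observing exactly the events in `pattern`.

    parent  -- dict mapping each event to its parent (ROOT for top-level events)
    prob    -- dict mapping each event to P(event | parent occurred)
    pattern -- iterable of observed events (the root is always present)
    """
    present = set(pattern) | {ROOT}
    p = 1.0
    for event, par in parent.items():
        if par in present:
            # The parent occurred, so this edge either fired or did not.
            p *= prob[event] if event in present else 1.0 - prob[event]
        elif event in present:
            # Event without its parent: incompatible with the tree.
            return 0.0
    return p


# Assumed toy tree: wt -> A -> B
toy_parent = {"A": ROOT, "B": "A"}
toy_prob = {"A": 0.5, "B": 0.4}
```

Under these assumptions the pattern {B} has probability zero, because B cannot occur before A, and the probabilities of the compatible patterns {}, {A}, and {A, B} sum to one.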
Quality measures for Affymetrix chips
Dr. Julia Brettschneider
Department of Statistics, UC Berkeley
With microarray technology getting more established in many branches of life science
research, scientists in both academia and corporate environments raise
their expectations for reliability and reproducibility of the
measurements. The quality of microarray data has emerged as a new
research topic suited to be approached cooperatively by
biotechnologists and statisticians. While there is a quality report
included in the MAS 5.0 output, the community of Affymetrix users is
still far from having established uniformly applied quality standards.
We introduce several new tools for both spatial and numerical quality
assessment. Our quality measures are based on probe level and
probeset level information obtained as a by-product of RMA (Irizarry et
al.). They provide convenient ways to search for individual chips of
poor quality, for quality trends over time, and for systematic quality
patterns related to experimental conditions or sample properties.
In an attempt to capture a variety of quality problems, we test our
quality measures on data sets from very different sources, ranging from
a small lab experiment with Drosophila to a multi-site study on human
brains.
Detecting Gene Interaction in Affected Sib-Pair
Linkage Analysis
Ingileif B. Hallgrimsdottir
Department of Statistics, UC Berkeley
Linkage analysis has proved to be a valuable approach for identifying
disease genes associated with Mendelian disorders; to date, around 1200
genes have been mapped. However, success stories are scarce when it
comes to complex disorders, in which both environmental factors and
many, possibly interacting, genes contribute to disease susceptibility.
I will present recent work on detecting (statistical) interaction from
affected sib-pair data, i.e. data where the families considered
consist of parents and two affected children. I will present a
new parametrization of the joint IBD probabilities at two loci that
allows us to model interaction. I will then discuss how this new
parametrization relates to variance-components and how it can be used
to develop tests for two-locus linkage.
Joint work with Terry Speed.
Environmental Exposures and the Molecular
Epidemiology of Childhood Leukemia: The Northern California
Childhood Leukemia Study
Dr. Catherine Metayer
Buffler Group, School of Public Health, UC Berkeley
Efficient Computation of Close Upper and Lower
Bounds on the Number of Recombinations Needed in Evolutionary History
Professor Daniel Gusfield
Department of Computer Science, UC Davis
Meiotic recombination takes two equal-length sequences and produces a third
sequence of the same length consisting of some prefix of one of the
sequences, followed by a suffix of the other sequence. Meiotic
recombination is one of the principal evolutionary forces responsible
for shaping genetic variation within species, and other forms of
recombination allow the sharing of genetic material between species.
Efforts to deduce patterns of historical recombination or to estimate
the frequency or the location of recombination are central to
modern-day genetics, for example in "association mapping".
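
The definition of recombination given above can be sketched in a few lines. This is an illustrative sketch only: the function names recombine and all_recombinants and the representation of sequences as Python strings are assumptions made here, not part of the talk.

```python
def recombine(a: str, b: str, k: int) -> str:
    """Return the first k characters of a followed by b from position k on."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not 0 <= k <= len(a):
        raise ValueError("breakpoint out of range")
    return a[:k] + b[k:]


def all_recombinants(a: str, b: str) -> set:
    """All sequences obtainable from a and b by one recombination with a
    non-empty prefix and a non-empty suffix, in either order."""
    out = set()
    for k in range(1, len(a)):
        out.add(recombine(a, b, k))
        out.add(recombine(b, a, k))
    return out
```

For example, recombine("00000", "11111", 2) yields "00111": a prefix of length two from the first sequence followed by the matching suffix of the second.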
In studying recombination, a common underlying problem is to determine the
*minimum* number of recombinations needed to generate a given set of
molecular sequences from an ancestral sequence (which may or may not be
known), using some specified model of the permitted site mutations. The
common assumption for SNP sites is the *infinite sites model* in
population genetics, i.e., that any site (in the study) can mutate at
most once in the entire history of the sequences, so each site can take
on only two states, and the extant sequences are binary sequences.
We define Rmin(M) to be the minimum number of recombinations needed to
generate a set of sequences M from any ancestral sequence, allowing
only one mutation per site over the entire history of the sequences. No
polynomial-time algorithm to compute Rmin(M) is known, and a variation
of the problem is known to be NP-hard. Song and Hein developed an
algorithm that computes Rmin(M) exactly, but takes super-exponential
time. There are also polynomial-time algorithms that compute
Rmin(M) in special cases that arise frequently when the recombination
rate is "modest" (Gusfield et al.).
Since there is no known efficient method to compute Rmin(M) exactly,
several papers have considered efficient computation of *lower bounds*
on Rmin. By far, the best method (balancing both time and accuracy) is
encoded in a program called RecMin, written by Simon Myers, and based on
"the haplotype lower bound". RecMin requires the setting of parameters
which affect
both the computation time used, and the quality of the result -- the
bound and the computation time are both non-decreasing with increasing
parameter values. We define the "optimal RecMin bound" as the
lower bound that RecMin would produce if the parameters were set to
their maximum possible values. In general, it is not
feasible to use RecMin to compute the optimal RecMin bound.
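
The abstract names the haplotype lower bound without defining it. As it is usually stated, for binary sequences under the infinite sites model the bound on a chosen set of sites is the number of distinct rows, minus the number of distinct non-constant columns, minus one. The sketch below assumes that formulation and takes the best bound over all subsets of at most max_size sites by brute force. It is not RecMin, it omits RecMin's combination of bounds across intervals, and it is not the ILP from the talk; the function names are hypothetical.

```python
from itertools import combinations


def haplotype_bound(rows, sites=None):
    """Distinct rows - distinct informative columns - 1, floored at zero."""
    if sites is None:
        sites = range(len(rows[0]))
    sites = list(sites)
    # Restrict every sequence to the chosen sites.
    projected = [tuple(row[j] for j in sites) for row in rows]
    distinct_rows = set(projected)
    columns = {tuple(p[i] for p in projected) for i in range(len(sites))}
    # Columns on which every sequence agrees need no mutation, so skip them.
    informative = {c for c in columns if len(set(c)) > 1}
    return max(0, len(distinct_rows) - len(informative) - 1)


def best_subset_bound(rows, max_size):
    """Largest haplotype bound over all site subsets of size <= max_size.

    The number of subsets grows exponentially with max_size, which is
    why this brute-force search is only usable on tiny inputs.
    """
    n_sites = len(rows[0])
    best = 0
    for size in range(1, min(max_size, n_sites) + 1):
        for sites in combinations(range(n_sites), size):
            best = max(best, haplotype_bound(rows, sites))
    return best
```

On the four sequences 000, 011, 101, 110 the bound over all three sites is 0, while restricting to any two sites gives 1. This illustrates why searching over subsets can improve the bound, and why the cost of that search rises with the parameter settings.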
In this talk we do several things. First, we introduce an algorithm that uses Integer
Linear Programming to compute the Optimal RecMin Bound. Second, with
ideas that dramatically speed up the ILP, we show through extensive
experimentation using simulated and real data sets that this approach
computes the Optimal RecMin Bound faster than RecMin (when RecMin can
compute it) and that it can efficiently compute the Optimal RecMin
Bound for problem sizes considered large in current applications (where
RecMin cannot compute the optimal bound). Third, we introduce
additional ideas that allow the algorithm to find lower bounds even
better than the Optimal RecMin Bound, and show through extensive
experiments that this approach remains practical on problem sizes
considered large today. Thus, we provide a practical method that
is superior to all other known practical lower bound methods.
Fourth, on the Upper Bound side, we present an efficient algorithm
that, given sequences M, constructs a history that generates M using
recombinations and one mutation per site. The number of recombinations
used in the history provides an upper bound on Rmin(M), but the history
itself is of independent interest. Fifth, and most importantly,
through extensive experimentation with simulated and real data, we show
that the computed upper and lower bounds are frequently very close, and
are *equal* with high frequency for a surprisingly large range of data.
Thus, with the use of a very effective lower bound and an efficient
algorithm for computing upper bounds, this approach allows the
efficient *exact* computation of Rmin(M) with high frequency in a large
range of data. This is an important empirical result that is
expected to have a very significant impact. Programs implementing
the new algorithms discussed in this talk are available on the web.
Joint work with Yun Song and Yufeng Wu.
Statistical methods for constructing
transcriptional regulatory networks
Dr. Biao Xing
Genentech
Transcriptional regulatory networks specify regulatory interactions among regulatory
genes and between regulatory genes and their target genes. Uncovering
transcriptional regulatory networks helps us to better understand the
complex cellular processes and responses. We present two statistical
methods for constructing transcriptional regulatory networks using gene
expression data, promoter sequences, and transcription factor binding
sites. Both start by identifying the transcription factors that are active in
each individual experiment, using a feature selection approach. The
first method employs a naive normal mixture model to classify the
transformed gene expression data for each transcription factor and uses
the posterior probability of being in the `induced' or `repressed'
classes to measure the strength of regulatory interactions. Evidence is
averaged across different experiments to infer the overall regulatory
network structures. The second method employs a causal inference model
to model the causal effect of a transcription factor on its potential
target genes. A nonparametric marginal structural model is built for
every transcription factor and gene pair, which also allows controlling
for potential confounding effects of other transcription factors on the
expression level of the gene. The p-value associated with the causal
parameter in each of these models is used to measure the regulatory
interaction strength. These results are used to infer the overall
regulatory interaction matrix and network structures. Simulation
studies and analysis of yeast data have shown that both methods are
capable of identifying significant transcriptional regulatory
interactions and uncovering underlying regulatory network structures,
and that the two methods can complement each other to maximize
significant findings.
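
The posterior class probabilities used by the first method can be illustrated with Bayes' rule on a normal mixture. This is a minimal sketch, assuming three components labelled repressed, unchanged, and induced with made-up weights, means, and standard deviations. The abstract does not give the number of components, the parameter values, or how they are fitted, so all of those are assumptions here.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def class_posteriors(x, components):
    """Posterior probability of each mixture class given the value x.

    components -- dict mapping label -> (weight, mean, sd)
    """
    joint = {
        label: weight * normal_pdf(x, mean, sd)
        for label, (weight, mean, sd) in components.items()
    }
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}


# Assumed, illustrative parameters (not taken from the talk).
ASSUMED_COMPONENTS = {
    "repressed": (0.1, -2.0, 1.0),
    "unchanged": (0.8, 0.0, 1.0),
    "induced": (0.1, 2.0, 1.0),
}
```

Calling class_posteriors(2.5, ASSUMED_COMPONENTS) returns one probability per class; the value for "induced" would play the role of the interaction strength in a single experiment.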
Joint work with Mark van der Laan.
Efficient Haplotype Analysis Tools
Dr. Eran Halperin
The International Computer Science Institute (ICSI)
Each person's genome contains two copies
of each chromosome, one inherited from the father and the other from
the mother. A person's genotype specifies the pair of bases at each
site, but does not specify which base occurs on which chromosome. The
sequence of each chromosome separately is called a haplotype. The
determination of the haplotypes within a population is essential for
understanding genetic variation and the inheritance of complex diseases.
Experimental determination of a person's component haplotypes is an
expensive and time-consuming process, and it is more attractive to
first determine genotypes experimentally and then use them to compute
haplotypes. This computation is not simple and is complicated by the
fact that current sequencing technology often gives the DNA sequence
with some missing nucleotide bases at some positions.
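
The ambiguity described above can be made concrete by listing every pair of haplotypes consistent with one genotype. This is a sketch under an assumed encoding: each site of a genotype is 0 or 1 when both chromosomes carry that allele and 2 when the site is heterozygous. The encoding and the function name are illustrative assumptions, missing data is not modelled, and this is not the HAP algorithm.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """All unordered haplotype pairs that could produce the genotype."""
    if any(g not in (0, 1, 2) for g in genotype):
        raise ValueError("genotype entries must be 0, 1, or 2")
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for choice in product((0, 1), repeat=len(het)):
        h1 = list(genotype)
        h2 = list(genotype)
        for i, allele in zip(het, choice):
            # At a heterozygous site the two chromosomes carry different alleles.
            h1[i] = allele
            h2[i] = 1 - allele
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)
```

A genotype with h heterozygous sites (h at least 1) has 2^(h-1) compatible pairs, so the number of candidate explanations doubles with each additional heterozygous site.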
In this talk I will introduce efficient and accurate maximum likelihood
based methods for the reconstruction of haplotype frequencies from
noisy haplotype data or from genotype data. I will also give a
high-level description of HAP (www.icsi.berkeley.edu/~heran/HAP), a
haplotype phase reconstruction tool. Finally, I will mention some
consequences of the use of HAP for the phasing of the genome wide
dataset released by Perlegen Sciences.
The main part of the talk is joint work with Elad Hazan (Princeton).
HAP is a joint project with Eleazar Eskin (UCSD).
COMODE - A web application for constrained motif
detection
Oliver Bembom
Division of Biostatistics, UC Berkeley
Interpreting HIV mutations to predict response to
antiretroviral therapy: The deletion/substitution/addition (DSA)
algorithm for the estimation of direct causal effects
Maya Petersen
Division of Biostatistics, UC Berkeley
Our goal is to estimate the causal effect of mutations detected in the HIV strains
infecting a patient on clinical virologic response to specific
antiretroviral drugs and drug combinations. We consider the
following data structure: 1) viral genotype, which we summarize as the
presence or absence of each viral mutation considered by the Stanford
HIV Database as likely to have some effect on virologic response to
antiretroviral therapy; 2) drug regimen initiated following
assessment of viral genotype (the regimen may involve changing some or
all of the drugs in a patient's previous regimen); and 3) change in
plasma HIV RNA level (viral load) over baseline at twelve and
twenty-four weeks after starting this regimen.
The effects of a set of mutations on virologic response are heavily confounded by past
treatment. In addition, viral mutation profiles are often used by
physicians to make treatment choices; we are interested in the direct
causal effect of mutations on virologic outcome, not mediated by choice
of other drugs in a patient's regimen. Finally, the need to consider
multiple mutations and treatment history variables, as well as
multi-way interactions between these variables, results in a
high-dimensional modeling problem. This application thus requires
data-adaptive estimation of the direct causal effect of a set of
mutations on viral load under a particular drug, controlling for
confounding and blocking the effect the mutations have on the
assignment of other drugs. We developed such an algorithm based on a
mix of the direct effect causal inference framework and the
data-adaptive regression deletion/substitution/addition (DSA) algorithm.
Mapping Evolutionary Pathways of HIV-1 Drug
Resistance using Conditional Selection Pressure
Professor Christopher Lee
Center for Bioinformatics, UCLA
Multiple Testing and Error
Control in Graphical Model Selection
Dr. Mathias Drton
Department of Mathematics, UC Berkeley
Local False Discovery Rates
Professor Brad Efron
Department of Statistics, Stanford University