PH 292
Statistics and Genomics
Seminar
Fall 2004
Thursday, September 16th
Quantification and Visualization of LD Patterns
and Identification of Haplotype Blocks
Yan Wang
Division of Biostatistics, UC Berkeley
Classical measures of linkage disequilibrium (LD) between two loci,
based only on the joint distribution of alleles at these loci, present
noisy patterns. In this work, we propose a new distance-based LD
measure, R, which takes into account multilocus
haplotypes around the two loci in order to exploit information from
neighboring loci. The LD measure R yields a matrix of pairwise
distances between markers, based on the correlation between the lengths
of shared haplotypes among chromosomes around these markers. Data
analysis demonstrates that visualization of LD patterns through the R
matrix reveals more deterministic patterns, with much less noise, than
using classical LD measures. Moreover, the patterns are highly
compatible with recently suggested models of haplotype block structure.
We propose to apply the new LD measure to define haplotype blocks
through cluster analysis. Specifically, we present a distance-based
clustering algorithm, DHPBlocker, which performs hierarchical
partitioning of an ordered sequence of markers into disjoint and
adjacent blocks with a hierarchical structure. The proposed method
integrates information on the two main existing criteria in defining
haplotype blocks, namely, LD and haplotype diversity, through the use
of silhouette width and description length as cluster validity
measures, respectively. The new LD measure and clustering procedure are
applied to single nucleotide polymorphism (SNP) datasets from the human
5q31 region (Daly et al. 2001) and the class II region of the human
major histocompatibility complex (Jeffreys et al. 2001). Our results
are in good agreement with published results. In addition, analyses
performed on different subsets of markers indicate that the method is
robust with regard to the allele frequencies and density of the
genotyped markers. Unlike previously proposed methods, our new
cluster-based method can uncover hierarchical relationships among
blocks and can be applied to polymorphic DNA markers or amino acid
sequence data.
Reference:
Y. Wang and S. Dudoit (2004). Quantification and visualization
of LD patterns and identification of haplotype blocks. Technical Report
#150, Division of Biostatistics, UC Berkeley.
http://www.bepress.com/ucbbiostat/paper150
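The core idea behind the R measure, correlating, across pairs of chromosomes, the lengths of shared haplotypes around each marker, can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' exact definition: the function names and the use of one minus the correlation as the distance are assumptions.

```python
import numpy as np

def shared_length(h1, h2, m):
    """Length (in markers) of the run of matching alleles shared by
    haplotypes h1 and h2 around marker index m; 0 if they differ at m."""
    if h1[m] != h2[m]:
        return 0
    left, right = m, m
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    while right < len(h1) - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1

def ld_distance_matrix(haps):
    """haps: (n_chromosomes, n_markers) array of alleles.  For every
    marker, collect the shared-haplotype lengths over all chromosome
    pairs, then return one minus the marker-by-marker correlation of
    these length vectors as a distance matrix."""
    n, p = haps.shape
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    lengths = np.array([[shared_length(haps[i], haps[j], m)
                         for (i, j) in pairs] for m in range(p)])
    return 1.0 - np.corrcoef(lengths)
```

Markers lying in the same region of strong LD then receive near-zero distances, which is what makes such a matrix usable as input to a distance-based clustering procedure like the DHPBlocker algorithm described above.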
Thursday, September 23rd
Multiple Testing Procedures for Control of the
Generalized Family Wise Error Rate and Proportion of False Positives
Professor Mark van der Laan
Division of Biostatistics, UC Berkeley
A fundamental tool in the analysis of genomic (and, in general, high
dimensional) data is a valid multiple testing procedure controlling a
specified Type-I error rate.
In a series of articles (Pollard and van der Laan, 2004; Dudoit et
al., 2004; van der Laan et al., 2004), we have provided, for
general hypotheses and test statistics, single-step and step-down
resampling-based multiple testing procedures that asymptotically
control the family-wise error rate at a specified level alpha. The
proposed procedures differ from currently used single-step and
step-down procedures in the choice of null distribution and, as a
consequence, can be shown to provide asymptotic control of the
family-wise error rate in general (avoiding the need for the subset
pivotality condition). In this talk, we discuss the choice of null distribution
and its bootstrap estimate, and we show that any multiple testing
procedure (asymptotically) controlling family wise error at level alpha
can be augmented into 1) a multiple testing procedure (asymptotically)
controlling the generalized family wise error (GFWE) (i.e., the
probability of having more than k false positives) at level alpha and
2) a multiple testing procedure (asymptotically) controlling the
proportion of false positives (PFP) at user supplied proportion q at
level alpha. Given the multiple testing procedure controlling FWE, our
proposed procedures involve only very minor additional computations,
and the adjusted p-values of our procedures are trivial functions of
the adjusted p-values of the FWE-procedure. We also show some
simulation results comparing different proposed multiple testing
procedures.
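The augmentation results can be sketched concretely: starting from FWER-adjusted p-values, the gFWER(k)-controlling procedure shifts the sorted adjusted p-values down by k positions, and the TPPFP(q)-controlling procedure rejects the FWER rejections plus a proportionate number of extra hypotheses. The sketch below is a simplified reading of van der Laan et al. (2004); the function names are invented, and details should be checked against the paper.

```python
import numpy as np

def gfwer_adjusted(fwer_adj_p, k):
    """gFWER(k) augmentation: the k smallest FWER-adjusted p-values
    become 0, and the remaining ones shift down by k positions."""
    p = np.asarray(fwer_adj_p, dtype=float)
    k = min(k, len(p))
    order = np.argsort(p)
    out = np.empty_like(p)
    out[order] = np.concatenate([np.zeros(k), p[order][:len(p) - k]])
    return out

def tppfp_reject(fwer_adj_p, q, alpha):
    """TPPFP(q) augmentation: reject the r hypotheses the FWER procedure
    rejects at level alpha, plus the floor(q*r/(1-q)) next most
    significant ones, so the proportion of added rejections is <= q."""
    p = np.asarray(fwer_adj_p, dtype=float)
    r = int(np.sum(p <= alpha))
    extra = int(np.floor(q * r / (1.0 - q)))
    order = np.argsort(p)
    rejected = np.zeros(len(p), dtype=bool)
    rejected[order[:min(r + extra, len(p))]] = True
    return rejected
```

Both steps use only the FWER procedure's adjusted p-values, which is the sense in which the augmentations involve "very minor additional computations".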
Joint work with:
Sandrine Dudoit, Katherine Pollard, Merrill Birkner
Division of Biostatistics, University of California, Berkeley
References:
K.S. Pollard, M.J. van der Laan (2004), Choice of null
distribution in resampling-based multiple testing, Journal of
Statistical Planning and Inference, Volume 125, 85-101.
S. Dudoit, M.J. van der Laan, K.S. Pollard (2004), Multiple Testing.
Part I. Single-Step Procedures for Control of General Type I Error
Rates, Statistical Applications in Genetics and Molecular Biology Vol.
3: No. 1, Article 13.
http://www.bepress.com/sagmb/vol3/iss1/art13
M.J. van der Laan, S. Dudoit, K.S. Pollard (2004), Augmentation
Procedures for Control of the Generalized Family-Wise Error Rate and
Tail Probabilities for the Proportion of False Positives, Statistical
Applications in Genetics and Molecular Biology Vol. 3: No. 1, Article
15.
http://www.bepress.com/sagmb/vol3/iss1/art15
M.J. van der Laan, S. Dudoit, K.S. Pollard (2004), Multiple Testing.
Part II. Step-Down Procedures for Control of the Family-Wise Error
Rate, Statistical Applications in Genetics and Molecular Biology Vol.
3: No. 1, Article 14.
http://www.bepress.com/sagmb/vol3/iss1/art14
Thursday, September 30th
Microarray Gene Expression Data with Linked
Survival Phenotypes: Diffuse Large-B-Cell Lymphoma Revisited
Professor Mark Segal
Department of Epidemiology and Biostatistics, University of California, San Francisco
Regression analyses, wherein (continuous) phenotypes are related to gene
expression obtained from microarray experiments, must accommodate
defining attributes of such data: high dimensional covariates (genes),
sparse samples (arrays) (p >> n), and complex between-gene
dependence. Censored survival phenotypes are additionally
complicating. A series of high-profile studies relating gene
expression to post-therapy DLBCL survival provides examples. I
initially focus on the "lymphochip"-expression data and analysis of
Rosenwald et al., (NEJM, 2002). After describing relationships
between the analyses performed and gene harvesting (Hastie et al.,
Genome Biology, 2001) and indicating the potential for artifactual
solutions, I argue for the utility of regularized approaches, in
particular LARS-Lasso (Efron et al., Annals of Statistics, 2004).
While these methods have been extended to the proportional hazards /
partial likelihood setting, the resultant algorithms are
computationally burdensome. I develop residual-based
approximations that alleviate this burden yet perform comparably.
I conclude by briefly discussing some cross-study comparisons and
outlining possibilities for further work.
Thursday, October 7th
Error Control in Multiple Testing
Professor Joseph P. Romano
Department of Statistics, Stanford University
Consider the multiple testing problem of testing s null hypotheses.
In this talk, stepwise methods are constructed under various notions
of error control.
In the first part of the talk, we assume a parametric family of
distributions which satisfies a certain monotonicity assumption.
Attention is restricted to procedures that control the familywise
error rate (FWE) in the strong sense and which satisfy a monotonicity
condition. Under these assumptions, we prove certain maximin
optimality results for the well-known stepdown and stepup procedures.
In the second part, we consider the general problem of constructing
methods that control the FWE in a general (nonparametric) setting. In
order to improve upon the Bonferroni method or Holm's (1979) stepdown
method, Westfall and Young (1993) make effective use of resampling to
construct stepdown methods that implicitly estimate the dependence
structure of the test statistics. However, their methods depend on an
assumption called subset pivotality. We will show how to construct
methods that control the FWE, both in finite and large samples. A key
ingredient is monotonicity of critical values, which allows one to
effectively reduce the multiple testing problem of controlling the FWE
to the single testing problem of controlling the probability of a
Type I error. Resampling methods are then incorporated into the
stepwise schemes.
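As a concrete point of reference for the stepdown schemes and monotone critical values under discussion, Holm's (1979) procedure compares the ordered p-values p_(1) <= ... <= p_(s) against the monotone sequence alpha/s, alpha/(s-1), ..., alpha, stopping at the first failure. A minimal sketch of the marginal version, without any resampling refinement:

```python
import numpy as np

def holm_stepdown(pvals, alpha):
    """Holm (1979) stepdown procedure controlling the FWE in the strong
    sense: step through the ordered p-values, rejecting while
    p_(i) <= alpha / (s - i + 1), and stop at the first failure."""
    p = np.asarray(pvals, dtype=float)
    s = len(p)
    reject = np.zeros(s, dtype=bool)
    for step, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (s - step):
            reject[idx] = True
        else:
            break  # monotone critical values: later hypotheses cannot be rejected
    return reject
```

Resampling-based stepdown methods improve on these fixed denominators by estimating critical values from the joint distribution of the test statistics.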
In the final part of the talk, alternative measures of error control
will be discussed. Explicit constructions of stepwise procedures for
these alternative measures will be presented as well.
(This talk is based on collaborations with Erich Lehmann, Juliet
Shaffer, and Michael Wolf.)
Thursday, October 14th
Linkage Disequilibrium Gene Mapping
Ingileif B. Hallgrimsdottir
Department of Statistics, UC Berkeley
If we assume that a disease-causing variant arose by mutation in an
individual several generations ago, then many of the individuals in
the population affected with the disease today can be assumed to be
descendants of that ancestor. They will thus share not only the gene
variant but also a segment of the ancestral haplotype around the
locus; in other words, we expect to observe linkage disequilibrium
(LD) around the trait locus. We can exploit this LD to map the gene
in a case-control study in which we search for haplotypes that are
shared in excess among the cases.
However, such studies require a very dense set of markers, so until
recently it has not been feasible to do genome-wide scans, and LD
mapping has mainly been used for fine-mapping after a locus or region
has been identified by linkage analysis. Methods developed for
fine-mapping through identification of shared haplotypes include,
e.g., DHSmap (McPeek and Strahs, 1999) and BLADE (Liu et al., 2000).
Unfortunately, they can only be used when the number of markers is
relatively small and so are not applicable to genome-wide studies.
We have developed a new non-parametric method which can be used both
for a genome-wide association scan and for fine-mapping. We use the
algorithm presented in Haplotype Pattern Mining (HPM) (Toivonen et
al. 2000) to search for shared patterns (haplotypes), but we also
provide estimates of possible ancestral haplotypes and propose a new
statistic based on haplotype sharing. We use a permutation test to
assess the significance of the sharing statistic at each marker. A
comparison to existing methods will be given.
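The permutation test for the sharing statistic can be sketched generically: permute the case/control labels, recompute the statistic, and report the proportion of permuted values at least as extreme as the observed one. The sharing statistic is abstracted here as a user-supplied function; this illustrates the general scheme, not the authors' implementation.

```python
import numpy as np

def permutation_pvalue(stat_fn, data, labels, n_perm=999, seed=0):
    """One-sided permutation p-value for stat_fn(data, labels):
    large values of the statistic count as evidence of excess sharing."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(data, labels)
    count = sum(stat_fn(data, rng.permutation(labels)) >= observed
                for _ in range(n_perm))
    # add-one correction so the p-value is never exactly zero
    return (count + 1) / (n_perm + 1)
```

At each marker, data would hold per-chromosome sharing scores around that marker, and stat_fn would measure excess sharing among cases, for instance a difference in mean sharing between cases and controls.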
Statistical Issues in the Analysis of Two-Dimensional Difference Gel Electrophoresis Experiments
Dr. Imola K. Fodor
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
Two-dimensional difference gel electrophoresis (2D DIGE) is a
technology for simultaneously measuring the expression of thousands
of proteins in different
biological samples. Many issues encountered in the analysis of 2D
DIGE data are similar to problems that arise in the analysis of
microarray experiments: proper experimental design, normalization of
the data, multiple hypothesis testing, and the quest for improved test
statistics that exploit the common information across the
proteins. Beyond the problems shared with the microarray
community, the analysis of 2D DIGE data presents additional
difficulties in detecting the spots and matching them across the gels.
We describe the
basic 2D DIGE methodology and the state-of-the-art data analysis tools
available to experimenters. We argue that additional data preprocessing
and improved statistical tests may lead to more realistic assessments
of differential expression of proteins than the conclusions based on
current tools.
This is joint
work with David O. Nelson.
Searching for Anomalous Gene Expression in Brain Tissue Samples of Alzheimer's Patients
Professor Alan E. Hubbard
Divisions of Environmental Health Sciences
and Biostatistics,
UC Berkeley
Finding genetic
markers for Alzheimer's disease (AD) has been difficult due to the
complexity of the disease and the overlap of at least its early-stage
markers with normal aging. Another related complexity is that
clinically normal subjects can exhibit considerable AD-like
pathology, making the criteria for distinguishing subjects with
normal aging, mild cognitive impairment, or incipient AD arbitrary
in the absence of
clinical data. In this study, we attempt to find gene expression
and proteomic markers that distinguish subjects with AD from a normal
pool of subjects. Data comes from the Religious Order Study (ROS)
collecting longitudinal data on around 900 individuals at more 40
seminaries and nunneries. The tissues obtained are from
individuals that have a complete pre-clinical and clinical history, and
complete neuropathological profile. The tissue (extracted for
RNA, QC'd, and amplified) for our study comes from frontal cortex
collected after death. The final analysis includes pooled-normal
(control) samples and 32 patients who suffered from Alzheimer's
disease; the relative gene expressions (patient vs.
pooled control) were examined using cDNA microarrays. In
addition, for a subset of both the Alzheimer patients and controls,
protein expression data was collected using gel-based proteomics,
including large-format 2-D gels, pre-fractionation techniques,
multiplexed fluorescent protein detection and orthogonal MALDI-TOF mass
spectrometry.
It is at least plausible that different mechanisms will characterize
the disease in different sub-groups of patients. Practically, we
want to allow for markers that only characterize sub-groups of AD
patients, and not necessarily the whole target population. For
example, one might expect that for some genes, expression will differ
between normal and diseased subjects only for a subset of the
diseased subjects. Thus, though using the mean expression might be
useful (and will work if at least a significant portion of the
subjects have anomalous expression), it can be insensitive to
anomalous expression in small subgroups. We therefore propose a
simple statistical approach using bootstrapping to control the
various error rates of interest (e.g., the family-wise error rate),
an approach based on the previous work of Dudoit et al. (2004). The
modest innovation comes down to the choice of test statistic, in our
case, quantiles. For example, if we wish to find those genes for
which at least 25% of the subjects are significantly differentially
over-expressed, we can test the 0.75 quantile of expression against
some (arbitrarily) chosen null value. Given the more complicated
nature of the proteomics data, the solution is itself more complicated
but based on the same basic procedure. Using this method of finding
differentially expressed genes and proteins, we also examine
relationships among these selected genes and proteins.
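The quantile idea can be sketched for a single gene: test whether the 0.75 quantile of the (log-ratio) expression values exceeds a chosen null value, using a bootstrap approximation of the null distribution. This is a hedged, single-gene illustration; the centering device used to impose the null and all names are assumptions, and the actual procedure controls error rates across genes following Dudoit et al. (2004).

```python
import numpy as np

def quantile_test(x, q=0.75, null_value=0.0, n_boot=2000, seed=0):
    """Bootstrap test of H0: the q-th quantile of x equals null_value,
    against the one-sided alternative that it is larger."""
    rng = np.random.default_rng(seed)
    observed = np.quantile(x, q)
    # impose the null by shifting x so its q-th quantile equals null_value
    centered = x - observed + null_value
    boot = np.array([np.quantile(rng.choice(centered, size=len(x)), q)
                     for _ in range(n_boot)])
    p = (np.sum(boot >= observed) + 1) / (n_boot + 1)
    return observed, p
```

Across many genes, the same per-gene statistic would be fed into the bootstrap-based multiple testing machinery to control the family-wise error rate.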
Thursday, November 4th
A Computational Method to Detect Epistatic Effects Contributing to a Quantitative Trait
Professor Philip Hanlon
Department of Mathematics, University of Michigan, Ann Arbor
We will discuss a random-walk-based algorithm to detect epistatic
effects contributing
to a quantitative trait. We will begin with the motivating
problem for this work - analysis of the UM-HET dataset. We
will then present the algorithm and discuss performance of the
algorithm when applied to synthetic datasets. We will conclude
with findings obtained when the algorithm was applied to the UM-HET
dataset.
Thursday, November 18th
The Biostatistics Core of the 13th International HLA Working Group
Professor Glenys Thomson
Department of Integrative Biology, UC Berkeley
The 13th International HLA Workshop provided for the first time a complete
definition of all allelic variation for the classical HLA class I (A,
B, and C) and II (DRB1, DQA1, DQB1, and DPB1) genes across ethnic
groups (Anthropology/Human Diversity project). We implemented an
analysis package (PyPop) for comprehensive analyses of HLA multi-locus
population genetic variation. A primary feature of the package is that
it allows integration of statistics across large numbers of data-sets.
We completed a survey of data from 96 populations contributed to the
Anthropology/Human Diversity project. This initial survey allowed us
to characterize levels of variation, test for natural selection using
the Ewens-Watterson homozygosity test, quantify population
differentiation, and characterize multi-locus variation.
Using a matched design study we examined whether HLA region genes
additional to the known peptide-presenting molecules (the classical
class I and II genes) contribute to disease (type 1 diabetes,
rheumatoid arthritis, celiac disease, narcolepsy, and ankylosing
spondylitis) (HLA and Disease project). The analytical strategies
involved stratification techniques to remove the effects of linkage
disequilibrium with the class I or class II genes known to be directly
involved in disease susceptibility. Statistical analyses were then
based on examining variation in eight HLA region microsatellite loci
using genotype matched cases/controls, the homozygous parent TDT, and
the haplotype method. The results for all diseases from our studies,
and from the literature, while implicating additional genes in the HLA
region, show extensive heterogeneity; this is reminiscent of non-HLA
genes in complex diseases.
Links
PyPop:
http://allele5.biol.berkeley.edu/pypop/
dbMHC:
http://www.ncbi.nlm.nih.gov/projects/mhc/
Anthropology data (part of dbMHC): http://www.ncbi.nlm.nih.gov/projects/mhc/ihwg.fcgi?ID=9&cmd=PRJOV
Thursday, December 2nd
Multiple Testing Procedures for Controlling Tail Probability Error Rates: Comparison and Application
Merrill D. Birkner
Division of Biostatistics, UC Berkeley
This presentation will focus on various marginal and joint multiple
testing procedures (MTPs) for controlling the generalized family-wise
error rate (gFWER) and the tail probability of the proportion of
false positives (TPPFP). The techniques which will be compared are the
marginal MTPs proposed by Lehmann and Romano (2003) as well as the
general augmentation procedures proposed by van der Laan et al.
(2004). Augmentations of the following FWER-controlling MTPs will
be considered: the marginal single-step Bonferroni and step-down Holm
procedures and the joint single-step maxT procedure. The various gFWER- and
TPPFP-controlling procedures will be compared by simulation. Finally, a
brief application of joint multiple testing procedures to HIV-1
sequence data will be presented.
Thursday, December 9th
Designing Estimators for Low-Level Expression Analysis
Earl Hubbell
Affymetrix
The analysis of gene expression using oligonucleotide arrays commonly
requires estimating the expression level of a transcript using
information from multiple probes. Many transcripts are expressed at
such low levels that nonspecific hybridization is a significant
proportion of the observed probe intensity, and so it is an interesting
problem to design estimators that function well on transcripts that
have concentrations near or at zero. Working from simple assumptions
about the behavior of probes, PLIER is an M-estimator, model-based
framework for finding expression estimates that is designed to handle
near-background probe intensities well with minimal positive bias to
the results. While the estimates from PLIER are by design not variance
stabilized, PLIER shows good performance at detecting differential
change, and can be variance stabilized by standard means.
Slides
http://mbi.osu.edu/2004/ws1materials/hubbell.ppt
http://www.affymetrix.com/corporate/events/seminar/microarray_workshop.affx