PH 292
Statistics and Genomics
Seminar
Fall 2004
Thursday, September 16th
Quantification and Visualization of LD Patterns
and Identification of Haplotype Blocks
Yan Wang
Division of Biostatistics, UC Berkeley
Classical measures of linkage disequilibrium (LD) between two loci,
based only on the joint distribution of alleles at these loci, present
noisy patterns. In this work, we propose a new distance-based LD
measure, R, which takes into account multilocus
haplotypes around the two loci in order to exploit information from
neighboring loci. The LD measure R yields a matrix of pairwise
distances between markers, based on the correlation between the lengths
of shared haplotypes among chromosomes around these markers. Data
analysis demonstrates that visualization of LD patterns through the R
matrix reveals more deterministic patterns, with much less noise, than
using classical LD measures. Moreover, the patterns are highly
compatible with recently suggested models of haplotype block structure.
We propose to apply the new LD measure to define haplotype blocks
through cluster analysis. Specifically, we present a distance-based
clustering algorithm, DHPBlocker, which performs hierarchical
partitioning of an ordered sequence of markers into disjoint and
adjacent blocks with a hierarchical structure. The proposed method
integrates information on the two main existing criteria in defining
haplotype blocks, namely, LD and haplotype diversity, through the use
of silhouette width and description length as cluster validity
measures, respectively. The new LD measure and clustering procedure are
applied to single nucleotide polymorphism (SNP) datasets from the human
5q31 region (Daly et al. 2001) and the class II region of the human
major histocompatibility complex (Jeffreys et al. 2001). Our results
are in good agreement with published results. In addition, analyses
performed on different subsets of markers indicate that the method is
robust with regard to the allele frequencies and density of the
genotyped markers. Unlike previously proposed methods, our new
cluster-based method can uncover hierarchical relationships among
blocks and can be applied to polymorphic DNA markers or amino acid
sequence data.
Reference:
Y. Wang and S. Dudoit (2004). Quantification and visualization
of LD patterns and identification of haplotype blocks. Technical Report
#150, Division of Biostatistics, UC Berkeley.
http://www.bepress.com/ucbbiostat/paper150
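The core idea behind the R measure, correlating, across pairs of chromosomes, the lengths of shared haplotypes around each marker, can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' exact definition: the function names and the use of one minus the correlation as the distance are assumptions.

```python
import numpy as np

def shared_length(h1, h2, m):
    """Length (in markers) of the run of matching alleles shared by
    haplotypes h1 and h2 around marker index m; 0 if they differ at m."""
    if h1[m] != h2[m]:
        return 0
    left, right = m, m
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    while right < len(h1) - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1

def ld_distance_matrix(haps):
    """haps: (n_chromosomes, n_markers) array of alleles.  For every
    marker, collect the shared-haplotype lengths over all chromosome
    pairs, then return one minus the marker-by-marker correlation of
    these length vectors as a distance matrix."""
    n, p = haps.shape
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    lengths = np.array([[shared_length(haps[i], haps[j], m)
                         for (i, j) in pairs] for m in range(p)])
    return 1.0 - np.corrcoef(lengths)
```

Markers lying in the same region of strong LD then receive near-zero distances, which is what makes such a matrix usable as input to a distance-based clustering procedure like the DHPBlocker algorithm described above.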
Thursday, September 23rd
Multiple Testing Procedures for Control of the
Generalized Family Wise Error Rate and Proportion of False Positives
Professor Mark van der Laan
Division of Biostatistics, UC Berkeley
A fundamental tool in the analysis of genomic (and, in general, high
dimensional) data is a valid multiple testing procedure controlling a
specified Type-I error rate.
In a series of articles (Pollard and van der Laan, 2004; Dudoit et
al., 2004; van der Laan et al., 2004), we have provided, for
general hypotheses and test statistics, single-step and step-down
resampling-based multiple testing procedures that asymptotically
control the family-wise error rate at a specified level alpha. The
proposed procedures differ from currently used single-step and
step-down procedures in the choice of null distribution and, as a
consequence, can be shown to provide asymptotic control of the
family-wise error rate in general (avoiding the need for the subset
pivotality condition). In this talk, we discuss the choice of null distribution
and its bootstrap estimate, and we show that any multiple testing
procedure (asymptotically) controlling family wise error at level alpha
can be augmented into 1) a multiple testing procedure (asymptotically)
controlling the generalized family wise error (GFWE) (i.e., the
probability of having more than k false positives) at level alpha and
2) a multiple testing procedure (asymptotically) controlling the
proportion of false positives (PFP) at user supplied proportion q at
level alpha. Given the multiple testing procedure controlling FWE, our
proposed procedures involve only very minor additional computations,
and the adjusted p-values of our procedures are trivial functions of
the adjusted p-values of the FWE-procedure. We also show some
simulation results comparing different proposed multiple testing
procedures.
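The augmentation results can be sketched concretely: starting from FWER-adjusted p-values, the gFWER(k)-controlling procedure shifts the sorted adjusted p-values down by k positions, and the TPPFP(q)-controlling procedure rejects the FWER rejections plus a proportionate number of extra hypotheses. The sketch below is a simplified reading of van der Laan et al. (2004); the function names are invented, and details should be checked against the paper.

```python
import numpy as np

def gfwer_adjusted(fwer_adj_p, k):
    """gFWER(k) augmentation: the k smallest FWER-adjusted p-values
    become 0, and the remaining ones shift down by k positions."""
    p = np.asarray(fwer_adj_p, dtype=float)
    k = min(k, len(p))
    order = np.argsort(p)
    out = np.empty_like(p)
    out[order] = np.concatenate([np.zeros(k), p[order][:len(p) - k]])
    return out

def tppfp_reject(fwer_adj_p, q, alpha):
    """TPPFP(q) augmentation: reject the r hypotheses the FWER procedure
    rejects at level alpha, plus the floor(q*r/(1-q)) next most
    significant ones, so the proportion of added rejections is <= q."""
    p = np.asarray(fwer_adj_p, dtype=float)
    r = int(np.sum(p <= alpha))
    extra = int(np.floor(q * r / (1.0 - q)))
    order = np.argsort(p)
    rejected = np.zeros(len(p), dtype=bool)
    rejected[order[:min(r + extra, len(p))]] = True
    return rejected
```

Both steps use only the FWER procedure's adjusted p-values, which is the sense in which the augmentations involve "very minor additional computations".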
Joint work with:
Sandrine Dudoit, Katherine Pollard, Merrill Birkner
Division of Biostatistics, University of California, Berkeley
References:
K.S. Pollard, M.J. van der Laan (2004), Choice of null
distribution in resampling-based multiple testing, Journal of
Statistical Planning and Inference, Volume 125, 85-101.
S. Dudoit, M.J. van der Laan, K.S. Pollard (2004), Multiple Testing.
Part I. Single-Step Procedures for Control of General Type I Error
Rates, Statistical Applications in Genetics and Molecular Biology Vol.
3: No. 1, Article 13.
http://www.bepress.com/sagmb/vol3/iss1/art13
M.J. van der Laan, S. Dudoit, K.S. Pollard (2004), Augmentation
Procedures for Control of the Generalized Family-Wise Error Rate and
Tail Probabilities for the Proportion of False Positives, Statistical
Applications in Genetics and Molecular Biology Vol. 3: No. 1, Article
15.
http://www.bepress.com/sagmb/vol3/iss1/art15
M.J. van der Laan, S. Dudoit, K.S. Pollard (2004), Multiple Testing.
Part II. Step-Down Procedures for Control of the Family-Wise Error
Rate, Statistical Applications in Genetics and Molecular Biology Vol.
3: No. 1, Article 14.
http://www.bepress.com/sagmb/vol3/iss1/art14
Thursday, September 30th
Microarray Gene Expression Data with Linked
Survival Phenotypes: Diffuse Large-B-Cell Lymphoma Revisited
Professor Mark Segal
Department of Epidemiology and Biostatistics, University of California, San Francisco
Regression analyses, wherein (continuous) phenotypes are related to gene
expression obtained from microarray experiments, must accommodate
defining attributes of such data: high dimensional covariates (genes),
sparse samples (arrays) (p >> n), and complex between-gene
dependence. Censored survival phenotypes are additionally
complicating. A series of high-profile studies relating gene
expression to post-therapy DLBCL survival provides examples. I
initially focus on the "lymphochip"-expression data and analysis of
Rosenwald et al., (NEJM, 2002). After describing relationships
between the analyses performed and gene harvesting (Hastie et al.,
Genome Biology, 2001) and indicating the potential for artifactual
solutions, I argue for the utility of regularized approaches, in
particular LARS-Lasso (Efron et al., Annals of Statistics, 2004).
While these methods have been extended to the proportional hazards /
partial likelihood setting, the resultant algorithms are
computationally burdensome. I develop residual-based
approximations that alleviate this burden yet perform comparably.
I conclude by briefly discussing some cross-study comparisons and
outlining possibilities for further work.
Thursday, October 7th
Error Control in Multiple Testing
Professor Joseph P. Romano
Department of Statistics, Stanford University
Consider the multiple testing problem of testing s null hypotheses.
In this talk, stepwise methods are constructed under various notions
of error control.
In the first part of the talk, we assume a parametric family of
distributions which satisfies a certain monotonicity assumption.
Attention is restricted to procedures that control the familywise
error rate (FWE) in the strong sense and which satisfy a monotonicity
condition. Under these assumptions, we prove certain maximin
optimality results for the well-known stepdown and stepup procedures.
In the second part, we consider the general problem of constructing
methods that control the FWE in a general (nonparametric) setting. In
order to improve upon the Bonferroni method or Holm's (1979) stepdown
method, Westfall and Young (1993) make effective use of resampling to
construct stepdown methods that implicitly estimate the dependence
structure of the test statistics. However, their methods depend on an
assumption called subset pivotality. We will show how to construct
methods that control the FWE, both in finite and large samples. A key
ingredient is monotonicity of critical values, which allows one to
effectively reduce the multiple testing problem of controlling the FWE
to the single testing problem of controlling the probability of a
Type I error. Resampling methods are then incorporated into the
stepwise schemes.
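As a concrete point of reference for the stepdown schemes and monotone critical values under discussion, Holm's (1979) procedure compares the ordered p-values p_(1) <= ... <= p_(s) against the monotone sequence alpha/s, alpha/(s-1), ..., alpha, stopping at the first failure. A minimal sketch of the marginal version, without any resampling refinement:

```python
import numpy as np

def holm_stepdown(pvals, alpha):
    """Holm (1979) stepdown procedure controlling the FWE in the strong
    sense: step through the ordered p-values, rejecting while
    p_(i) <= alpha / (s - i + 1), and stop at the first failure."""
    p = np.asarray(pvals, dtype=float)
    s = len(p)
    reject = np.zeros(s, dtype=bool)
    for step, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (s - step):
            reject[idx] = True
        else:
            break  # monotone critical values: later hypotheses cannot be rejected
    return reject
```

Resampling-based stepdown methods improve on these fixed denominators by estimating critical values from the joint distribution of the test statistics.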
In the final part of the talk, alternative measures of error control
will be discussed. Explicit constructions of stepwise procedures for
these alternative measures will be presented as well.
(This talk is based on collaborations with Erich Lehmann, Juliet
Shaffer, and Michael Wolf.)
Thursday, October 14th
Linkage Disequilibrium Gene Mapping
Ingileif B. Hallgrimsdottir
Department of Statistics, UC Berkeley
If we assume that a disease-causing variant arose by mutation in an
individual several generations ago, then many of the individuals in
the population affected with the disease today can be assumed to be
descendants of that ancestor. They will thus share not only the gene
variant but also a segment of the ancestral haplotype around the
locus; in other words, we expect to observe linkage disequilibrium
(LD) around the trait locus. We can exploit this LD to map the gene
in a case-control study in which we search for haplotypes that are
shared in excess among the cases.
However, such studies require a very dense set of markers, so until
recently it has not been feasible to do genome-wide scans, and LD
mapping has mainly been used for fine-mapping after a locus or region
has been identified by linkage analysis. Methods developed for
fine-mapping through identification of shared haplotypes include,
e.g., DHSmap (McPeek and Strahs, 1999) and BLADE (Liu et al., 2000).
Unfortunately, they can only be used when the number of markers is
relatively small and so are not applicable to genome-wide studies.
We have developed a new non-parametric method which can be used both
for a genome-wide association scan and for fine-mapping. We use the
algorithm presented in Haplotype Pattern Mining (HPM) (Toivonen et
al. 2000) to search for shared patterns (haplotypes), but we also
provide estimates of possible ancestral haplotypes and propose a new
statistic based on haplotype sharing. We use a permutation test to
assess the significance of the sharing statistic at each marker. A
comparison to existing methods will be given.
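The permutation test for the sharing statistic can be sketched generically: permute the case/control labels, recompute the statistic, and report the proportion of permuted values at least as extreme as the observed one. The sharing statistic is abstracted here as a user-supplied function; this illustrates the general scheme, not the authors' implementation.

```python
import numpy as np

def permutation_pvalue(stat_fn, data, labels, n_perm=999, seed=0):
    """One-sided permutation p-value for stat_fn(data, labels):
    large values of the statistic count as evidence of excess sharing."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(data, labels)
    count = sum(stat_fn(data, rng.permutation(labels)) >= observed
                for _ in range(n_perm))
    # add-one correction so the p-value is never exactly zero
    return (count + 1) / (n_perm + 1)
```

At each marker, data would hold per-chromosome sharing scores around that marker, and stat_fn would measure excess sharing among cases, for instance a difference in mean sharing between cases and controls.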
Statistical Issues in the Analysis of Two-Dimensional Difference Gel Electrophoresis Experiments
Dr. Imola K. Fodor
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
Two-dimensional difference gel electrophoresis (2D DIGE) is a
technology for simultaneously measuring the expression of thousands
of proteins in different
biological samples. Many issues encountered in the analysis of 2D
DIGE data are similar to problems that arise in the analysis of
microarray experiments: proper experimental design, normalization of
the data, multiple hypothesis testing, and the quest for improved test
statistics that exploit the common information across the
proteins. Beyond the problems shared with the microarray
community, the analysis of 2D DIGE data presents additional
difficulties in detecting the spots and matching them across the gels.
We describe the
basic 2D DIGE methodology and the state-of-the-art data analysis tools
available to experimenters. We argue that additional data preprocessing
and improved statistical tests may lead to more realistic assessments
of differential expression of proteins than the conclusions based on
current tools.
This is joint
work with David O. Nelson.
Searching for Anomalous Gene Expression in Brain Tissue Samples of Alzheimer's Patients
Professor Alan E. Hubbard
Divisions of Environmental Health Sciences
and Biostatistics,
UC Berkeley
Finding genetic
markers for Alzheimer's disease (AD) has been difficult due to the
complexity of the disease and the overlap of at least its early-stage
markers with normal aging. Another related complexity is that
clinically normal subjects can exhibit considerable AD-like
pathology, making the criteria for distinguishing subjects with
normal aging, mild cognitive impairment, or incipient AD arbitrary
in the absence of
clinical data. In this study, we attempt to find gene expression
and proteomic markers that distinguish subjects with AD from a normal
pool of subjects. Data comes from the Religious Order Study (ROS)
collecting longitudinal data on around 900 individuals at more 40
seminaries and nunneries. The tissues obtained are from
individuals that have a complete pre-clinical and clinical history, and
complete neuropathological profile. The tissue (extracted for
RNA, QC'd, and amplified) for our study comes from frontal cortex
collected after death. The final analysis includes pooled-normal
(control) samples and 32 patients who suffered from Alzheimer's
disease; the relative gene expressions (patient vs.
pooled control) were examined using cDNA microarrays. In
addition, for a subset of both the Alzheimer patients and controls,
protein expression data was collected using gel-based proteomics,
including large-format 2-D gels, pre-fractionation techniques,
multiplexed fluorescent protein detection and orthogonal MALDI-TOF mass
spectrometry.
It is at least plausible that different mechanisms will characterize
the disease in different sub-groups of patients. Practically, we
want to allow for markers that only characterize sub-groups of AD
patients, and not necessarily the whole target population. For
example, one might expect that for some genes, expression will differ
between normal and diseased subjects only for a subset of the
diseased subjects. Thus, though using the mean expression might be
useful (and will work if at least a significant portion of the
subjects have anomalous expression), it can be insensitive to
anomalous expression in small subgroups. We therefore propose a
simple statistical approach using bootstrapping to control the
various error rates of interest (e.g., the family-wise error rate),
an approach based on the previous work of Dudoit et al. (2004). The
modest innovation comes down to the choice of test statistic, in our
case, quantiles. For example, if we wish to find those genes for
which at least 25% of the subjects are significantly differentially
over-expressed, we can test the 0.75 quantile of expression against
some (arbitrarily) chosen null value. Given the more complicated
nature of the proteomics data, the solution is itself more complicated
but based on the same basic procedure. Using this method of finding
differentially expressed genes and proteins, we also examine
relationships among these selected genes and proteins.
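The quantile idea can be sketched for a single gene: test whether the 0.75 quantile of the (log-ratio) expression values exceeds a chosen null value, using a bootstrap approximation of the null distribution. This is a hedged, single-gene illustration; the centering device used to impose the null and all names are assumptions, and the actual procedure controls error rates across genes following Dudoit et al. (2004).

```python
import numpy as np

def quantile_test(x, q=0.75, null_value=0.0, n_boot=2000, seed=0):
    """Bootstrap test of H0: the q-th quantile of x equals null_value,
    against the one-sided alternative that it is larger."""
    rng = np.random.default_rng(seed)
    observed = np.quantile(x, q)
    # impose the null by shifting x so its q-th quantile equals null_value
    centered = x - observed + null_value
    boot = np.array([np.quantile(rng.choice(centered, size=len(x)), q)
                     for _ in range(n_boot)])
    p = (np.sum(boot >= observed) + 1) / (n_boot + 1)
    return observed, p
```

Across many genes, the same per-gene statistic would be fed into the bootstrap-based multiple testing machinery to control the family-wise error rate.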
Thursday, November 4th
A Computational Method to Detect Epistatic Effects Contributing to a Quantitative Trait
Professor Philip Hanlon
Department of Mathematics, University of Michigan, Ann Arbor
We will discuss a random-walk-based algorithm to detect epistatic
effects contributing
to a quantitative trait. We will begin with the motivating
problem for this work - analysis of the UM-HET dataset. We
will then present the algorithm and discuss performance of the
algorithm when applied to synthetic datasets. We will conclude
with findings obtained when the algorithm was applied to the UM-HET
dataset.
Thursday, November 18th
The Biostatistics Core of the 13th International HLA Working Group
Professor Glenys Thomson
Department of Integrative Biology, UC Berkeley
The 13th International HLA Workshop provided for the first time a complete
definition of all allelic variation for the classical HLA class I (A,
B, and C) and II (DRB1, DQA1, DQB1, and DPB1) genes across ethnic
groups (Anthropology/Human Diversity project). We implemented an
analysis package (PyPop) for comprehensive analyses of HLA multi-locus
population genetic variation. A primary feature of the package is that
it allows integration of statistics across large numbers of data-sets.
We completed a survey of data from 96 populations contributed to the
Anthropology/Human Diversity project. This initial survey allowed us
to characterize levels of variation, test for natural selection using
the Ewens-Watterson homozygosity test, quantify population
differentiation, and characterize multi-locus variation.
Using a matched design study we examined whether HLA region genes
additional to the known peptide-presenting molecules (the classical
class I and II genes) contribute to disease (type 1 diabetes,
rheumatoid arthritis, celiac disease, narcolepsy, and ankylosing
spondylitis) (HLA and Disease project). The analytical strategies
involved stratification techniques to remove the effects of linkage
disequilibrium with the class I or class II genes known to be directly
involved in disease susceptibility. Statistical analyses were then
based on examining variation in eight HLA region microsatellite loci
using genotype matched cases/controls, the homozygous parent TDT, and
the haplotype method. The results for all diseases from our studies,
and from the literature, while implicating additional genes in the HLA
region, show extensive heterogeneity; this is reminiscent of non-HLA
genes in complex diseases.
Links
PyPop:
http://allele5.biol.berkeley.edu/pypop/
dbMHC:
http://www.ncbi.nlm.nih.gov/projects/mhc/
Anthropology data (part of dbMHC): http://www.ncbi.nlm.nih.gov/projects/mhc/ihwg.fcgi?ID=9&cmd=PRJOV
Thursday, December 2nd
Multiple Testing Procedures for Controlling Tail Probability Error Rates: Comparison and Application
Merrill D. Birkner
Division of Biostatistics, UC Berkeley
This presentation will focus on various marginal and joint multiple
testing procedures (MTPs) for controlling the generalized family-wise
error rate (gFWER) and the tail probability of the proportion of
false positives (TPPFP). The techniques which will be compared are the
marginal MTPs proposed by Lehmann and Romano (2003) as well as the
general augmentation procedures proposed by van der Laan et al.
(2004). Augmentations of the following FWER-controlling MTPs will
be considered: the marginal single-step Bonferroni and step-down Holm
procedures and the joint single-step maxT procedure. The various gFWER- and
TPPFP-controlling procedures will be compared by simulation. Finally, a
brief application of joint multiple testing procedures to HIV-1
sequence data will be presented.
Thursday, December 9th
Designing Estimators for Low-Level Expression Analysis
Earl Hubbell
Affymetrix
The analysis of gene expression using oligonucleotide arrays commonly
requires estimating the expression level of a transcript using
information from multiple probes. Many transcripts are expressed at
such low levels that nonspecific hybridization is a significant
proportion of the observed probe intensity, and so it is an interesting
problem to design estimators that function well on transcripts that
have concentrations near or at zero. Working from simple assumptions
about the behavior of probes, PLIER is an M-estimator, model-based
framework for finding expression estimates that is designed to handle
near-background probe intensities well with minimal positive bias to
the results. While the estimates from PLIER are by design not variance
stabilized, PLIER shows good performance at detecting differential
change, and can be variance stabilized by standard means.
Slides
http://mbi.osu.edu/2004/ws1materials/hubbell.ppt
http://www.affymetrix.com/corporate/events/seminar/microarray_workshop.affx