Poster Session

Sunday, August 15th



The Application of a Single-Step Multiple Testing Procedure to Detect Genotype/Phenotype Associations
Merrill Birkner (1), Mélanie Courtine (2), Sandrine Dudoit (1), Karine Clément (2), Mark van der Laan (1), and Jean-Daniel Zucker (2)
(1) Division of Biostatistics, University of California, Berkeley
(2) LIM&BIO, Université Paris Nord & Université Paris 6 et Service de Médecine et Nutrition

Slides (PDF)

Obesity is a multifactorial disease: by definition, it can be caused or influenced by multiple genetic and environmental factors acting through gene-gene or gene-environment interactions. To assess the genetic composition of specific individuals, single nucleotide polymorphisms (SNPs) are often studied. These binary markers could reveal various causes of obesity and of other obesity-related phenotypes. The data for these analyses were obtained from the ObeLinks Project (LIM&BIO). The analyses combined unsupervised learning and multiple testing methods to analyze composite SNP genotypes and phenotypes. Unsupervised learning methods (Galois lattices) were applied to recode individual SNP genotypes into multilocus composite genotypes, in order to facilitate the detection of SNP interaction effects on the phenotype. The phenotypes analyzed in this dataset included a broad set of indices often measured when studying obesity. We used single-step multiple testing techniques to test the significance of the associations between these composite genotypes and the various phenotypes. The multiple testing procedures were applied to separately control the Family-Wise Error Rate (FWER), the generalized Family-Wise Error Rate (gFWER), and the Tail Probability of the Proportion of False Positives (TPPFP) [Dudoit et al. 2004; van der Laan et al. 2004]. The combination of these methods has proven to be a useful tool for assessing potential genotype/phenotype associations with obesity in this population. These methods could also prove useful for analyzing other genotype/phenotype relationships in subsequent biological studies.
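
As an illustration of what a single-step procedure controlling the FWER can look like, the following is a minimal Python sketch of single-step max-T adjusted p-values. It is not the authors' implementation: the permutation null distribution, the unstandardized mean-difference statistic, and all function and variable names (mean_diff, single_step_maxt, phenotype, genotypes) are assumptions made for this example, and the cited procedures may estimate the null distribution differently.

```python
import random


def mean_diff(values, labels):
    """Absolute difference in mean phenotype between carriers (1) and non-carriers (0).

    Assumes both groups are non-empty.
    """
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def single_step_maxt(phenotype, genotypes, n_perm=2000, seed=0):
    """Single-step max-T adjusted p-values under a permutation null (an assumption).

    phenotype: list of numeric values, one per individual.
    genotypes: list of 0/1 lists, one per (hypothetical) composite genotype.
    """
    rng = random.Random(seed)
    observed = [mean_diff(phenotype, g) for g in genotypes]
    exceed = [0] * len(genotypes)
    shuffled = list(phenotype)
    for _ in range(n_perm):
        # Permuting the phenotype keeps the dependence among genotypes intact.
        rng.shuffle(shuffled)
        max_stat = max(mean_diff(shuffled, g) for g in genotypes)
        # Every hypothesis is compared with the same maximum: this is what
        # makes the procedure single-step.
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy data: the first genotype splits the sample into low and high phenotypes,
# the second genotype is unrelated to the phenotype.
phenotype = [0.1 * (i % 3) for i in range(10)] + [5 + 0.1 * (i % 3) for i in range(10)]
genotypes = [[0] * 10 + [1] * 10, [i % 2 for i in range(20)]]
adjusted = single_step_maxt(phenotype, genotypes)
```

Rejecting every hypothesis whose adjusted p-value is at most alpha then controls the FWER at level alpha under the assumed null distribution.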


Software design for loss-based estimation with the D/S/A-algorithm
Kasper D. Hansen, Visiting Student Researcher
Division of Biostatistics, University of California, Berkeley
                                                                               
Slides (PDF)

Consider the general loss-based estimation methodology proposed by van der Laan and Dudoit (2003) for estimator construction, selection, and performance assessment. The main ingredients of the estimation road map are: a parameterization of the parameter space in terms of a set of basis functions, a loss function, and a set of so-called D/S/A-moves. In this estimation framework, the parameter of interest is defined as the risk minimizer for the specified loss function over the parameter space of interest. Candidate estimators are generated based on this (or possibly another) loss function using a deletion/substitution/addition (D/S/A) algorithm to search over the parameter space. Cross-validation is applied to select an optimal estimator among the candidates and to assess the overall performance of the resulting estimator. This general estimation framework encompasses a number of problems that have traditionally been treated separately in the statistical literature, such as multivariate outcome prediction and density estimation based on possibly censored data.
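
The cross-validation step of the road map can be sketched in Python as follows. This is a minimal illustration, not the interface of the R package discussed here: the two candidate estimators (a constant mean and a simple linear fit), the squared error loss, the fold assignment, and all names (fit_mean, fit_line, cv_risk, cv_select) are assumptions chosen to keep the example self-contained.

```python
def fit_mean(xs, ys):
    """Candidate estimator 1: predict the training mean everywhere."""
    mu = sum(ys) / len(ys)
    return lambda x: mu


def fit_line(xs, ys):
    """Candidate estimator 2: simple least-squares line (needs non-constant xs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def squared_error(y, pred):
    """User-supplied loss function; here the squared error loss."""
    return (y - pred) ** 2


def cv_risk(fit, xs, ys, loss, n_folds=5):
    """Cross-validated risk: average loss on held-out folds."""
    n = len(xs)
    total = 0.0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        predictor = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        total += sum(loss(ys[i], predictor(xs[i])) for i in test_idx)
    return total / n


def cv_select(candidates, xs, ys, loss, n_folds=5):
    """Return the name of the candidate with the smallest cross-validated risk."""
    risks = {name: cv_risk(fit, xs, ys, loss, n_folds) for name, fit in candidates.items()}
    return min(risks, key=risks.get), risks


xs = list(range(20))
ys = [2 * x + 1 for x in xs]
best, risks = cv_select({"mean": fit_mean, "line": fit_line}, xs, ys, squared_error)
```

Passing the loss and the candidate fitting functions as arguments mirrors the design goal stated below of letting users supply their own components.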
                                                                               
Applications to genomic data analysis include the prediction of biological and clinical outcomes (possibly censored) using microarray measures of gene expression, the identification of regulatory motifs in DNA sequences, and genetic mapping with single nucleotide polymorphisms (SNPs).
                                                                               
We address design issues related to the software implementation of this general loss-based estimation methodology in an R package. A flexible implementation should allow users to define and supply their own inputs for the main components of the road map, which may be high-level entities.
                                                                              
Joint work with: Sandrine Dudoit and Mark van der Laan.
Related technical reports:
         #124, 125, 126, 130, 135, 137, 142, 143
         Division of Biostatistics, UC Berkeley
         http://www.bepress.com/ucbbiostat
                                                                               
Related talks and posters at BIRS Workshop:
        Mark van der Laan
        Blythe Durbin
        Sandra Sinisi
                                                                               

An Empirical Bayes Approach for Interval Mapping of Expression Trait Loci (ETL) Data
Christina Kendziorski and Meng Chen
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison

Genetic linkage analysis has traditionally focused on mapping loci that affect one or more phenotypes. With microarray technology, we can consider mapping global patterns of gene expression, treating messenger RNA transcript abundances as quantitative traits. A number of groups have recently applied traditional QTL methods to the problem of mapping mRNA abundance measurements by considering each individual transcript as a quantitative trait. However, most QTL mapping methods were developed for the case where a small number of traits (often just one) are being mapped. In expression trait loci (ETL) mapping, thousands of traits are considered simultaneously, and the repeated application of individual tests is not the most efficient strategy. Here we present an empirical Bayes modeling approach to enable ETL interval mapping. The inefficiency of the single-trait method and the utility of the proposed approach are demonstrated using simulations and data from an F2 mouse cross in a study of diabetes.


Distribution for Gene Expression Data
Elizabeth Purdom and Susan Holmes
Department of Statistics, Stanford University

Slides (PDF)

The distribution of normalized gene expression values, while similar across arrays, is often far from normal regardless of the normalization method. Rather, the distribution tends toward heavy tails and varying degrees of asymmetry. The Asymmetric Laplace distribution (Kotz et al., 2001) generalizes the heavy-tailed Laplace (or double exponential) distribution to allow for asymmetry and gives a good fit for cDNA gene expression data. Having this reasonable parametric model for the distribution within arrays allows more detailed examination of methods of analysis. The Asymmetric Laplace gives a great deal of interpretability to the null distribution of gene expression and thus allows for evaluating parameters in normalization models. And while the distribution within an array does not give a gene's distribution across arrays, representations of the Asymmetric Laplace distribution suggest plausible alternatives for the distribution of a single gene's expression, where sample size is small and parametric models could give greater power.
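
For concreteness, the density of an asymmetric Laplace distribution can be sketched in Python as below. The parameterization used here (location m, rate lam, asymmetry kappa) is one common form and is an assumption of this example; it may differ in notation from the one used by the authors or by Kotz et al. With kappa = 1 it reduces to the symmetric Laplace density.

```python
import math


def asymmetric_laplace_pdf(x, m=0.0, lam=1.0, kappa=1.0):
    """Density of an asymmetric Laplace distribution (assumed parameterization).

    m: location (mode), lam: rate (> 0), kappa: asymmetry (> 0).
    The right tail decays at rate lam * kappa, the left tail at rate lam / kappa.
    """
    if lam <= 0 or kappa <= 0:
        raise ValueError("lam and kappa must be positive")
    norm = lam / (kappa + 1.0 / kappa)
    if x >= m:
        return norm * math.exp(-lam * kappa * (x - m))
    return norm * math.exp(-(lam / kappa) * (m - x))


# Midpoint-rule check that the density integrates to about 1 for a skewed case.
step = 0.001
total = sum(
    asymmetric_laplace_pdf(-40 + (i + 0.5) * step, kappa=2.0) * step
    for i in range(80000)
)
```

Values of kappa above 1 give a heavier left tail and values below 1 a heavier right tail, which is how a single extra parameter captures the asymmetry described above.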


The D/S/A algorithm in learning with applications in genomics
Sandra Sinisi and Mark van der Laan
Division of Biostatistics, University of California, Berkeley

Slides (PDF)

The Deletion/Substitution/Addition (D/S/A) algorithm is a novel regression methodology based on an estimation road map proposed by van der Laan and Dudoit (2003), which results in minimax adaptive estimators. The road map for estimation includes: defining the parameter of interest as the risk minimizer for a suitable loss function; parameterizing the parameter space in terms of tensor products of basis functions; constructing candidate estimators as minimizers of the empirical risk over subspaces of the parameter space indexed by fine-tuning constants; and using cross-validation to select among candidate estimators. We have constructed the D/S/A algorithm to minimize the empirical risk over a parameter space indexed by a dimension. The D/S/A algorithm is an aggressive and flexible algorithm for generating a sequence of index sets, or subsets of features, according to three types of moves for the elements in each index set: deletions, substitutions, and additions. The algorithm is completely defined by the choice of loss function, basis functions, and sets of moves. As a result, it can be designed to handle a range of prediction problems. Here, it is used for univariate prediction and described in the context of polynomial basis functions. Finally, the method is applied to a yeast data set to identify transcription factor binding sites.
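
The three types of moves can be illustrated with a short Python sketch. This is a simplified picture rather than the authors' algorithm: index sets are represented as plain sets of integers drawn from a hypothetical pool of candidate basis functions, whereas moves on polynomial basis functions would act on the terms themselves, and the function name dsa_moves is invented for this example.

```python
def dsa_moves(current, pool):
    """Enumerate neighbors of an index set under deletion, substitution, addition.

    current: iterable of indices currently in the model.
    pool: iterable of all candidate indices (a hypothetical finite pool).
    """
    current = frozenset(current)
    outside = sorted(set(pool) - current)
    inside = sorted(current)
    # Deletion: drop one index (not offered when only one index is left,
    # a choice made for this sketch).
    deletions = [current - {i} for i in inside] if len(current) > 1 else []
    # Substitution: swap one index in the set for one outside it (size unchanged).
    substitutions = [(current - {i}) | {j} for i in inside for j in outside]
    # Addition: add one index from outside the set.
    additions = [current | {j} for j in outside]
    return {"deletion": deletions, "substitution": substitutions, "addition": additions}


moves = dsa_moves({0, 1}, range(4))
```

A search built on these moves would evaluate the empirical risk of each neighbor and keep the best index set found for each size, leaving the choice among sizes to cross-validation.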