Poster Session
Sunday, August 15th
The Application of a Single-Step Multiple Testing Procedure to Detect Genotype/Phenotype Associations
Merrill Birkner (1), Mélanie Courtine (2), Sandrine Dudoit (1), Karine Clément (2), Mark van der Laan (1), and Jean-Daniel Zucker (2)
(1) Division of Biostatistics, University of California, Berkeley
(2) LIM&BIO, Université Paris Nord & Université Paris 6 et Service de Médecine et Nutrition
Obesity is a multifactorial disease: by definition, it can be caused
or influenced by multiple genetic and environmental factors that act
through gene-gene or gene-environment interactions. In order to
assess the genetic composition of specific individuals, Single
Nucleotide Polymorphisms, or SNPs, are often studied. These
binary markers could reveal various causes of obesity and other
obesity-related phenotypes. The data for these analyses were obtained from
the ObeLinks Project (LIM&BIO). The analyses consisted of
combining unsupervised learning and multiple testing methods to analyze
composite SNP genotypes and phenotypes. Unsupervised learning methods
(Galois lattices) were applied to recode individual SNP genotypes into
multilocus composite genotypes, in order to facilitate the detection of
SNP interaction effects on the phenotype. The phenotypes that were
analyzed in this dataset included a broad set of indices often measured
when studying obesity. We used single-step multiple testing
procedures to test the significance of the association between these
composite genotypes and the various phenotypes. The procedures were
applied to control, separately, the Family-Wise Error Rate
(FWER), the generalized Family-Wise Error Rate (gFWER), and the Tail
Probability of the Proportion of False Positives (TPPFP) [Dudoit et al.
2004; van der Laan et al. 2004]. The combination of these
methods has proven to be a useful tool for assessing potential
genotype/phenotype associations related to obesity in this
population. These methods could also prove to be
useful when analyzing other genotypic/phenotypic relationships in
subsequent biological studies.
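
As a rough illustration of the testing step, the sketch below computes single-step maxT adjusted p-values and then augments the FWER rejection set to obtain gFWER and TPPFP rejection sets. It is a minimal sketch under stated assumptions, not the procedure used in the poster: the null distribution comes from permuting a binary carrier indicator (the cited work may use a different null distribution), the test statistic is a Welch t-statistic, and all names (abs_t, single_step_maxT, augmented_rejections, tppfp_extra) and the toy data are hypothetical.

```python
import random
from statistics import mean, variance


def abs_t(values, carrier):
    """Absolute Welch t-statistic comparing carriers (1) with non-carriers (0)."""
    a = [v for v, g in zip(values, carrier) if g == 1]
    b = [v for v, g in zip(values, carrier) if g == 0]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se


def single_step_maxT(phenotypes, carrier, n_perm=200, seed=0):
    """Return (observed statistics, single-step maxT adjusted p-values)."""
    rng = random.Random(seed)
    observed = [abs_t(row, carrier) for row in phenotypes]
    shuffled = list(carrier)
    exceed = [0] * len(phenotypes)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permutation null: assumption of this sketch
        max_null = max(abs_t(row, shuffled) for row in phenotypes)
        for j, t in enumerate(observed):
            if max_null >= t:
                exceed[j] += 1
    adjusted = [(count + 1) / (n_perm + 1) for count in exceed]
    return observed, adjusted


def augmented_rejections(observed, adjusted, alpha, n_extra):
    """Reject the FWER set plus the n_extra next most significant hypotheses."""
    order = sorted(range(len(observed)), key=lambda j: -observed[j])
    r0 = sum(p <= alpha for p in adjusted)
    return order[: min(len(order), r0 + n_extra)]


def tppfp_extra(r0, n_hypotheses, q):
    """Largest number a of extra rejections with a / (r0 + a) <= q."""
    a = 0
    while r0 + a < n_hypotheses and (a + 1) / (r0 + a + 1) <= q:
        a += 1
    return a


# Toy data (hypothetical): 6 phenotypes, 24 individuals, effect on phenotype 0.
rng = random.Random(1)
carrier = [1] * 12 + [0] * 12
phenotypes = [[rng.gauss(0, 1) for _ in carrier] for _ in range(6)]
phenotypes[0] = [v + 4.0 * g for v, g in zip(phenotypes[0], carrier)]

observed, adjusted = single_step_maxT(phenotypes, carrier)
fwer_set = augmented_rejections(observed, adjusted, alpha=0.05, n_extra=0)
gfwer_set = augmented_rejections(observed, adjusted, alpha=0.05, n_extra=1)
extra = tppfp_extra(len(fwer_set), len(phenotypes), q=0.5)
tppfp_set = augmented_rejections(observed, adjusted, alpha=0.05, n_extra=extra)
```

Here gfwer_set allows at most one false positive with the same nominal level, and tppfp_set allows the proportion of false positives to reach q = 0.5; both are obtained by adding hypotheses to the FWER set in order of significance.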
Software design for loss-based estimation with the D/S/A-algorithm
Kasper D. Hansen, Visiting Student Researcher
Division of Biostatistics, University of California, Berkeley
Consider the general loss-based estimation methodology proposed by van
der Laan and Dudoit (2003) for estimator construction, selection, and
performance assessment. The main ingredients of the estimation road map
are: a parameterization of the parameter space in terms of a set of
basis functions, a loss function, and a set of so-called D/S/A-moves.
In this estimation framework, the parameter of interest is defined as
the risk minimizer for the specified loss function over the parameter
space of interest. Candidate estimators are generated based on this (or
possibly another) loss function using a deletion/substitution/addition
(D/S/A) algorithm to search over the parameter space. Cross-validation
is applied to select an optimal estimator among the candidates and to
assess the overall performance of the resulting estimator. This general
estimation framework encompasses a number of problems that have
traditionally been treated separately in the statistical literature,
such as multivariate outcome prediction and density estimation based on
possibly censored data.
Applications to genomic data analysis include the prediction of
biological and clinical outcomes (possibly censored) using microarray
measures of gene expression, the identification of regulatory motifs in
DNA sequences, and genetic mapping with single nucleotide polymorphisms
(SNPs).
We address design issues related to the software implementation of this
general loss-based estimation methodology in an R package. A flexible
implementation should allow users to define and supply their own inputs
for the main components of the road map, which may be high-level
entities.
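
The design goal of user-supplied components can be illustrated with a small sketch in which the loss function and the candidate estimators are passed in as ordinary functions and cross-validation selects among them. This is only an illustration of the idea in Python, not the interface of the R package discussed here; every name (squared_error, fit_constant, fit_line, cross_validated_risk, select_estimator) and the toy data are hypothetical.

```python
import random
from statistics import mean


def squared_error(y, prediction):
    """User-supplied loss function (here the squared error loss)."""
    return (y - prediction) ** 2


def fit_constant(xs, ys):
    """Candidate estimator 1: predict the training mean everywhere."""
    level = mean(ys)
    return lambda x: level


def fit_line(xs, ys):
    """Candidate estimator 2: simple least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: y_bar + slope * (x - x_bar)


def cross_validated_risk(fit, loss, xs, ys, n_folds=5, seed=0):
    """Average held-out loss of the estimator `fit` over n_folds folds."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    losses = []
    for f in range(n_folds):
        held_out = set(idx[f::n_folds])
        train = [i for i in idx if i not in held_out]
        predictor = fit([xs[i] for i in train], [ys[i] for i in train])
        losses.extend(loss(ys[i], predictor(xs[i])) for i in held_out)
    return mean(losses)


def select_estimator(candidates, loss, xs, ys):
    """Return the name of the candidate with the smallest cross-validated risk."""
    risks = {name: cross_validated_risk(fit, loss, xs, ys)
             for name, fit in candidates.items()}
    return min(risks, key=risks.get), risks


# Toy data (hypothetical): a noisy linear relationship.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.2) for x in xs]
best, risks = select_estimator(
    {"constant": fit_constant, "line": fit_line}, squared_error, xs, ys
)
```

Because the loss and the estimators are plain arguments, a user could swap in another loss or another family of candidates without changing the selection code, which is the kind of flexibility the abstract asks for.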
Joint work with: Sandrine Dudoit and Mark van der Laan.
Related technical reports:
#124, 125, 126, 130, 135, 137, 142, 143
Division of Biostatistics, UC Berkeley
http://www.bepress.com/ucbbiostat
Related talks and posters at BIRS Workshop:
Mark van der Laan
Blythe Durbin
Sandra Sinisi
An Empirical Bayes Approach for Interval Mapping of Expression Trait Loci (ETL) Data
Christina Kendziorski and Meng Chen
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison
Genetic linkage analysis has traditionally focused on mapping loci that
affect one or more phenotypes. With microarray technology, we can
consider mapping global patterns of gene expression, treating messenger
RNA transcript abundances as quantitative traits. A number of
groups have recently applied traditional QTL methods to the problem of
mapping mRNA abundance measurements by considering each individual
transcript as a quantitative trait. However, most QTL mapping
methods were developed to address the case where a small number of
traits (oftentimes, just one) are being mapped. In expression trait
loci (ETL) mapping, thousands of traits are considered simultaneously
and the repeated application of individual tests is not the most
efficient strategy. Here we present an empirical Bayes modeling
approach to enable ETL interval mapping. The inefficiency of the single
trait method and the utility of the proposed approach are demonstrated
using simulations and data from an F2 mouse cross in a study of
diabetes.
Distribution for Gene Expression Data
Elizabeth Purdom and Susan Holmes
Department of Statistics, Stanford University
The distribution of normalized gene expressions, while similar for
different arrays, is often far from normal regardless of the
normalization method used. Rather, the distribution tends toward heavy
tails and asymmetry of varying degrees. The Asymmetric Laplace
distribution (Kotz et al., 2001) generalizes the heavy-tailed Laplace
(or double exponential) distribution to allow for asymmetry and gives a
good fit for cDNA gene expression data. Having this reasonable
parametric model for the distribution within arrays allows more
detailed examination of methods of analysis. The Asymmetric Laplace
gives a great deal of interpretability to the null distribution of gene
expression and thus allows for evaluating parameters in normalization
models. And while the distribution within an array does not give the
gene's distribution across arrays, representations of the
Asymmetric Laplace distribution suggest plausible alternatives for the
distribution of a single gene expression where sample size is small and
parametric models could give greater power.
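
For readers unfamiliar with the distribution, the sketch below evaluates an Asymmetric Laplace density and draws from it using its representation as a difference of two scaled exponential variables. The parameterization (location m, rate lam, asymmetry kappa) is one common choice and is an assumption here; it may differ from the notation in Kotz et al. (2001) and from the one used in the poster. The function names are hypothetical.

```python
import math
import random


def al_pdf(x, m=0.0, lam=1.0, kappa=1.0):
    """Asymmetric Laplace density; kappa = 1 gives the symmetric Laplace."""
    norm = lam / (kappa + 1.0 / kappa)
    if x >= m:
        return norm * math.exp(-lam * kappa * (x - m))  # right tail, rate lam*kappa
    return norm * math.exp(-(lam / kappa) * (m - x))    # left tail, rate lam/kappa


def al_sample(rng, m=0.0, lam=1.0, kappa=1.0):
    """Draw one value as a difference of two scaled standard exponentials."""
    e1 = rng.expovariate(1.0)
    e2 = rng.expovariate(1.0)
    return m + (e1 / kappa - kappa * e2) / lam


def al_mean(m=0.0, lam=1.0, kappa=1.0):
    """Mean implied by the representation used in al_sample."""
    return m + (1.0 / kappa - kappa) / lam


# Example: kappa < 1 gives a heavier right tail than left tail.
rng = random.Random(0)
draws = [al_sample(rng, m=0.0, lam=1.0, kappa=0.5) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

With kappa below 1 the right tail decays more slowly than the left, and with kappa above 1 the reverse holds, which is how a single extra parameter captures the asymmetry described above.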
The D/S/A algorithm in learning with applications in genomics
Sandra Sinisi and Mark van der Laan
Division of Biostatistics, University of California, Berkeley
The Deletion/Substitution/Addition (D/S/A) algorithm is a novel
regression methodology based on an estimation road map proposed by van
der Laan and Dudoit (2003), which results in minimax adaptive
estimators. The road map for estimation includes: defining the
parameter of interest as the risk minimizer for a suitable loss
function; parameterizing the parameter space in terms of tensor
products of basis functions; constructing candidate estimators as
minimizers of the empirical risk over subspaces of the parameter space
indexed by fine tuning constants; and using cross-validation to select
among candidate estimators. In order to minimize the empirical risk
over a parameter space indexed by a dimension, we have
constructed the D/S/A algorithm. The D/S/A algorithm is an aggressive
and flexible algorithm for generating a sequence of index sets, or
subsets of features, according to three types of moves for the elements
in each index set: deletions, substitutions, and additions. The
algorithm is completely defined by the choice of loss function, the
choice of basis functions, and the sets of moves. As a result, it can
be designed to handle a range of prediction problems. Here, it is used
for univariate prediction and described in the context of polynomial
basis functions. Finally, the method is applied to a yeast data set to
identify transcription factor binding sites.
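
The three move types can be sketched on a simplified problem in which an index set is just a subset of candidate features and the empirical risk is the mean squared residual of a least-squares fit. This is a deliberately reduced illustration, not the authors' algorithm: it uses main-effect terms only rather than tensor products of polynomial basis functions, it omits the cross-validation step that chooses the final size, and the names (least_squares_risk, dsa_moves, dsa_search) and toy data are hypothetical.

```python
import random


def least_squares_risk(columns, y):
    """Mean squared residual of the least-squares fit of y on columns plus intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    a = [[sum(r[s] * r[t] for r in rows) for t in range(p)] for s in range(p)]
    c = [sum(r[s] * yi for r, yi in zip(rows, y)) for s in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = a[r][col] / a[col][col]
            for k in range(col, p):
                a[r][k] -= f * a[col][k]
            c[r] -= f * c[col]
    beta = [0.0] * p
    for r in reversed(range(p)):  # back substitution
        beta[r] = (c[r] - sum(a[r][k] * beta[k] for k in range(r + 1, p))) / a[r][r]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)


def dsa_moves(index_set, n_features):
    """Return the deletion, substitution, and addition neighbours of an index set."""
    current = frozenset(index_set)
    outside = [k for k in range(n_features) if k not in current]
    deletions = [current - {j} for j in current]
    substitutions = [(current - {j}) | {k} for j in current for k in outside]
    additions = [current | {k} for k in outside]
    return deletions, substitutions, additions


def dsa_search(features, y, max_size=3):
    """Greedy search: try deletions, then substitutions, then additions."""
    def risk(index_set):
        return least_squares_risk([features[j] for j in sorted(index_set)], y)

    current = frozenset()
    best_by_size = {0: (risk(current), current)}
    improved = True
    while improved:
        improved = False
        for moves in dsa_moves(current, len(features)):
            moves = [s for s in moves if len(s) <= max_size]
            if not moves:
                continue
            candidate = min(moves, key=risk)
            r, size = risk(candidate), len(candidate)
            # Accept a move only if it improves the best risk seen for its size.
            if size not in best_by_size or r < best_by_size[size][0] - 1e-12:
                best_by_size[size] = (r, candidate)
                current = candidate
                improved = True
                break
    return best_by_size


# Toy data (hypothetical): the outcome depends on features 0 and 2 only.
rng = random.Random(0)
n = 60
features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(5)]
y = [2.0 * features[0][i] - 3.0 * features[2][i] + rng.gauss(0, 0.1) for i in range(n)]
best_by_size = dsa_search(features, y)
```

The result maps each model size to the best index set found and its empirical risk; in the road map described above, cross-validation would then be used to choose among these sizes.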