Invited Talks



Inference & Epistemology in High Dimensional Biology
David B. Allison
Department of Biostatistics and Department of Nutrition Sciences, University of Alabama at Birmingham

Slides (PDF)

Graphical Modeling for Many Variables, and an Application to Genetic Regulatory Subnetworks
Peter Bühlmann
Seminar für Statistik, ETH Zürich

Slides (PDF)

We consider graphical modeling for describing conditional (in-)dependencies of p random variables X_1,...,X_p. The general goal is to estimate the conditional (in-)dependence graph from data of sample size n (e.g. n i.i.d. copies of X_1,...,X_p).
                                                                               
We present two methods for estimating graphs when p is large in comparison to n: mathematically, we allow p = p_n = O(n^{r}) for any 0 < r < \infty. One method is based on the Lasso. It is computationally very efficient due to the convexity of the problem, and we show that it is asymptotically consistent for very high-dimensional but "sparse" graphs. In addition, we will briefly discuss a trade-off between optimal prediction and identifying the true model structure. The other method also works when the true graph is "non-sparse". In general, estimating such a graph seems (or is) impossible. We therefore propose to focus on a simpler concept of a graph, a so-called tri-graph. Estimation of (the simpler notion of) tri-graphs can be done by an exhaustive computation, and consistency holds for very large p (without requiring "sparseness").
                                                                              
Our motivation for high-dimensional graphical modeling comes from studying a transcriptional gene regulatory subnetwork of two biosynthesis pathways from Arabidopsis thaliana. The two pathways comprise p=40 genes and the current sample size of gene expression measurements is n=120. Some results will be presented (which still need further biological validation).
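
The Lasso-based method described above can be illustrated with a nodewise regression sketch: each variable is regressed on all the others with an L1 penalty, and an edge is drawn wherever a coefficient is nonzero. This is a minimal illustration under assumptions not stated in the abstract (a fixed penalty `lam`, the "OR" rule for combining the two regressions per pair, a plain coordinate-descent solver, and toy data); it is not the speaker's implementation.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1 (centered data)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_ss = [sum(X[i][k] ** 2 for i in range(n)) / n for k in range(p)]
    for _ in range(n_iter):
        for k in range(p):
            if col_ss[k] == 0.0:
                continue
            # Covariance of column k with the partial residual (coordinate k put back in).
            rho = sum(X[i][k] * resid[i] for i in range(n)) / n + col_ss[k] * beta[k]
            new = soft_threshold(rho, lam) / col_ss[k]
            delta = new - beta[k]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][k]
                beta[k] = new
    return beta

def neighborhood_graph(data, lam):
    """Estimate undirected edges by regressing each variable on all others."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    edges = set()
    for j in range(p):
        others = [k for k in range(p) if k != j]
        X = [[row[k] for k in others] for row in centered]
        y = [row[j] for row in centered]
        for k, b in zip(others, lasso_cd(X, y, lam)):
            if b != 0.0:
                # "OR" rule (an assumption): keep the edge if either regression selects it.
                edges.add((min(j, k), max(j, k)))
    return edges

# Toy data (hypothetical): X1 depends on X0, X2 is independent of both.
rng = random.Random(0)
data = []
for _ in range(200):
    x0 = rng.gauss(0, 1)
    data.append([x0, x0 + 0.5 * rng.gauss(0, 1), rng.gauss(0, 1)])
edges = neighborhood_graph(data, lam=0.4)
print(sorted(edges))
```

In practice the penalty would be chosen from the data rather than fixed; the trade-off between prediction-optimal tuning and recovering the true structure is the point raised in the abstract.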


Improving the Reliability of Results from Genome-Wide Studies
Shelley Bull, Lei Sun, and LongYang Wu
(1) Department of Public Health Sciences, and Samuel Lunenfeld Research Institute of Mount Sinai Hospital, University of Toronto
(2) Department of Public Health Sciences, University of Toronto
(3) Samuel Lunenfeld Research Institute, University of Toronto

Slides (PDF)

The reliability of gene detection, the accuracy of locus-specific effect estimates, and failure to replicate initial claims of linkage or association have emerged as major concerns in genome-wide studies. While multiple testing methods are useful to control genome-wide type I error, they do not address the bias introduced into genetic-effect-parameter estimates by use of strict significance criteria. Some authors have argued that valid gene(locus)-specific parameter estimates can only be obtained in an independent sample, or have suggested the strategy of sample splitting. Statistical resampling techniques such as cross-validation and the bootstrap have been successfully employed to address over-fitting and variable selection bias in prognostic prediction models in clinical settings and to obtain accurate estimates of classification error in microarray gene expression studies. We show how to tailor these techniques to effect estimation in genome-wide studies. Under a simple model, we derive analytically the upward bias of the naive estimator and the loss of efficiency due to sample splitting, and propose three simple resampling-based estimators that can be applied to the original sample in general settings. I will present results to date in which we have demonstrated bias reduction in simulation studies of 1) a homogeneous population with a single disease gene, and 2) a mixture of two populations with differing genetic loci, and outline some directions for further research.
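
The upward bias of the naive estimator, and the way sample splitting removes it at a cost in precision, can be seen in a small simulation. The sketch below assumes a normally distributed effect estimate with unit-variance observations, a one-sided z threshold, and arbitrary parameter values; all of these are illustrative assumptions, and the code does not implement the resampling-based estimators proposed in the talk.

```python
import random
import statistics

def simulate_selection_bias(true_effect=0.2, n=100, threshold=3.0,
                            n_rep=5000, seed=1):
    """Compare the naive estimate (kept only when significant) with a split-sample estimate."""
    rng = random.Random(seed)
    full_se = 1.0 / n ** 0.5          # standard error using all n observations
    half_se = 1.0 / (n / 2) ** 0.5    # standard error using one half of the sample
    naive, split = [], []
    for _ in range(n_rep):
        # Independent estimates from the two halves; their average is the full-sample estimate.
        est_a = rng.gauss(true_effect, half_se)
        est_b = rng.gauss(true_effect, half_se)
        full = (est_a + est_b) / 2.0
        if full / full_se > threshold:
            naive.append(full)        # same data used to detect and to estimate
        if est_a / half_se > threshold:
            split.append(est_b)       # detect on half A, estimate on half B
    return statistics.mean(naive), statistics.mean(split)

naive_mean, split_mean = simulate_selection_bias()
print(naive_mean, split_mean)
```

Because only significant results are reported, the naive average necessarily exceeds the detection cutoff on the effect scale, while the split-sample average stays near the true effect but rests on half the data.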


Semantic web concepts and tools for bioinformatics
Vincent J. Carey
Harvard Medical School, Channing Laboratory

Slides (PDF)

The distributed and accumulative nature of modern research in bioinformatics and computational biology engenders new and challenging requirements for control of data fragmentation and for improvement of semantic evaluability of distributed data resources.  New information models and processing paradigms have emerged in computer science research on the semantic web.  This talk will provide background on the RDF (Resource Description Framework), OWL (Web Ontology Language) and LSID (Life Science Identifier) protocols. Applications to data representation, data amalgamation, and inference in bioinformatics will be illustrated with tools from the Bioconductor project.
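
As background for the RDF data model, a statement is a (subject, predicate, object) triple, and data from separate sources can be merged when they share identifiers. The sketch below uses plain Python tuples rather than any RDF library; the LSID-style identifiers and predicate names are hypothetical, and the example is not drawn from the Bioconductor tools mentioned in the abstract.

```python
# Hypothetical identifiers in LSID style (urn:lsid:authority:namespace:object).
GENE = "urn:lsid:example.org:genes:G1"
TERM = "urn:lsid:example.org:terms:T7"

# Two independently maintained sources describing the same gene.
source_a = {(GENE, "hasSymbol", "ABC1")}
source_b = {(GENE, "annotatedWith", TERM), (TERM, "hasLabel", "transport")}

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Amalgamation is a set union because both sources use the same identifier.
merged = source_a | source_b
about_gene = match(merged, s=GENE)

# A one-step inference: follow the annotation to the label of the term.
labels = [o2 for (_, _, term) in match(merged, s=GENE, p="annotatedWith")
          for (_, _, o2) in match(merged, s=term, p="hasLabel")]
print(about_gene, labels)
```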


Optimizing Neural Network Architecture Using the Deletion/Substitution/Addition Algorithm
Blythe Durbin
Division of Biostatistics, University of California, Berkeley

Slides (PDF)

Neural networks are a popular machine learning tool, particularly in applications such as prediction of protein secondary structure (Rost and Sander, 1993; Jones, 1999). However, overfitting poses a serious obstacle to effective use of neural networks for these and other problems. Due to the huge number of parameters in a typical neural network, one may obtain a network fit which perfectly predicts the training data yet fails to generalize to other data sets. Overfitting may be avoided by altering the network topology so that some connections are removed, thus reducing the total number of parameters. In the area of secondary structure prediction, work has focused on optimizing the network architecture by hand based on subject-matter knowledge (Riis and Krogh, 1996). We propose instead a method for selecting an optimal network architecture in a data-adaptive fashion using the Deletion/Substitution/Addition algorithm introduced in Sinisi and van der Laan (2004) and Molinaro and van der Laan (2004), and present results of this approach on simulated data.

References:

Jones, D. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195--202.
                                                                               
Molinaro, A. and van der Laan, M. (2004) A Deletion/Substitution/Addition algorithm for partitioning the covariate space in prediction. Technical report, Division of Biostatistics, UC Berkeley. (In preparation).
 
Riis, S.K. and Krogh, A. (1996) Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J. Comp. Biol., 3, 163--183.
 
Rost, B. and Sander, C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584--599.
 
Sinisi, S. and van der Laan, M. (2004)  Loss-Based Cross-Validated  Deletion/Substitution/Addition Algorithms in Estimation. Technical report #143, Division of Biostatistics, UC Berkeley.
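
The three move types that give the Deletion/Substitution/Addition algorithm its name can be sketched as operations on a set of retained network connections. The code below is only an illustration of the move set and one greedy step: the connection labels and the `risk` function are hypothetical, and the published algorithm's rules for comparing candidates of different sizes and for cross-validated selection are not reproduced.

```python
def dsa_moves(current, candidates):
    """Return the deletion, substitution and addition neighbors of a connection set."""
    current = frozenset(current)
    inside = sorted(current)
    outside = sorted(c for c in candidates if c not in current)
    deletions = [current - {c} for c in inside]
    substitutions = [(current - {c}) | {d} for c in inside for d in outside]
    additions = [current | {d} for d in outside]
    return deletions, substitutions, additions

def best_neighbor(current, candidates, risk):
    """One greedy step: the neighbor with the smallest (user-supplied) risk."""
    deletions, substitutions, additions = dsa_moves(current, candidates)
    return min(deletions + substitutions + additions, key=risk)

# Hypothetical example: connections are labeled a-d and the risk is the
# distance to a target architecture {a, c} (a stand-in for a validation loss).
target = frozenset({"a", "c"})
risk = lambda s: len(s ^ target)
moves = dsa_moves({"a", "b"}, ["a", "b", "c", "d"])
step = best_neighbor({"a", "b"}, ["a", "b", "c", "d"], risk)
print([len(m) for m in moves], sorted(step))
```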


Bioconductor: An Overview and New Challenges
Robert Gentleman
Department of Biostatistics, Harvard School of Public Health

Slides (PDF)

An overview of the current state of the project, its goals and development strategy will be presented. A variety of technologies, such as microarrays, array CGH, protein-protein interaction, and protein mass spec will be covered. A number of existing challenges, needs and opportunities will be presented and discussed.


Statistics of Microarray Data for Understanding the Immune System and Cancer
Susan Holmes
Department of Statistics, Stanford University

Slides (PDF)

I will present a series of experiments in an ongoing collaboration with Peter Lee, an immunologist from Stanford Medical School. We use QT-PCR after Agilent microarrays to find genes in T-cells that have differential expression in cancer patients. We have also been studying T-cell differentiation through expression patterns.
Our studies make heavy use of multiple testing through the Bioconductor package, as well as the variance stabilization procedures initiated by B. Durbin and made available in Bioconductor by W. Huber.
                                                                              
One of the statistical findings of these searches for subtle differential expression patterns has been that the transformed data follow a Laplace distribution. In joint work with Elizabeth Purdom, we have tried to understand the consequences of this discovery.


A combination of gene expression and proteomic data in brain tissue samples of Alzheimer patients: using multiple testing and clustering to find subsets of diseased subjects with anomalous gene/protein expressions
Alan E. Hubbard, Simon Melov, David A. Bennet and Mary Lopez
Divisions of Environmental Health Sciences and Biostatistics, University of California, Berkeley

Slides (PDF)

Finding genetic markers for Alzheimer’s disease (AD) has been difficult due to the complexity of the disease and the overlap of at least its early-stage markers with normal aging. Another related complexity is that clinically normal subjects can exhibit considerable AD-like pathology, making arbitrary the criteria for distinguishing subjects with normal aging, mild cognitive impairment, or incipient AD in the absence of clinical data. In this study, we attempt to find gene expression and proteomic markers that distinguish subjects with AD from a normal pool of subjects. The data come from the Religious Order Study (ROS), which collects longitudinal data on around 900 individuals at more than 40 seminaries and nunneries. The tissues obtained are from individuals that have a complete pre-clinical and clinical history, and complete neuropathological profile. The tissue (extracted for RNA, QC’d, and amplified) for our study comes from frontal cortex collected after death. The final analysis includes a pooled-normal (control) sample and 32 patients who suffered from Alzheimer’s disease; the relative gene expressions (patient vs. pooled control) were examined using cDNA microarrays. In addition, for a subset of both the Alzheimer patients and controls, protein expression data were collected using gel-based proteomics, including large-format 2-D gels, pre-fractionation techniques, multiplexed fluorescent protein detection and orthogonal MALDI-TOF mass spectrometry.

It is at least plausible that different mechanisms will characterize the disease of different sub-groups of patients. Practically, we want to allow for markers that only characterize sub-groups of AD patients, and not necessarily the whole target population. For example, one might expect, for some genes, that gene expression will differ between normal and diseased subjects only for a subset of diseased subjects. Thus, though using the mean expression might be useful (and will work if at least a significant portion of the subjects have anomalous expression), it can also be insensitive to anomalous expression in small subgroups. We therefore propose a simple statistical approach using bootstrapping to control the various error rates of interest (e.g., family-wise error rate), an approach based on the previous work of Dudoit et al. (2004). The modest innovation comes down to the choice of the test statistic – in our case, quantiles. For example, if we wish to find those genes that have at least 25% of the subjects “significantly” differentially over-expressed, we can use a test on the 0.75 quantile of expression versus some (arbitrarily) chosen null value. Given the more complicated nature of the proteomics data, the solution is itself more complicated but based on the same basic procedure. Using this method of finding differentially expressed genes and proteins, we also examine relationships among these selected genes and proteins.
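
The quantile-based test described above can be sketched for a single gene: the 0.75 quantile of the expression values is compared with a chosen null value, and a bootstrap distribution centered at the observed quantile stands in for the null. This is a simplified single-gene illustration with made-up data and an interpolated sample quantile; it does not include the multiple-testing adjustment across genes, nor the details of the Dudoit et al. procedure.

```python
import random

def quantile(values, q):
    """Sample quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def bootstrap_quantile_pvalue(x, q=0.75, null_value=0.0, n_boot=2000, seed=0):
    """One-sided bootstrap p-value for H0: q-quantile <= null_value."""
    rng = random.Random(seed)
    observed = quantile(x, q)
    stat = observed - null_value
    exceed = 0
    for _ in range(n_boot):
        resample = [rng.choice(x) for _ in x]
        # Centering at the observed quantile makes the bootstrap mimic the null.
        if quantile(resample, q) - observed >= stat:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)

# Made-up log-ratios for 32 patients: 10 are strongly over-expressed, 22 are not.
x = [0.01 * i - 0.1 for i in range(22)] + [3.0 + 0.01 * i for i in range(10)]
p_subgroup = bootstrap_quantile_pvalue(x, q=0.75, null_value=0.0)
p_high_null = bootstrap_quantile_pvalue(x, q=0.75, null_value=10.0)
print(p_subgroup, p_high_null)
```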


Another look at an ancient problem: finding the p-value of the llr statistic for multinomial distribution
Uri Keich

Department of Computer Science, Cornell University

The subject of estimating the p-value of the log-likelihood ratio statistic for the multinomial distribution has been studied extensively in the statistical literature. The classical result on using a Chi^2 approximation breaks down when extreme values of the statistic are considered, a situation which is not uncommon in computational biology. Algorithms for numerically estimating the p-value were developed in the 80s and 90s by the computational statistics community. Unwittingly, the bioinformatics community has rediscovered them during the last few years. In this talk I will briefly review the development of the problem and present, I dare say, a new, faster and more memory-frugal way to numerically estimate the llr p-value.
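
For small samples the exact p-value of the llr statistic can be obtained by brute-force enumeration of all multinomial outcomes, which makes the gap to the chi-square approximation visible. The sketch below is that naive baseline, not the faster algorithm announced in the talk; it assumes strictly positive null probabilities, and the chi-square tail shown applies only to three categories (2 degrees of freedom).

```python
import math

def llr_statistic(counts, probs):
    """G = 2 * sum_i x_i * log(x_i / (n * p_i)), with 0 * log(0) taken as 0."""
    n = sum(counts)
    return 2.0 * sum(c * math.log(c / (n * p))
                     for c, p in zip(counts, probs) if c > 0)

def compositions(n, k):
    """All ways to distribute n observations over k categories."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_pmf(counts, probs):
    """Multinomial probability computed on the log scale."""
    log_p = math.lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        log_p += c * math.log(p) - math.lgamma(c + 1)
    return math.exp(log_p)

def exact_llr_pvalue(observed, probs, tol=1e-9):
    """P(G >= g_observed) under the null, by full enumeration."""
    g_obs = llr_statistic(observed, probs)
    return sum(multinomial_pmf(x, probs)
               for x in compositions(sum(observed), len(probs))
               if llr_statistic(x, probs) >= g_obs - tol)

def chi2_tail_2df(g):
    """Chi-square upper tail for 2 degrees of freedom (three categories only)."""
    return math.exp(-g / 2.0)

probs = (1 / 3, 1 / 3, 1 / 3)
extreme = (12, 0, 0)
print(exact_llr_pvalue(extreme, probs), chi2_tail_2df(llr_statistic(extreme, probs)))
```

Enumeration grows combinatorially with the sample size and the number of categories, which is why more economical algorithms are of interest.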


Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array  Data
Sündüz Keles

Departments of Biostatistics and Statistics, University of Wisconsin-Madison

Slides (PDF)

Cawley et al. (2004) have recently mapped the locations of binding sites for three transcription factors along human chromosomes 21 and 22 using ChIP-Chip experiments. ChIP-Chip experiments are a new approach to the genome-wide identification of transcription factor binding sites and consist of chromatin (Ch) immunoprecipitation (IP) of transcription factor-bound genomic DNA followed by high density oligonucleotide hybridization (Chip) of the IP-enriched DNA. We investigate the ChIP-Chip data structure and propose methods for inferring the location of transcription factor binding sites from these data. The proposed methods involve testing for each probe whether it is part of a bound sequence or not using a scan statistic that takes into account the spatial structure of the data. Different multiple testing procedures are considered for controlling the family-wise error rate and false discovery rate. A nested-Bonferroni adjustment, which is more powerful than the traditional Bonferroni adjustment when the test statistics are dependent, is discussed. Simulation studies show that taking into account the spatial structure of the data substantially improves the sensitivity of the multiple testing procedures. Application of the proposed methods to ChIP-Chip data for transcription factor p53 identified many potential target binding regions along human chromosomes 21 and 22. Among these identified regions, 18% fall within a 3kb vicinity of the 5'UTR of a known gene or CpG island, and 31% fall between the codon start site and the codon end site of a known gene but not inside an exon. More than half of these potential target sequences contain the p53 consensus binding site or very close matches to it. Moreover, these target segments include the 13 experimentally verified p53 binding regions of Cawley et al. (2004), as well as 49 additional regions that show higher hybridization signal than these 13 experimentally verified regions.

Joint work with: Mark J. van der Laan, Sandrine Dudoit, and Simon E. Cawley.
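
The gain from using spatial structure can be illustrated with a moving-sum scan statistic and an ordinary Bonferroni cutoff. The sketch assumes probe-level statistics that are independent standard normal under the null and a fixed window width; the abstract does not specify the scan statistic used, and the nested-Bonferroni adjustment is not implemented here.

```python
import math
from statistics import NormalDist

def scan_statistics(z, width):
    """Scaled window sums; each is N(0, 1) if the probe statistics are i.i.d. N(0, 1)."""
    return [sum(z[i:i + width]) / math.sqrt(width)
            for i in range(len(z) - width + 1)]

def bonferroni_hits(z, width, alpha=0.05):
    """Start positions of windows exceeding the one-sided Bonferroni cutoff."""
    scans = scan_statistics(z, width)
    cutoff = NormalDist().inv_cdf(1.0 - alpha / len(scans))
    return [i for i, s in enumerate(scans) if s > cutoff]

# Made-up probe statistics: a bound region covering probes 8-11, each with z = 2.5.
z = [0.0] * 20
z[8:12] = [2.5] * 4

window_hits = bonferroni_hits(z, width=4)   # windows combining neighboring probes
single_hits = bonferroni_hits(z, width=1)   # each probe tested on its own
print(window_hits, single_hits)
```

In this toy case no single probe passes its Bonferroni cutoff, while windows overlapping the bound region do.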


To Pool or Not to Pool: A Question of Microarray Experimental Design
Christina Kendziorski (1) and Rafael Irizarry (2)
(1) Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
(2) Department of Biostatistics, Johns Hopkins University
                                                                               

Statistical calculations have shown that pooling messenger RNA (mRNA) samples across subjects can reduce the effect of biological variability and in so doing reduce the number of arrays required in a microarray experiment. I will present results of an experiment designed to evaluate assumptions often made in these types of calculations. The impact of pooling on identifying differentially expressed genes will also be discussed.
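
The kind of calculation referred to above can be written down in a few lines. The sketch assumes that a pool's expression equals the average of its contributing subjects plus independent array-level (technical) error, which is exactly the sort of assumption the experiment is designed to check; the variance values are made up.

```python
def variance_of_group_mean(sigma_bio2, sigma_tech2, n_arrays, subjects_per_pool):
    """Variance of the estimated group mean for one gene.

    Biological variance is averaged over all subjects (n_arrays * subjects_per_pool),
    technical variance only over the arrays.
    """
    return (sigma_bio2 / (n_arrays * subjects_per_pool)
            + sigma_tech2 / n_arrays)

# Hypothetical variances: biological 4.0, technical 1.0.
unpooled = variance_of_group_mean(4.0, 1.0, n_arrays=12, subjects_per_pool=1)
pooled = variance_of_group_mean(4.0, 1.0, n_arrays=6, subjects_per_pool=3)
print(unpooled, pooled)
```

Under these assumptions, 6 arrays of 3-subject pools give a smaller variance than 12 single-subject arrays, at the cost of using 18 subjects instead of 12.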


Using GeneOntology structure and annotations in microarray analysis
Rafal Kustra
Department of Public Health Sciences, University of Toronto

I will talk about a multivariate modelling approach, mR3, which we developed for exploratory analysis of very high-dimensional experimental data. mR3 (Multivariate, Reduced-Rank and Regularized) is an extension of classical multivariate tools and provides easily interpretable results. It also comes with a fast algorithm for estimation and bootstrap inference. After describing the methodology I will suggest a general way to obtain better estimates from microarray analysis by incorporating the information contained in the GO structure, and connect it to mR3. This approach can also be extended to many other analytic domains currently used for microarrays, including classification, prediction and clustering.


Loss Based Cross-Validated Deletion/Substitution/Addition Algorithms in Learning: Applications in Genomics and Epidemiology
Mark J. van der Laan
Division of Biostatistics and Department of Statistics, University of California, Berkeley

Current applications in genomics and epidemiology concern high dimensional (and, possibly, time-dependent) data structures, and the questions of interest typically correspond to high dimensional parameters. In such problems it is typically not possible to pose, a priori, a model allowing estimation at a parametric rate. We will present a general loss-based estimation procedure, which is grounded in theory (e.g., minimax adaptive), and generalizes existing estimation problems. An application of this general methodology yields data-adaptive algorithms for conditional mean estimation and conditional hazard/density estimation based on censored and uncensored data. We apply the method to detect binding sites in yeast gene expression experiments, regress replication capacity of HIV on the sequence of the virus in a sample of HIV-infected patients, and regress phenotypes for obesity on SNP-profiles. We also discuss potential approaches for generic software development in R in the hope of obtaining some input from the audience.
                                                                               
Joint work with: Sandrine Dudoit,  Sandra Sinisi, Annette Molinaro, Sündüz Keles, and Merrill Birkner


Efficient Algorithms for Bayesian Approaches to Sequence Alignment
Lior Pachter

Department of Mathematics, University of California, Berkeley

Slides (PDF)

The problem of sequence alignment remains a fundamental first step in utilizing multiple genomes for identifying functional elements. Several probabilistic models have been proposed for performing alignments (including the popular pair hidden Markov model); however, the inference algorithms for biologically realistic models tend to be impractical for the scale of problems encountered today. We will discuss several ideas for efficient sequence alignment within a probabilistic framework that can be applied to whole genome alignment.

Joint work with: Marina Alexandersson, Simon Cawley, Fumei Lam and Bernd Sturmfels.


Multiple hypothesis testing for high dimensional biological data
Katherine S. Pollard, PhD
University of California, Santa Cruz

Slides (PDF)
 
Multiple hypothesis testing problems arise in analysis of high dimensional genomic data whenever one wishes to perform statistical tests for each of many genes or genomic regions. Identifying differentially expressed genes from microarray experiments is a typical example. I will first present a statistical framework for multiple hypothesis testing. From this perspective, I will review current multiple testing procedures, including choices of error rate, test statistics, null distribution and error control. Finally, I will present recent results regarding a general characterization of the null distribution for multiple testing that asymptotically controls type I error rates without conditions such as subset pivotality. This characterization is novel because it directly utilizes the distribution of the test statistics rather than a data null distribution. I will illustrate how one can use a simple bootstrap estimator of this test statistic null distribution in single-step and step-down methods to control type I error rates. These methods will be implemented in the next release of the Bioconductor multtest package.
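
A bootstrap estimate of a test-statistic null distribution, used in a single-step maxT procedure, can be sketched as follows. The code bootstraps one-sample t-statistics from the raw data, then centers them at zero and shrinks their variance to at most one before taking maxima. The centering and scaling constants, the choice of t-statistic, and the toy data are assumptions made for illustration; this is not the multtest implementation.

```python
import math
import random

def t_statistics(rows):
    """One-sample t-statistic per gene; rows[g] holds the samples for gene g."""
    stats = []
    for r in rows:
        n = len(r)
        mean = sum(r) / n
        var = sum((v - mean) ** 2 for v in r) / (n - 1)
        stats.append(mean / math.sqrt(var / n) if var > 0 else 0.0)
    return stats

def maxt_adjusted_pvalues(rows, n_boot=500, seed=0):
    """Single-step maxT adjusted p-values from a shifted and scaled bootstrap null."""
    rng = random.Random(seed)
    n, m = len(rows[0]), len(rows)
    observed = [abs(t) for t in t_statistics(rows)]
    # Bootstrap the test statistics themselves, resampling subjects (columns).
    boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot.append(t_statistics([[r[j] for j in idx] for r in rows]))
    # Shift each gene's bootstrap statistics to mean 0 and cap the variance at 1.
    null = [[0.0] * m for _ in range(n_boot)]
    for g in range(m):
        col = [b[g] for b in boot]
        mean = sum(col) / n_boot
        var = sum((v - mean) ** 2 for v in col) / (n_boot - 1)
        scale = math.sqrt(min(1.0, 1.0 / var)) if var > 0 else 1.0
        for b in range(n_boot):
            null[b][g] = scale * (boot[b][g] - mean)
    max_abs = [max(abs(z) for z in row) for row in null]
    return [(sum(mx >= obs for mx in max_abs) + 1) / (n_boot + 1)
            for obs in observed]

# Toy data: gene 0 has a true mean of 3, the other four genes have mean 0.
rng = random.Random(1)
rows = [[rng.gauss(3.0 if g == 0 else 0.0, 1.0) for _ in range(20)]
        for g in range(5)]
adjusted = maxt_adjusted_pvalues(rows)
print(adjusted)
```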


Variability and Data Transformation for Gene Expression, Proteomics, and Metabolomics Data
David M. Rocke
Division of Biostatistics (School of Medicine), Department of Applied Science (College of Engineering), and Institute for Data Analysis and Visualization
University of California, Davis

Slides (PDF)
                                                                               
Biologists now have the capacity to measure thousands of compounds simultaneously from a single biological sample using gene expression arrays, mass spectrometry, NMR spectroscopy or other methods. These methods can be used to measure mRNA transcripts, proteins, short peptides, lipids, and other biologically active compounds. In this talk, I will describe an important statistical challenge in the use of such data. Using raw data, logarithms, or ratios, the variability of the measurements is strongly dependent on the level of expression, causing a failure of the assumptions of most standard methods of statistical analysis.  We present a solution to this problem via a specially tuned data transformation and show how it promotes the effectiveness of simple and sophisticated analyses of the data.
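
The abstract does not name the transformation, so the following is only one plausible instance: a generalized-logarithm transform under an assumed two-component error model in which the variance of a measurement with mean mu is approximately mu^2 * s_eta2 + sigma_eps2. A delta-method calculation shows why such a transform stabilizes the variance where the ordinary logarithm does not; all parameter values are made up.

```python
import math

def glog(y, c):
    """Generalized log: behaves like log(2y) for large y, stays finite near zero."""
    return math.log(y + math.sqrt(y * y + c))

def measurement_variance(mu, s_eta2, sigma_eps2):
    """Assumed two-component model: multiplicative plus additive error."""
    return mu * mu * s_eta2 + sigma_eps2

def variance_after_log(mu, s_eta2, sigma_eps2):
    """Delta method: (d/dmu log mu)^2 * Var(y)."""
    return measurement_variance(mu, s_eta2, sigma_eps2) / (mu * mu)

def variance_after_glog(mu, s_eta2, sigma_eps2, c):
    """Delta method: (d/dmu glog(mu))^2 * Var(y), with slope 1 / sqrt(mu^2 + c)."""
    return measurement_variance(mu, s_eta2, sigma_eps2) / (mu * mu + c)

s_eta2, sigma_eps2 = 0.04, 25.0
c = sigma_eps2 / s_eta2   # tuning constant that makes the approximate variance constant
for mu in (1.0, 10.0, 1000.0):
    print(mu,
          variance_after_log(mu, s_eta2, sigma_eps2),
          variance_after_glog(mu, s_eta2, sigma_eps2, c))
```

With this choice of c the approximate variance after the transform equals s_eta2 at every expression level, whereas after a plain logarithm it grows sharply as the mean approaches zero.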


Logic Regression in SNP association studies
Ingo Ruczinski
Department of Biostatistics, Johns Hopkins University

Slides (PDF)

Logic Regression was recently introduced as a novel classification and regression method, particularly useful in SNP association studies. This adaptive methodology is based on new predictors being generated as Boolean combinations from binary covariates, and hence models with high order interactions can be explored.  We present the methodology, show some case studies, and discuss statistical issues such as model selection, missing data, variable importance, and study design.
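
The idea of building predictors as Boolean combinations of binary covariates can be sketched with a small expression tree. The tree encoding, the example rule, and the toy SNP data below are hypothetical, and the search over trees that the methodology relies on is not shown; the sketch only evaluates and scores one fixed rule.

```python
def evaluate(tree, x):
    """Evaluate a logic tree on one subject's binary covariates."""
    op = tree[0]
    if op == "var":
        return bool(x[tree[1]])
    if op == "not":
        return not evaluate(tree[1], x)
    if op == "and":
        return evaluate(tree[1], x) and evaluate(tree[2], x)
    if op == "or":
        return evaluate(tree[1], x) or evaluate(tree[2], x)
    raise ValueError("unknown operator: %r" % (op,))

def misclassification_rate(tree, X, y):
    """Fraction of subjects whose status differs from the tree's prediction."""
    return sum(evaluate(tree, x) != bool(label) for x, label in zip(X, y)) / len(y)

# Hypothetical rule: (SNP0 and not SNP1) or SNP2, a second-order interaction plus a main effect.
rule = ("or",
        ("and", ("var", 0), ("not", ("var", 1))),
        ("var", 2))

# Toy binary SNP codings and disease status for six subjects.
X = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 0, 0), (1, 1, 1), (0, 1, 0)]
y = [1, 0, 1, 0, 1, 1]
predictions = [evaluate(rule, x) for x in X]
error = misclassification_rate(rule, X, y)
print(predictions, error)
```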