Statistical Methods and Software for the
Analysis of Microarray Experiments
Nicholas P. Jewell and Sandrine Dudoit
Division of Biostatistics, UC Berkeley

Genomics, Proteomics, and Bioinformatics
Mathematical Biosciences Institute
Ohio State University, Columbus, OH
September 20--24, 2004



Acknowledgments
Abstract
Schedule: lecture notes and computer labs
Software
Links and References


Announcements

25/09/2004: Added references and links for Lecture and Lab 5.
25/09/2004: Added example of .tex file for Lab 4.


Acknowledgments

Many thanks to the following collaborators:
Katie Pollard, for making the development versions of hopach and multtest available for this course.
Mark van der Laan, for the work on cluster analysis and multiple hypothesis testing.
Bioconductor Core Developers, Ben Bolstad, Vince Carey, Robert Gentleman, Wolfgang Huber, Rafael Irizarry, Yee Hwa (Jean) Yang, for their contributions to the development of the course materials.


Abstract

DNA microarray and other high-throughput genomic experiments generate complex, high-dimensional datasets of multiple types. Extracting meaningful and reliable biological information from the analysis of these data presents new statistical and computational challenges. This tutorial will discuss statistical design and inference methods for microarray experiments. Topics to be covered include: pre-processing (image analysis and normalization); multiple testing procedures for the identification of differentially expressed genes; hierarchical and partitioning cluster analysis; prediction; and model selection. We will also consider the joint analysis of microarray data with biological metadata such as Gene Ontology (GO) annotation (www.geneontology.org). The statistical methods to be discussed apply to a broad range of problems beyond the analysis of microarray data, such as the genetic mapping of complex traits using single nucleotide polymorphisms (SNPs) and the identification of transcription factor binding sites in ChIP-chip experiments. Computer lab sessions will allow participants to explore statistical software resources for the analysis of genomic data, with emphasis on R (www.r-project.org) packages developed as part of the Bioconductor Project (www.bioconductor.org).


Schedule: lecture notes and computer labs

Day 1
Monday, September 20
Lecture 1: (Sandrine Dudoit)
Basic genome biology [PDF]
Microarray technologies [PDF]
Lab 1:
Introduction to Bioconductor and R [PDF]
Day 2
Tuesday, September 21
Lecture 2: (Sandrine Dudoit)
Pre-processing: image analysis and normalization [PDF]
Lab 2:
Pre-processing
Day 3
Wednesday, September 22
Lecture 3: (Nick Jewell)
Cluster analysis  [PDF]
Alzheimer study [PDF slides]
Lab 3:
Cluster analysis
Day 4
Thursday, September 23
Lecture 4: (Nick Jewell)
Identification of differentially expressed genes (multiple testing) [PDF]
Lab 4:
Differential gene expression
Day 5
Friday, September 24
Lecture 5: (Nick Jewell)
Experimental design [Kolle Kolle PDF] [References]

Lab 5: (Sandrine Dudoit)
Odds and ends
End of short course.

Thank you!
Lectures
09:00--10:00
10:30--11:30
Computer labs
14:00--15:00

Software

R, Version 1.9.1
from CRAN
Linux/Unix: sources and packages for Debian, Mandrake, RedHat/Fedora, Suse, Vine.
Windows precompiled binaries.

Additional CRAN packages: ellipse.
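
As a minimal sketch, the additional CRAN package can usually be installed from within R itself. This assumes a working internet connection and write access to the R library directory; it is not part of the original course instructions.

```r
# Install the ellipse package from CRAN (needs network access)
install.packages("ellipse")

# Load the package to confirm that the installation succeeded
library(ellipse)
```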

Bioconductor Packages, Version 1.4
from Bioconductor Project website

Use the getBioC install script. At the R prompt:
source("http://www.bioconductor.org/getBioC.R")
getBioC()

Additional Bioconductor packages:
Analysis (Release 1.4): hexbin.
Experimental data: ALL, golubEsets.
Annotation metadata: hgu95av2, hu6800.

Development version of R package hopach -- Please DO NOT distribute
Windows
Linux/Unix

Development version of R package multtest -- Please DO NOT distribute
Windows
Linux/Unix

Please consult package vignettes and help files for documentation on methodology, software, and references.
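
One possible way to reach this documentation from the R prompt is sketched below. It assumes the multtest package is already installed and uses it only as an example; the same calls should work for any other installed package, and vignette() may not be available in older R versions.

```r
# Assumes multtest is installed; substitute any other package name
library(multtest)

# Index of the help topics provided by the package
help(package = "multtest")

# List the vignettes shipped with the package
vignette(package = "multtest")
```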
Labs

Lab 2
Pre-processing two-color spotted microarray data:
[.pdf] [.Rnw] [.R]
Pre-processing Affymetrix oligonucleotide chip data:
[.pdf] [.Rnw] [.R]

Lab 3
Cluster analysis:
[.pdf] [.Rnw] [.R]

Development version of R package hopach -- Please DO NOT distribute
Windows
Linux/Unix

Lab 4
Differential gene expression:
[.pdf] [.Rnw] [.R] [.tex] [.bib]

Development version of R package multtest -- Please DO NOT distribute
Windows
Linux/Unix

Lab 5
Odds and ends:
Sweave:
[WWW Leisch] [PDF Lect 1]
Creating an R package:
[PDF R Manual] [PDF Lect 1]
OOP:
[J. M. Chambers Tutorial] [PDF Lect 1]
Annotation:
[PDF Lect 1]
MSP normalization control:
[PDF Lect 2] [Article]


Links and References

Supplements for this Short Course

UC Berkeley Division of Biostatistics Working Papers Series
Bioconductor Project Working Papers Series
Statistical Applications in Genetics and Molecular Biology
Bioconductor Short Courses

Additional Lecture Notes and Labs

Kolle Kolle short course
EMBO short course

Pre-processing Two-Color Spotted Microarray Data
  • Y. H. Yang, M. J. Buckley, S. Dudoit, and T. P. Speed (2002). Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics, Vol. 11, No. 1, p. 108-136. (Tech report #584).
  • S. Dudoit and Y. H. Yang (2003). Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data. In G. Parmigiani, E. S. Garrett, R. A. Irizarry and S. L. Zeger, editors, The Analysis of Gene Expression Data: Methods and Software, Springer, New York, p. 73-101.
    (Table of contents) (Rnw file) (PDF file).
  • Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, Vol. 30, No. 4, e15. (Journal website).
  • Y. H. Yang, S. Dudoit, P. Luu, and T. P. Speed (2001). Normalization for cDNA microarray data. In M. L. Bittner, Y. Chen, A. N. Dorsel, and E. R. Dougherty (eds), Microarrays: Optical Technologies and Informatics, Vol. 4266 of Proceedings of SPIE, p. 141-152. (Tech report #589).
  • Yee Hwa (Jean) Yang's webpage.
Pre-Processing Affymetrix Oligonucleotide Chip Data
Ben Bolstad's webpage: thesis, talks, papers, software, image gallery.
Speed Berkeley Research Group
Affymetrix website.
 
Experimental Design
Gary Churchill's Statistical Genetics Group
Speed Berkeley Research Group
Yee Hwa (Jean) Yang's webpage.
 
Multiple Testing
  • S. Dudoit, M. J. van der Laan, and K. S. Pollard (2004). Multiple testing. Part I. Single-step procedures for control of general Type I error rates. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 13. (Journal website) (Tech report #138).
  • M. J. van der Laan, S. Dudoit, and K. S. Pollard (2004). Multiple testing. Part II. Step-down procedures for control of the family-wise error rate. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 14. (Journal website)  (Tech report #139).
  • M. J. van der Laan, S. Dudoit, and K. S. Pollard (2004). Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives.  Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 15. (Journal website) (Tech report #141).
  • K. S. Pollard and M. J. van der Laan (2003). Resampling-based Multiple Testing: Asymptotic Control of Type I Error and Applications to Gene Expression Data. (Tech report #121).
Cluster Analysis

More articles available on Mark van der Laan's website.
  • M. J. van der Laan and K. S. Pollard (2003). Hybrid clustering of gene expression data with visualization and the bootstrap. Journal of Statistical Planning and Inference,  Vol. 117, p. 275--303.  (PDF).
  • K. S. Pollard and M. J. van der Laan (2002). A method to identify significant clusters in gene expression data. Proceedings of SCI 2002, Vol. II,  p. 318--325. (Tech report #107).
  • M. J.  van der Laan, K. S. Pollard, and J. Bryan (2003). A New Partitioning Around Medoids Algorithm. Journal of Statistical Computation and Simulation, Vol. 73, No. 8, p. 575--584. (Tech report #105).
  • M. J. van der Laan and J. Bryan (2001). Gene Expression Analysis with the Parametric Bootstrap. Biostatistics, Vol. 2, No. 4, p. 445-461. (PDF preprint).
  • L. Kaufman and P. J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience. [Details on PAM, AGNES, CLARA, silhouette widths.]
  • J. H. Friedman and L. L. Meulman (2004). Clustering Objects on Subsets of Attributes (COSA).
Statistical Computing
R Project
Comprehensive R Archive Network
R News
Building R for Windows
Omega Project for Statistical Computing
Bioconductor Project
Distributed Statistical Computing 2003
Phil Spector's Homepage

Miscellaneous Software
CVS: Concurrent Versions System.
Cygwin: a Linux-like environment for Windows.
Emacs: A text editor and more.
ESS: Emacs Speaks Statistics.
MiKTeX: TeX implementation for Windows.
Other software links from CRAN.

Visualization
GGobi
William S. Cleveland
Ross Ihaka
Paul Murrell
The Work of Edward Tufte and Graphics Press
Introductory Biology
Access Excellence
Human Genome Project Information
Mendel Web
About NCBI: follow links to A Science Primer, Outreach and Education.
Robert J. Huskey, Biology, U. of Virginia

Microarray Experiments
Bioconductor Short Courses
DNA microarray methodology animation
Microarray Gene Expression Data (MGED) Society
Rockefeller University Microarray Data Analysis Bibliography
Stanford Genomic Resources
Whitehead Institute, Cancer Genomics
A guidebook for DNA Microarray Data Analysis, Finnish IT Center for Science.
Genetics and Genomics Timeline, Genome News Network.