Announcements
25/09/2004: Added references and links for Lecture and Lab 5.
25/09/2004: Added example of .tex file for Lab 4.
Acknowledgments
Many thanks to the following collaborators:
Katie Pollard, for making the development versions of hopach and multtest available for this course.
Mark van der Laan, for the work on cluster analysis and multiple hypothesis testing.
Bioconductor Core Developers, Ben Bolstad, Vince Carey, Robert Gentleman, Wolfgang Huber, Rafael Irizarry, Yee Hwa (Jean) Yang, for their contributions to the development of the course materials.
Abstract
DNA
microarray and other high-throughput genomic experiments generate
complex, high-dimensional datasets of multiple types. Extracting
meaningful and reliable biological information from the analysis of
these data presents new statistical and computational challenges. This
tutorial will discuss statistical design and inference methods for
microarray experiments. Topics to be covered include: pre-processing
(image analysis and normalization); multiple testing procedures for the
identification of differentially expressed genes; hierarchical and
partitioning cluster analysis; prediction; and model selection. We will
also consider the joint analysis of microarray data with biological
metadata such as Gene Ontology (GO) annotation (www.geneontology.org). The
statistical methods to be discussed apply to a broad range of problems
beyond the analysis of microarray data, such as the genetic mapping of
complex traits using single nucleotide polymorphisms (SNPs) and the
identification of transcription factor binding sites in ChIP-chip
experiments. Computer lab sessions will allow participants to explore
statistical software resources for the analysis of genomic data, with
emphasis on R (www.r-project.org)
packages developed as part of the Bioconductor Project (www.bioconductor.org).
Y. H. Yang, M. J. Buckley, S. Dudoit, and T. P. Speed (2002). Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics, Vol. 11, No. 1, p. 108-136. (Tech report #584).
S. Dudoit and Y. H. Yang (2003). Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data. In G. Parmigiani, E. S. Garrett, R. A. Irizarry and S. L. Zeger, editors, The Analysis of Gene Expression Data: Methods and Software, Springer, New York, p. 73-101. (Table of contents) (Rnw file) (PDF file).
Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, Vol. 30, No. 4, e15. (Journal website).
Y. H. Yang, S. Dudoit, P. Luu, and T. P. Speed (2001). Normalization for cDNA microarray data. In M. L. Bittner, Y. Chen, A. N. Dorsel, and E. R. Dougherty (eds), Microarrays: Optical Technologies and Informatics, Vol. 4266 of Proceedings of SPIE, p. 141-152. (Tech report #589).
S. Dudoit, M. J. van der Laan, and K. S. Pollard (2004). Multiple testing. Part I. Single-step procedures for control of general Type I error rates. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 13. (Journal website) (Tech report #138).
M. J. van der Laan, S. Dudoit, and K. S. Pollard (2004). Multiple testing. Part II. Step-down procedures for control of the family-wise error rate. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 14. (Journal website) (Tech report #139).
M. J. van der Laan, S. Dudoit, and K. S. Pollard (2004). Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 15. (Journal website) (Tech report #141).
K. S. Pollard and M. J. van der Laan (2003). Resampling-based Multiple Testing: Asymptotic Control of Type I Error and Applications to Gene Expression Data. (Tech report #121).
M. J. van der Laan and K. S. Pollard (2003). Hybrid clustering of gene expression data with visualization and the bootstrap. Journal of Statistical Planning and Inference, Vol. 117, p. 275-303. (PDF).
K. S. Pollard and M. J. van der Laan (2002). A method to identify significant clusters in gene expression data. Proceedings of SCI 2002, Vol. II, p. 318-325. (Tech report #107).
M. J. van der Laan, K. S. Pollard, and J. Bryan (2003). A New Partitioning Around Medoids Algorithm. Journal of Statistical Computation and Simulation, Vol. 73, No. 8, p. 575-584. (Tech report #105).
M. J. van der Laan and J. Bryan (2001). Gene Expression Analysis with the Parametric Bootstrap. Biostatistics, Vol. 2, No. 4, p. 445-461. (PDF preprint).
L. Kaufman and P. J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience. [Details on PAM, AGNES, CLARA, silhouette widths.]
J. H. Friedman and L. L. Meulman (2004). Clustering Objects on Subsets of Attributes (COSA).
Cygwin: a Linux-like environment for Windows.
Emacs: a text editor and more.
ESS: Emacs Speaks Statistics.
MiKTeX: TeX implementation for Windows.
Other software links from CRAN.