Test statistics null distributions in multiple testing: Simulation studies and applications to genomics
K. S. Pollard, M. D. Birkner, M. J. van der Laan, and S. Dudoit (2005).
Special double issue "Statistique et Biopuces", Journal de la Société Française de Statistique, Vol. 146, No. 1-2, p. 77-115.

Web companion

Abstract [English]
Résumé [Français]
Full article [PDF]

Tables and figures for miRNA data analysis
Bioconductor R package multtest
Created by Sandrine Dudoit

Abstract.

Multiple hypothesis testing problems arise frequently in biomedical and genomic research, for instance, when identifying differentially expressed and co-expressed genes in microarray experiments. We have developed generally applicable resampling-based single-step and stepwise multiple testing procedures (MTPs) for controlling a broad class of Type I error rates, defined as tail probabilities and expected values for arbitrary functions of the numbers of false positives and rejected null hypotheses. A key feature of the methodology is the general characterization and explicit construction of a test statistics null distribution (rather than a data generating null distribution), which provides Type I error control in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses defined in terms of submodels, and test statistics.

This article presents simulation studies comparing test statistics null distributions in two testing scenarios of great relevance to biomedical and genomic data analysis: tests for regression coefficients in linear models where covariates and error terms are allowed to be dependent, and tests for correlation coefficients. The simulation studies demonstrate that the choice of null distribution can have a substantial impact on the Type I error properties of a given multiple testing procedure. Procedures based on our proposed non-parametric bootstrap test statistics null distribution typically control the Type I error rate "on target" at the nominal level, while comparable procedures, based on parameter-specific bootstrap data generating null distributions, can be severely anti-conservative or conservative. The analysis of microRNA expression data from cancerous and non-cancerous tissues (Lu et al., 2005), using tests for logistic regression coefficients and correlation coefficients, illustrates the flexibility and power of our proposed methodology.
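
As a rough illustration of the idea summarized above, the following Python sketch builds a non-parametric bootstrap test statistics null distribution and uses it to compute single-step maxT adjusted p-values. It is not the multtest implementation. The function names, the use of Welch two-sample t-statistics, and the null values used for centering and scaling (mean 0, variance at most 1) are assumptions made for this example.

```python
import numpy as np


def t_statistics(x, y):
    """Welch two-sample t-statistics, one per column of x (illustrative choice).

    x: (n, J) data matrix; y: length-n integer array of 0/1 group labels.
    """
    a, b = x[y == 1], x[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    with np.errstate(divide="ignore", invalid="ignore"):
        return (a.mean(axis=0) - b.mean(axis=0)) / se


def bootstrap_null(x, y, n_boot=500, seed=0):
    """Bootstrap test statistics, centered and scaled to an assumed null.

    Rows of (x, y) are resampled jointly, so the dependence among the
    columns of x is preserved. Assumed null values: mean 0, variance <= 1.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        yb = y[idx]
        if yb.sum() < 2 or (yb == 0).sum() < 2:
            continue  # need at least two observations per group
        t = t_statistics(x[idx], yb)
        if np.all(np.isfinite(t)):
            stats.append(t)
    z = np.asarray(stats)
    z = z - z.mean(axis=0)              # center at the assumed null value 0
    sd = z.std(axis=0, ddof=1)
    return z / np.maximum(sd, 1.0)      # shrink only columns with variance > 1


def single_step_maxT(t_obs, z_null):
    """Adjusted p-values: share of bootstrap maxima at least as large as |t|."""
    max_null = np.abs(z_null).max(axis=1)            # one maximum per resample
    return (max_null[:, None] >= np.abs(t_obs)[None, :]).mean(axis=0)
```

Because each observed statistic is compared with the same distribution of maxima, the adjusted p-values are monotone in the absolute test statistics, which is why a table sorted by |T_n(j)| also has non-decreasing adjusted p-values.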


Résumé.

Les tests d'hypothèses multiples sont fréquemment utilisés dans le domaine de la recherche biomédicale et génomique, en l'occurrence, pour l'identification de gènes différentiellement exprimés et co-exprimés à partir des données issues de puces à ADN. Nous avons développé des procédures de tests multiples avec ré-échantillonnage, à pas simple mais aussi pas à pas, pour contrôler une vaste classe de taux d'erreurs de première espèce, définis par des probabilités de queues de distributions et des espérances de fonctions arbitraires du nombre de faux positifs et du nombre total d'hypothèses nulles rejetées. Parmi les contributions fondamentales de notre méthodologie, notons la caractérisation générale et la construction explicite d'une distribution nulle pour les statistiques de test (plutôt qu'une distribution génératrice de données nulle). Cette distribution garantit le contrôle du taux d'erreurs de première espèce pour des problèmes de tests multiples pour des lois génératrices de données présentant une structure de dépendance quelconque, des hypothèses nulles définies de manière générale en terme de sous-modèles, et des statistiques de test arbitraires.

Cet article présente des études par simulation pour la comparaison de distributions nulles des statistiques de test, sous deux scénarios particulièrement pertinents à l'analyse de données biomédicales et génomiques: les tests sur les coefficients de régression pour des modèles linéaires dans le cas où les covariables et les erreurs peuvent être dépendantes et les tests sur les coefficients de corrélation. Les études par simulation démontrent que le choix d'une distribution nulle peut considérablement influer les taux d'erreurs de première espèce d'une procédure donnée de tests multiples. Les procédures fondées sur notre distribution nulle bootstrap non-paramétrique pour les statistiques de test contrôlent le taux d'erreurs de première espèce au niveau nominal, alors que des procédures comparables, fondées sur des distributions bootstrap paramétriques nulles pour les données, peuvent être très anti-conservatrices ou conservatrices. L'analyse de données sur l'expression de microARN dans des tissus cancéreux et non-cancéreux (Lu et al., 2005), par tests pour coefficients de régression logistique et coefficients de corrélation, illustre la flexibilité et la puissance de notre méthodologie.


miRNA data analysis.

Lu et al. (2005): Web companion.

More information on the identified miRNAs may be obtained from miRBase.


Supplementary Table 2. miRNA data analysis: Tests for logistic regression coefficients.
The table reports the names, target sequences, adjusted p-values, and test statistics, for the 53 miRNAs most significantly differentially expressed between cancerous and non-cancerous tissues, according to bootstrap-based single-step maxT Procedure 3. miRNAs are sorted in decreasing order of their absolute test statistics T_n(j). All 53 miRNAs have adjusted p-values less than 0.01 and negative test statistics, indicating under-expression in cancerous compared to non-cancerous tissues. The target sequence is the reverse complement of the miRNA sequence and identifies potential binding sites for the miRNA.
Located in minimal deleted regions, minimal amplified regions, and breakpoint regions involved in human cancers (Calin et al., 2004).


Name                 miRNA target sequence     Adjusted p-value  Test statistic
hsa-miR-98           UGAGGUAGUAAGUUGUAUUGUU    0.0038            -4.88
hsa-miR-28           AAGGAGCUCACAGUCUAUUGAG    0.0038            -4.79
hsa-miR-196          UAGGUAGUUUCAUGUUGUUGG     0.0038            -4.79
hsa-miR-30a          CUUUCAGUCGGAUGUUUGCAGC    0.0038            -4.78
hsa-miR-30e          UGUAAACAUCCUUGACUGGA      0.0038            -4.78
hsa-miR-99a          AACCCGUAGAUCCGAUCUUGUG    0.0038            -4.77
hsa-miR-335          UCAAGAGCAAUAACGAAAAAUGU   0.0038            -4.72
hsa-let-7e           UGAGGUAGGAGGUUGUAUAGU     0.0038            -4.69
hsa-miR-23b          AUCACAUUGCCAGGGAUUACCAC   0.0038            -4.67
hsa-miR-214          ACAGCAGGCACAGACAGGCAG     0.0038            -4.67
hsa-miR-99b          CACCCGUAGAACCGACCUUGCG    0.0038            -4.67
hsa-miR-30c          UGUAAACAUCCUACACUCUCAGC   0.0038            -4.66
hsa-miR-30b          UGUAAACAUCCUACACUCAGC     0.0038            -4.66
hsa-miR-338          UCCAGCAUCAGUGAUUUUGUUGA   0.0038            -4.65
hsa-miR-103          AGCAGCAUUGUACAGGGCUAUGA   0.0038            -4.64
hsa-miR-185          UGGAGAGAAAGGCAGUUC        0.0038            -4.63
hsa-miR-151*         UCGAGGAGCUCACAGUCUAGUA    0.0038            -4.62
hsa-miR-100          AACCCGUAGAUCCGAACUUGUG    0.0038            -4.61
hsa-miR-20_(sub_1)   UAAAGUGCUUAUAGUGCAGGUAG   0.0038            -4.61
hsa-miR-129*         AAGCCCUUACCCCAAAAAGCAU    0.0038            -4.60
hsa-miR-22           AAGCUGCCAGUUGAAGAACUGU    0.0038            -4.60
hsa-let-7d           AGAGGUAGUAGGUUGCAUAGU     0.0038            -4.58
hsa-miR-107          AGCAGCAUUGUACAGGGCUAUCA   0.0038            -4.58
rno-miR-352          AGAGUAGUAGGUUGCAUAGUA     0.0038            -4.58
hsa-miR-197          UUCACCACCUUCUCCACCCAGC    0.0038            -4.57
hsa-miR-32           UAUUGCACAUUACUAAGUUGC     0.0038            -4.57
hsa-miR-342          UCUCACACAGAAAUCGCACCCGUC  0.0038            -4.56
hsa-miR-324-5p       CGCAUCCCCUAGGGCAUUGGUGU   0.0038            -4.51
hsa-miR-128b         UCACAGUGAACCGGUCUCUUUC    0.0038            -4.51
hsa-miR-126*         CAUUAUUACUUUUGGUACGCG     0.0038            -4.50
hsa-miR-19b          UGUGCAAAUCCAUGCAAAACUGA   0.0038            -4.49
hsa-miR-151_(sub_1)  ACUAGACUGAGGCUCCUUGAGG    0.0038            -4.49
hsa-miR-199a*        UACAGUAGUCUGCACAUUGGUU    0.0038            -4.48
hsa-let-7i           UGAGGUAGUAGUUUGUGCU       0.0038            -4.48
hsa-miR-10b          UACCCUGUAGAACCGAAUUUGU    0.0038            -4.47
miR-292-3p           AAGUGCCGCCAGGUUUUGAGUGU   0.0040            -4.46
hsa-miR-136          ACUCCAUUUGUUUUGAUGAUGGA   0.0042            -4.45
mmu-miR-10b          CCCUGUAGAACCGAAUUUGUGU    0.0042            -4.45
hsa-let-7f           UGAGGUAGUAGAUUGUAUAGUU    0.0042            -4.44
hsa-miR-302          UAAGUGCUUCCAUGUUUUGGUGA   0.0042            -4.43
mmu-let-7g           UGAGGUAGUAGUUUGUACAGU     0.0042            -4.43
hsa-miR-10a          UACCCUGUAGAUCCGAAUUUGUG   0.0042            -4.42
hsa-miR-34b          AGGCAGUGUCAUUAGCUGAUUG    0.0042            -4.42
hsa-miR-92           UAUUGCACUUGUCCCGGCCUGU    0.0042            -4.42
hsa-miR-101          UACAGUACUGUGAUAACUGAAG    0.0044            -4.38
hsa-miR-16           UAGCAGCACGUAAAUAUUGGCG    0.0046            -4.37
mmu-miR-339          UCCCUGUCCUCCAGGAGCUCA     0.0046            -4.37
hsa-miR-19a          UGUGCAAAUCUAUGCAAAACUGA   0.0046            -4.37
hsa-miR-152          UCAGUGCAUGACAGAACUUGG     0.0052            -4.35
hsa-miR-23a          AUCACAUUGCCAGGGAUUUCC     0.0052            -4.34
hsa-miR-186          CAAAGAAUUCUCCUUUUGGGCUU   0.0072            -4.30
rno-miR-343          UCUCCCUCCGUGUGCCCAGU      0.0096            -4.29
hsa-miR-140          AGUGGUUUUACCCUAUGGUAG     0.0096            -4.28
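
The caption of Supplementary Table 2 states that the target sequence is the reverse complement of the miRNA sequence. A minimal Python sketch of that operation for RNA strings follows. The function name `reverse_complement` is hypothetical, and only the standard Watson-Crick pairs (A-U, C-G) are assumed.

```python
# Translation table for standard RNA base pairing (A<->U, C<->G).
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence given 5' to 3'."""
    rna = rna.upper()
    unknown = set(rna) - set("ACGU")
    if unknown:
        raise ValueError(f"unexpected symbols in sequence: {sorted(unknown)}")
    # Complement every base, then reverse to read the result 5' to 3'.
    return rna.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, so the same helper converts in either direction between an miRNA sequence and its target sequence.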

Supplementary Table 3. miRNA data analysis: Tests for correlation coefficients.
The table reports the names and correlation coefficients for the twenty most significantly co-expressed pairs of miRNAs, according to bootstrap-based single-step maxT Procedure 3. miRNA pairs are sorted in decreasing order of their absolute correlation coefficients rho_n(j,j').

Up-regulated by the proto-oncogene c-MYC (O'Donnell et al., 2005).
Increases cell growth in lung carcinomas (Cheng et al., 2005).
Expressed at lower levels in cancerous and pre-cancerous tissue compared to normal colon tissue (Michael et al., 2003).



Names                                     Correlation coefficient
hsa-miR-106a          hsa-miR-17-5p       0.99
mmu-miR-200b          hsa-miR-200b        0.99
mmu-miR-200b          hsa-miR-200c        0.99
hsa-miR-107           hsa-miR-103         0.99
hsa-miR-200b          hsa-miR-200c        0.99
hsa-miR-145           hsa-miR-143         0.98
hsa-miR-199a_(sub_1)  mmu-miR-199b        0.98
hsa-miR-17-5p         hsa-miR-20_(sub_1)  0.97
hsa-miR-19a           hsa-miR-19b         0.97
hsa-miR-29a           hsa-miR-30a*        0.97
hsa-miR-181a          hsa-miR-181c        0.97
hsa-miR-199a_(sub_1)  hsa-miR-199a*       0.97
hsa-miR-29b_(sub_2)   hsa-miR-29c         0.97
hsa-miR-199a*         mmu-miR-199b        0.96
hsa-miR-200a          hsa-miR-141         0.96
hsa-miR-20_(sub_1)    mmu-miR-106a        0.96
hsa-miR-106a          hsa-miR-20_(sub_1)  0.96
hsa-miR-200a          hsa-miR-200a        0.96
hsa-miR-23b           hsa-miR-23a         0.96
hsa-miR-10a           hsa-miR-10b         0.96
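
The ordering used in Supplementary Table 3, pairs sorted by decreasing absolute correlation coefficient, can be sketched in Python as follows. This example only ranks pairs by the absolute Pearson correlation. It does not perform the bootstrap-based significance testing described in the caption, and the function name `top_correlated_pairs` and its arguments are hypothetical.

```python
import numpy as np


def top_correlated_pairs(x, names, k=20):
    """Rank variable pairs by absolute Pearson correlation.

    x: (n, J) matrix with one column per variable (e.g. one miRNA);
    names: J labels for the columns. Returns up to k tuples
    (name_j, name_j', correlation), largest absolute value first.
    """
    r = np.corrcoef(x, rowvar=False)             # J x J correlation matrix
    i, j = np.triu_indices(r.shape[0], k=1)      # each unordered pair once
    order = np.argsort(-np.abs(r[i, j]), kind="stable")[:k]
    return [(names[i[o]], names[j[o]], float(r[i[o], j[o]])) for o in order]
```

Only the upper triangle of the matrix is scanned, so each pair (j, j') is reported once and the trivial diagonal entries are excluded.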


Supplementary Figure 6. miRNA data analysis: HOPACH clustering of miRNA expression profiles.
The figure provides a pseudo-color image of the 155 x 155 correlation matrix for the expression profiles of the J=155 miRNAs. Rows and columns are ordered according to the final level of the hierarchical tree of miRNA clusters produced by the HOPACH algorithm with Pearson correlation distance. Pairwise correlation coefficients not significantly different from zero are displayed in black. The remaining correlation coefficients are represented using a white (anti-correlated) to red (positively correlated) color palette. Groups of co-expressed miRNAs appear as red blocks along the diagonal of the correlation matrix. The twenty most significantly correlated pairs of miRNAs from Supplementary Table 3 are marked in blue.