The Cult of Statistical Significance

This copies the title of an argumentative 2008 book by Stephen Ziliak and Deirdre McCloskey. I myself say that the misuse of statistical significance was one of the greatest scientific disasters of the 20th century, unfortunately continuing into the 21st. I agree with the American Statistical Association's Statement on p-Values, which gives a consensus view of the substantive issues, so the reader unfamiliar with the issues would benefit from reading that first. Here I just give some commentary.

If all you have is a hammer ........

The oft-quoted saying "if all you have is a hammer, then everything looks like a nail" is apposite. Tests of significance do one thing really well: they stop you from jumping to conclusions based on too little data. Because this topic lends itself to definite rules that can be mechanically implemented, it has been prominently featured in introductory statistics courses and textbooks for 80 years. But any claim that a test establishes some positive conclusion should be approached with skepticism.

Misleading presentation of results of statistical analysis

Informally, a test of significance seeks to determine whether some observed effect or difference is, beyond reasonable doubt, real rather than plausibly attributable to chance. But in this context the words real and chance and significance are all at least somewhat misleading. One issue (see below) is making sense of the "due to chance" part. Even when one can do so, in textbook settings like random sampling or randomized controlled experiments, the notion of asking whether the size of an effect is zero or not zero seems perverse. The reformulation as (frequentist) confidence intervals for the size of effect seems much more informative about the data. The corresponding Bayesian analysis is better in the "presentation" sense, because stating a posterior distribution for "size of effect" reminds one that this is a statement about a probability model, not just about the data.
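To make the contrast concrete, here is a minimal sketch (in Python, with made-up data, assuming numpy and scipy are available) that summarizes the same sample three ways: a p-value for "the effect is zero", a frequentist confidence interval for the size of effect, and an approximate Bayesian posterior under a flat prior. This is only an illustration of the presentation point, not a recommended analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # made-up data with a modest true effect

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
se = s / np.sqrt(n)

# 1. Test of significance: "is the mean zero?"
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
print(f"p-value for 'mean = 0': {p_value:.3f}")

# 2. Confidence interval: "how big is the effect?"
lo, hi = stats.t.interval(0.95, n - 1, loc=xbar, scale=se)
print(f"95% confidence interval for the mean: ({lo:.2f}, {hi:.2f})")

# 3. Bayesian presentation: under a flat prior (and approximating the variance
#    as known), the posterior for the mean is roughly Normal(xbar, se^2).
print(f"approximate posterior for the mean: Normal({xbar:.2f}, {se:.2f}^2)")
```

The interval and the posterior say something about how large the effect might be; the bare p-value only addresses whether it is exactly zero.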

One illustration mentioned here is the debate over whether "hot hands" are real, that is, whether observed sports streaks are more extreme than expected "by chance". The very fact that a debate still exists, after much study, implies that the effect can only be very small -- so it hardly matters whether it's real or not.
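To see why such a debate can drag on, here is a rough simulation sketch (Python, assuming numpy; the numbers are my own illustrative choices, and the sketch ignores the finite-sequence subtleties discussed in the hot-hand literature). Give a simulated shooter a genuine but tiny streakiness effect, and a standard test usually fails to detect it in a season's worth of shots.

```python
import numpy as np

rng = np.random.default_rng(1)

def season(n_shots=500, p_after_hit=0.52, p_after_miss=0.48):
    """Simulate one season of hits/misses as a two-state Markov chain."""
    shots = np.empty(n_shots, dtype=int)
    shots[0] = rng.random() < 0.5
    for i in range(1, n_shots):
        p = p_after_hit if shots[i - 1] else p_after_miss
        shots[i] = rng.random() < p
    return shots

def detects_streakiness(shots, z_crit=1.96):
    """Two-proportion z-test: P(hit | previous hit) vs P(hit | previous miss)."""
    prev, curr = shots[:-1], shots[1:]
    n1, n2 = prev.sum(), (1 - prev).sum()
    p1, p2 = curr[prev == 1].mean(), curr[prev == 0].mean()
    pool = curr.mean()
    se = np.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    return abs(p1 - p2) / se > z_crit

power = np.mean([detects_streakiness(season()) for _ in range(2000)])
print(f"detection rate over 2000 simulated seasons: {power:.0%}")
```

With a hit probability of 52% after a make versus 48% after a miss, the test rejects only a small fraction of the time, so studies can plausibly keep coming down on either side.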

The null hypothesis is often a straw man

A null hypothesis starts as the possibility that an observed difference could have occurred "just by chance", but any analysis requires that this possibility first be formulated as a math model. In the simplest settings of freshman textbooks, the model is that the data (as differences from an expected value) are IID (independent and identically distributed, visualized as mathematically similar to random draws from a box of numbered tickets) and the null hypothesis is that the distribution being sampled from has mean zero. Then saying that the observed difference is statistically significant is saying that (beyond reasonable doubt) the mean of the distribution is not zero.
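Here is a minimal simulation sketch (Python, assuming numpy) of that box model. Under the null hypothesis, the sample mean exceeds two standard errors in absolute value only about 5% of the time; this is the calibration that a test of significance relies on.

```python
import numpy as np

rng = np.random.default_rng(2)
box = np.array([-2, -1, 0, 1, 2])      # tickets in the box; their average is zero
n, trials = 25, 10_000

exceed = 0
for _ in range(trials):
    draws = rng.choice(box, size=n)    # IID draws with replacement
    xbar = draws.mean()
    s = draws.std(ddof=1)
    if abs(xbar) > 2 * s / np.sqrt(n):
        exceed += 1

print(f"fraction of trials with |xbar| > 2 s / sqrt(n): {exceed / trials:.3f}")
```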

But what is the real-world meaning of this conclusion? In general it is only worth making the effort to knock down a hypothesis if a reasonable person would believe that hypothesis might possibly be correct -- otherwise we are demolishing a straw man. So to interpret any "statistically significant" conclusion as informative, we need to ponder whether the specific model of "pure chance" might possibly seem true to someone who has not seen our data. By analogy, no one would argue as follows:

I'm confident that variables \(x\) and \(y\) are related in some deterministic way. I don't know how they are related, but let me assume (because I can solve it) the quadratic relation \(y = ax^2+bx+c\). I can now conclude \( x=\frac{-b\pm\sqrt{b^2-4a(c-y)}}{2a} \).
But the freshman statistics argument underlying a simple test of significance
I have observations \(x_1, x_2, \ldots, x_n\) with mean \(\bar{x}\) and variance \(s^2\). I don't know much about how these particular values arose, but anyway if I observe that \(\bar{x}\) is greater than \(2 s/\sqrt{n} \) then I can confidently conclude that observations generated in this way will not have long-run average equal to zero
is equally suspect. It is not valid if the observations are "random" in only the informal sense of haphazard or unpredictable; it requires us to believe the very specific IID assumption, exactly as the former argument requires us to believe the very specific quadratic assumption. Good freshman textbooks (like Berkeley's Freedman-Pisani-Purves) emphasize this point, but many scientists never got the message. And as mentioned elsewhere, statisticians themselves are reluctant to ask the basic question "when is it reasonable to regard data as mathematically equivalent to an IID sample from some unknown distribution?"
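To illustrate how the rule breaks when "random" means merely haphazard rather than IID, here is a sketch (Python, assuming numpy; a toy example of my own). The data have long-run average zero but are positively autocorrelated (an AR(1) process), and the \(2 s/\sqrt{n}\) rule then rejects far more often than the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(3)

def false_alarm_rate(rho, n=100, trials=5_000):
    """Fraction of trials where |xbar| > 2 s / sqrt(n), for AR(1) data with mean zero."""
    hits = 0
    for _ in range(trials):
        eps = rng.normal(size=n)
        x = np.empty(n)
        x[0] = eps[0]
        for i in range(1, n):
            x[i] = rho * x[i - 1] + eps[i]   # autocorrelated, long-run average zero
        xbar, s = x.mean(), x.std(ddof=1)
        if abs(xbar) > 2 * s / np.sqrt(n):
            hits += 1
    return hits / trials

print(f"IID case (rho = 0):           {false_alarm_rate(0.0):.3f}")   # close to 0.05
print(f"autocorrelated (rho = 0.7):   {false_alarm_rate(0.7):.3f}")   # far above 0.05
```

The null hypothesis of "mean zero" is true in both cases; only the IID assumption fails, and that alone is enough to make the "statistically significant" verdict meaningless.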

Why does the misuse continue?

Statisticians blame science journals for their continued insistence on reporting \(p\)-values. The journals should instead follow the advice in the American Statistical Association's Statement on p-Values mentioned above. This is an instance of the cynical view
The only part of academic life not mired in tradition is the cost of tuition. (xxx lost author citation).
For a (fortunately less widespread) Bayesian counterpoint see The cult of informationless priors.