## The actual two faces of Statistics: to explain or to predict?

Popular books often over-sell the acceptance of Bayesian methods as a revolution in Statistics,
a particularly egregious example being provided by the subtitle
*How Bayes' rule cracked the Enigma Code, hunted down Russian submarines,
and emerged triumphant from two centuries of controversy*.
Of course the advent of desktop computing and then of "Big Data" has changed the way
everyone does Statistics.
The idea that there exist alternative methodologies labeled "frequentist" and "Bayesian"
has an element of truth, but the idea that these are **competing** methodologies is misleading.
I have asked working statisticians:

> Do you know any example of data for which there are reasonable frequentist and Bayesian analyses which give substantially different conclusions?

because this would make an interesting discussion topic in my undergraduate course.
But I have never found such an example.
The lesson is that explicitly Bayesian methods are useful in some contexts, and other methods in
other contexts.
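To see why the two analyses so rarely disagree, consider the simplest possible setting: estimating a binomial proportion. The sketch below (the counts are made up purely for illustration) computes a frequentist 95% confidence interval and a Bayesian 95% credible interval under a flat prior; with any reasonable amount of data the two intervals are nearly identical.

```python
import random

random.seed(0)

# Hypothetical data: 47 successes in 100 trials (illustrative numbers only).
n, k = 100, 47
p_hat = k / n

# Frequentist: Wald 95% confidence interval for the proportion.
se = (p_hat * (1 - p_hat) / n) ** 0.5
freq_lo, freq_hi = p_hat - 1.96 * se, p_hat + 1.96 * se

# Bayesian: a flat Beta(1, 1) prior gives a Beta(k + 1, n - k + 1) posterior;
# approximate the 95% credible interval by posterior sampling.
draws = sorted(random.betavariate(k + 1, n - k + 1) for _ in range(100_000))
bayes_lo, bayes_hi = draws[2_500], draws[97_500]

print(f"frequentist 95% interval: ({freq_lo:.3f}, {freq_hi:.3f})")
print(f"Bayesian    95% interval: ({bayes_lo:.3f}, {bayes_hi:.3f})")
```

The two intervals differ only in the third decimal place here; substantive disagreement would require a strongly informative prior or very little data.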
One view of a more substantial dichotomy was given by Leo Breiman in a paper
*Statistical modeling: The two cultures*:

> One [culture] assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.

A helpful account of the latter culture is given by David Donoho in a paper
*50 years of Data Science*
in a section titled *The Predictive Culture’s Secret Sauce*.
He emphasized that the dramatic success of
Machine Learning has been facilitated by
structured competitions such as the
Netflix Challenge,
where one can judge which prediction methods work best on real data.
My own take on this dichotomy comes from the title of a Galit Shmueli paper:
*To explain or to predict?*
The early- and mid-20th-century development of mathematical statistics
focused on data from science experiments -- seeking to explain whether data were consistent with a model
-- as can be seen from the data sources for Fisher's 1925 *Statistical Methods for Research Workers*.
In the later-20th-century, statistical data about the human social and economic world became more prominent,
but here simple explanatory models reflecting "reality" rather than convenience are much less plausible.
What we now see every day, in Google's search results and Amazon's suggested purchases,
are instead just the output of algorithms **predicting** what we might like based on past data from similar customers.
(None of the three papers above is easy reading, but I encourage undergraduates interested in
the conceptual side of Statistics to look at them.)