On prediction tournaments

Let me present this as a puzzle. Is it possible to devise a quiz contest (on any topic) with the following properties? Surely this looks impossible at first sight -- how can one grade objectively without knowing the answers? Yet I can describe a setting where it is possible. Puzzles like this inevitably involve some kind of trick. But my trick is rather mild -- an everyday quiz can be graded quickly, but for my quiz you have to wait a while to find your scores.

A prediction tournament

Here are 4 questions that people with an interest in world affairs might be pondering as I write (June 2018). In the current Good Judgment Open Classic Geopolitical Challenge contestants are asked to assess the current probabilities of such future events. To reiterate, they are not asked to give a Yes/No prediction, but instead are asked to give a numerical probability, and to update as time passes and relevant news/analysis appears. Unlike school quizzes, you are free to use any sources you can -- if you happen to be a personal friend of someone with inside information then you could ask them for a hint.

The point is that no-one will ever know the correct probabilities on a given day. Nevertheless, as outlined below it is possible to objectively measure participants' relative ability to assess such probabilities, after the outcomes are known. So this fits the rules for my puzzle.

Scoring a prediction tournament

Represent an event as a random variable, taking value \(1\) if the event happens and value \(0\) if not. This allows us to use "squared error" to score our predictions. If we predict 70% probability for an event, then our "squared error" is \begin{eqnarray*} \mbox{(if event happens)} \quad (1 - 0.7)^2 = 0.09 \\ \mbox{(if event doesn't happen)} \quad (0.7 - 0)^2 = 0.49. \end{eqnarray*} Suppose you participate in a prediction tournament, and for simplicity let's suppose that participants just make a one-time forecast, a probability prediction, for each event. After the outcomes of all the events are known, your final score will be the average of these squared errors. As in golf, you are trying to get a low score.
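The scoring rule just described can be sketched in a few lines of Python. This is only an illustration: the function names `squared_error` and `final_score` are my own, not anything used by an actual tournament.

```python
def squared_error(forecast, outcome):
    """Squared error for one event: forecast is a probability, outcome is 1 or 0."""
    if not 0.0 <= forecast <= 1.0:
        raise ValueError("forecast must be a probability between 0 and 1")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 1 (happened) or 0 (did not happen)")
    return (forecast - outcome) ** 2


def final_score(forecasts, outcomes):
    """Average squared error over all events; as in golf, lower is better."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need one outcome per forecast, and at least one event")
    errors = [squared_error(p, x) for p, x in zip(forecasts, outcomes)]
    return sum(errors) / len(errors)
```

For the 70% forecast above, `squared_error(0.7, 1)` is about 0.09 and `squared_error(0.7, 0)` is about 0.49, up to floating-point rounding.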

In a prediction tournament there will be a large number \(n\) of events, with unknown probabilities \((q_i, 1 \le i \le n) \) and with forecasts \((p_{A,i}, p_{B,i}, \ldots, 1 \le i \le n) \) chosen by participants \(A, B, \ldots\). We would like to measure how good a participant is by the average squared-error of their forecast probabilities \begin{equation} \mathrm{MSE}(A) = \frac{1}{n} \sum_i (p_{A,i} - q_i)^2 . \label{def-MSE} \end{equation} But this is impossible to know, because we don't know the \(q\)'s, the true probabilities. However, a little algebra shows that for the final scores (the average of the scores on each event) \begin{eqnarray} E[\mbox{final score (A)}] - E[\mbox{final score (B)}] = \mathrm{MSE}(A) - \mathrm{MSE}(B) . \label{average-3} \end{eqnarray} Now your actual final score is random, but by a "law of large numbers" argument, for a large number of events it will be close to its mean. Informally, \begin{eqnarray*} \mbox{final score (A)} &=& E[\mbox{final score (A)}] \pm \mbox{ small random effect} . \end{eqnarray*} Putting all this together, \begin{eqnarray*} \mathrm{MSE}(A) - \mathrm{MSE}(B) = \mbox{final score (A)} - \mbox{final score (B)} \pm \mbox{ small random effect} . \end{eqnarray*} Now we are done: the MSEs are our desired measure of skill, and from the observed final scores we can tell the relative skills of the different participants, up to a small amount of luck.
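A small simulation can make this argument concrete. The sketch below is not part of any real tournament: it assumes true probabilities drawn uniformly from [0, 1], and two hypothetical participants A and B whose forecasts are the true probabilities plus uniform noise of width 0.1 and 0.3 respectively, clipped to [0, 1]. In a simulation, unlike real life, we know the \(q\)'s, so we can compare the gap in observed final scores with the gap in MSEs.

```python
import random


def final_score(forecasts, outcomes):
    """Average squared error between forecasts and the 0/1 outcomes."""
    return sum((p - x) ** 2 for p, x in zip(forecasts, outcomes)) / len(forecasts)


def mse(forecasts, true_probs):
    """Average squared error between forecasts and the true probabilities."""
    return sum((p - q) ** 2 for p, q in zip(forecasts, true_probs)) / len(forecasts)


def clip(value):
    """Force a noisy forecast back into the interval [0, 1]."""
    return min(1.0, max(0.0, value))


def simulate(n=20000, seed=0):
    """Return (gap in final scores, gap in MSEs) for A versus B."""
    rng = random.Random(seed)
    # Assumption: true probabilities are uniform on [0, 1].
    q = [rng.random() for _ in range(n)]
    # Assumption: A is the more skilled participant (smaller noise) than B.
    p_a = [clip(qi + rng.uniform(-0.1, 0.1)) for qi in q]
    p_b = [clip(qi + rng.uniform(-0.3, 0.3)) for qi in q]
    # Each event happens with its true probability.
    outcomes = [1 if rng.random() < qi else 0 for qi in q]
    score_gap = final_score(p_a, outcomes) - final_score(p_b, outcomes)
    mse_gap = mse(p_a, q) - mse(p_b, q)
    return score_gap, mse_gap


if __name__ == "__main__":
    score_gap, mse_gap = simulate()
    print(f"final score (A) - final score (B) = {score_gap:.4f}")
    print(f"MSE(A) - MSE(B)                   = {mse_gap:.4f}")
```

With many events the two printed gaps should be close to each other and both negative, reflecting that A is the better forecaster. With few events (try `simulate(n=20)`) the luck term can be large enough to hide the difference.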

The mathematical bottom line

Rephrasing the argument above and adding the little algebra, an individual's score is conceptually the sum of three terms. Recall \(q_i\) is the (unknown) true probability that the \(i\)'th event happens. The analogy with golf continues to be helpful. A golf course has a "par", the score that an expert should attain. Your score on a round of golf can also be regarded as the sum of three terms. So a prediction tournament is like a golf tournament where no-one knows "par". That is, you can assess people's relative abilities, but we do not have any external standard to assess absolute abilities.
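The "little algebra" can be written out as follows. This is my reconstruction, assuming each forecast \(p_i\) is a fixed number and writing \(X_i\) for the \(1/0\) outcome of the \(i\)'th event, so that \(X_i = 1\) with probability \(q_i\). For a single event \begin{eqnarray*} E[(p_i - X_i)^2] &=& q_i (1 - p_i)^2 + (1 - q_i) p_i^2 \\ &=& p_i^2 - 2 p_i q_i + q_i \\ &=& (p_i - q_i)^2 + q_i (1 - q_i) . \end{eqnarray*} Averaging over the \(n\) events gives, for participant \(A\), \begin{eqnarray*} E[\mbox{final score (A)}] = \mathrm{MSE}(A) + \frac{1}{n} \sum_i q_i (1 - q_i) . \end{eqnarray*} The last sum does not involve the forecasts, so it is the same for every participant and cancels when we subtract the corresponding identity for \(B\). One natural reading of the three terms is therefore: this common term reflecting the intrinsic unpredictability of the events, the participant's own MSE, and the random "luck" difference between the actual final score and its mean.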

Conceptual implications

I use this topic as a running theme which appears often in these pages. As discussed here these prediction market findings provide a touchstone for rejecting extreme philosophies of Probability. And there is an intriguing paradox that, under somewhat plausible assumptions, the winner of a prediction tournament is most likely to be one of the good-but-not-best cohort. Note also that the first way you might consider for scoring a prediction tournament -- checking whether, of the events estimated as having 60-70% chance, about 60-70% actually occur (and similarly for other ranges) -- is bad because you can game the system via dishonest announcements.
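Here is one possible illustration of why a calibration check alone is weak. It is not necessarily the gaming strategy intended above, and the data are hypothetical. Imagine a forecaster who is privately certain about 20 events (13 will happen, 7 will not) but announces 65% for all of them. The 60-70% range then looks perfectly calibrated, just as the honest announcements of 100% and 0% would, whereas squared-error scoring separates the two.

```python
def squared_error_score(forecasts, outcomes):
    """Average squared error between forecasts and 0/1 outcomes (lower is better)."""
    return sum((p - x) ** 2 for p, x in zip(forecasts, outcomes)) / len(forecasts)


def bin_frequency(forecasts, outcomes, low, high):
    """Fraction of events that occurred among those forecast between low and high."""
    hits = [x for p, x in zip(forecasts, outcomes) if low <= p <= high]
    return sum(hits) / len(hits)


# Hypothetical data: 13 events that happen and 7 that do not.
outcomes = [1] * 13 + [0] * 7

# Honest announcements of a forecaster who is certain of every outcome.
honest = [float(x) for x in outcomes]

# Dishonest announcements: 65% on everything, despite being certain.
hedged = [0.65] * len(outcomes)

if __name__ == "__main__":
    print("frequency in 60-70% range:", bin_frequency(hedged, outcomes, 0.6, 0.7))
    print("squared-error score, hedged:", squared_error_score(hedged, outcomes))
    print("squared-error score, honest:", squared_error_score(honest, outcomes))
```

Both sets of announcements pass the calibration check, but only the honest ones earn the best possible squared-error score of 0.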


What's the bigger picture here? After all, one could just say it's obvious that some people will be better than others at geopolitical forecasts, just as some people are better than others at golf.

To me it is self-evident that one should make predictions about uncertain future events in terms of probabilities rather than Yes/No predictions. So it is curious that, outside of gambling-like contexts, this is rarely done. Indeed almost the only everyday context where one sees numerical probabilities expressed is the chance of rain tomorrow. A major inspiration for current interest in this topic has been the work of Philip Tetlock. His 2006 book Expert Political Judgment: How Good Is It? How Can We Know? looks at extensive data on how good geopolitical forecasts from political experts have been in the past (short answer: not very good). That book contains more mathematics along the "how to assess prediction skill" theme of this article.

What makes some people better than others at forecasting, and can we learn from them? That is the topic of Tetlock's bestselling 2015 book Superforecasting: The Art and Science of Prediction, which reports in particular on an IARPA-sponsored study of an earlier prediction tournament, one in which participants were assigned to teams and encouraged to discuss with teammates. The conclusions relate success both to the cognitive style of individuals and to team dynamics.

My own extended account of this topic with a little math is a few sections in my ongoing lecture notes.