Does God play dice with the Earth?
(And if so, are they loaded?)

Philip B. Stark
Dept. of Statistics
University of California
Berkeley, CA 94720-3860
www.stat.berkeley.edu/~stark

Fourth SIAM Conference
on
Mathematical and Computational Methods in the Geosciences
Albuquerque, NM
16 June 1997

Anything you can do with Bayesian estimators that you can't do with frequentist estimators, you shouldn't.

Apologies to Einstein
What's a Bayesian?
Pragmatic versus epistemic inversion
Should scientific computations be personal?
Are parameters in your ontology?
Are "uninformative" priors uninformative? You don't always want what you get (apologies to the Rolling Stones).
What's the "worst" prior? (Uninformative vs. least favorable priors)
Minimax estimators are Bayes for the least favorable prior
Discretization and its discontents (apologies to Freud).

Definition of a Bayesian

According to I.J. Good (1965),

"...the essential defining property of a Bayesian is that he regards it as meaningful to talk about the probability P(H|E) of a hypothesis H, given evidence E."

Personally, I don't think such statements make sense: H is either true or false; there's nothing in between, and there's no probability involved. Not knowing whether a thing is true does not make its truth random.

Note the contrast between this situation and one where statements are produced "at random." It does make sense to me to talk about the probability that a random statement is true (before the particular statement is produced). This is the basis of frequentist confidence intervals.

The question of whether I should act as if H be true given evidence E is also different: that is the subject of statistical decision theory, including much of the frequentist literature on hypothesis testing.

Pragmatism versus Philosophy, and EDA versus Inference

In my mind, there's a big distinction between, for example, exploration seismology, where the goal is to find oil, and whole-Earth seismology, where the goal is "philosophical," to learn about the structure and functioning of Earth.

In exploration seismology, it doesn't matter whether the methods one uses have a sound philosophical or mathematical foundation---what matters is whether or not they help one find oil. If you drill a well, you find out whether or not your inversion scheme "worked." If a method helps you find oil, that's enough justification for using it.

In whole-Earth seismology, you cannot drill to determine whether your image of CMB topography is accurate. The only justification for studying the deep interior is "philosophical:" wanting to know "the truth" about the interior of Earth. In such a situation, I think that methods of inference need to meet a higher standard of rigor.

Similarly, when one is "exploring" a data set, any visualization method that helps one formulate hypotheses or suggests new experiments that can help one gain insight has value. This branch of Statistics is called "exploratory data analysis" (EDA).

In contrast, if one wants to draw a statistical inference, for example, to test an hypothesis or to make a confidence interval, I think the methodology needs to meet higher standards of rigor---one should be able to demonstrate that over a range of realistic sets of assumptions, the methods will give reliable answers.

In my opinion, Bayesian methods are appropriate for EDA, and are appropriate in exploration seismology to the extent that one can show that they help find oil.

I think Bayesian methods are categorically inappropriate for inference in physical sciences. I'll try to show you why in this talk.

Should Scientific Computations be Personal?

To tie mathematical probability calculations to the "real world" requires a "theory" (definition) of probability---a way to assign meaning to statements like "the probability of A is 1/3."

Bayesian theory is predicated on the Subjective Theory of Probability.

According to the Subjective Theory, probability is a statement about strength of belief--probability is defined in terms of states of mind.

In contrast, in the Frequency Theory, probability is defined in terms of long-term limiting relative frequencies of outcomes of experiments that are, in principle, repeatable by others.

In the subjective theory, "the probability of A is 1/3" is not a statement about A, it's a statement about the state of mind of the person making the assertion. (Of course, our state of mind can be influenced by beliefs about physics, etc., but it need not be.) Furthermore, there is no reason (and no requirement under the subjective theory) that different individuals should assign the same probability to the same event. There is a large literature on "thought experiments" by which one can determine the subjective probability of various events.

We usually claim in Science (even in California) to be learning about "objective reality," using experiments and procedures that can be replicated by others.

If the "thought experiments" to elicit subjective probabilities are repeated by others, the result is meaningless: the probabilities they assign need have no correspondence with ours---their assignments are supposed to reflect their states of mind, not ours. All that Bayesian theory requires is that an individual's subjective probabilities be internally consistent.

Bayesian data analysis is not "reproducible" by others, because others don't necessarily share the same prior.

If one uses Bayesian methods to draw conclusions from a given set of data, his results need not be meaningful to other investigators. They are not obliged to accept his conclusions unless they share his prior probability distribution (and trust that the computations were performed correctly). This seems to obviate the need for scientific journals for Bayesians---they are really speaking to an audience of one.

Either Bayesians need to use a variety of priors so that they have a hope of convincing not only themselves, but others too, or they need to provide a way for others to "plug in" their own priors and draw their own conclusions.

The only way out of this problem seems to be if it happens that the choice of prior doesn't matter.

The choice of prior always matters, but there is a way to measure how much it matters (more later).

Do parameters exist?

Do you have a weight?

If you weigh yourself on a scale, are you trying to refine your posterior (posterior distribution, as opposed to the distribution of your posterior!), or are you trying to figure out about how much you weigh?

Is "the true mass of the Earth at 0:00 GMT on 16 June 1997" a meaningful expression? Or is it only meaningful to talk about probability distributions for the mass of Earth?

In Bayesian data analysis, the primary quantities are probability distributions.

In Frequentist data analysis, the primary quantities are parameters. Probability enters into the design of experiments and the characterization of errors of measurement, but not the thing being estimated.

I believe that physical parameters can be meaningful, and that it makes sense to try to estimate them. I believe that at a particular time, I have a true weight, which a scale measures, but with error.

This attitude does lead to some circumlocutions, however. For example, it makes no sense to me to say "there is a 99% chance that the average radius of Earth is between 6370km and 6375km." At any given time, the average radius is some fixed distance, and the interval [6370km, 6375km] is a fixed range. Either the average radius is between 6370 and 6375km, or it isn't. There's no probability present.

In contrast, a Bayesian would have no trouble making a statement of that kind (see the statement by I.J. Good, above).

Are uninformative priors uninformative?

You don't always want what you get.

"Flat" priors are informative.
Backus (1987, 1988) shows that if you try to "soften" a quadratic bound to a probability distribution, you invariably get increasingly precise information about the exact value of the functional as the dimension of the model space grows.
There is no way to capture a strict bound (such as a bound on the energy stored in Earth's magnetic field) using a prior probability distribution without injecting new information not present in the bound. I'll give a one-dimensional example below to show that uniform priors are informative.
Can you always express ignorance as a probability distribution, even an improper one?
There's a big difference between no knowledge, and a distribution that is uniform in some parametrization.
David Freedman will talk about this more.
Consider the problem of estimating a probability p from the number of successes N in n independent trials with the same probability p of success. The number of successes N has a Binomial distribution with parameters n and p. Assuming that p has a prior distribution that is uniform on the interval [0,1] is quite different from assuming that log(p/(1-p)) has a uniform distribution on the real line, even though that is seen as quite natural parametrization in some applications.

I believe that we should use all available physical constraints, etc., in drawing conclusions from data. I just think we should avoid making things up. It's essentially impossible to use a prior without inadvertantly making things up in the process.

What's the worst prior?

Statistical Decision Theory gives a formalism for evaluating and comparing estimators.

Decision theoretical approach: grade estimators according to some loss functional.

Mean-squared error (MSE) is common, but can use, e.g., the length of confidence intervals as a loss functional.

An estimator

G: Rⁿ-->Y

d |--> G(d)

is a measurable mapping from data space Rⁿ to parameter space Y.
Y might be finite or infinite-dimensional.

A loss function

L: Y×T-->R⁺

(t, s) |--> L(t, s)

is a mapping from ordered pairs of members in the parameter space to the positive real numbers. It tells us how much it "costs" to estimate that the true parameter is s when it is in fact t. Things are usually set up so that L(t, t) = 0; that is, if the estimate is correct, we pay nothing. The choice of G is fairly arbitrary, but the context of the problem can suggest useful functions. One of the most common (because it's tractable by calculus) is squared-error loss, where in the case that Y is one-dimensional, G(t, s) = (t - s)².

If t be the true value of the parameter and our estimator gives G(d), we lose

L(t, G(d)).

Because the data contain random noise, the value L takes is random. Let P be the probability distribution of the random data errors. We define

R(t, G) = E_P(L(t, G(d)),

the expected value of the loss as the data errors vary.

Let T be a probability distribution on the parameter (theory) space Y of the experiment (we allow the possibility that T is a point mass at the single theory t), and let T~T. We define

R(T, G) = E_P,T(L(T, G(d)),

where E_P,T is expectation with respect to the joint distribution of the parameter and the data error.

The procedure G^B = G^B(T, L) with the smallest risk R(T, G) is the Bayes procedure for the prior T and loss functional L.

It's a theorem that for squared-error loss ( L(t, G(d)) = (t-G(d))² ), the Bayes procedure is the posterior mean.

However, the risk obviously depends on the prior as well as the loss function and the estimator. How much does the risk vary for different priors? Is the risk largest for "uninformative" priors?

Clearly, we can define a prior that makes the Bayes risk zero: just make the prior a point mass at a particular theory. For other priors, the risk can be higher. How much higher? Does small Bayes risk in a particular problem imply that the parameter is constrained well by the data, or could it be an artifact of the prior?

Suppose we know a priori that the true parameter t is in the constraint set C, a subset of Y.

In the Bayesian world, our prior T would presumably assign probability one to C in this case.

In the frequentist world, a standard approach is to consider the largest loss a given estimator has over the set C of a priori acceptable models; that is, for an estimator G we consider

R^*(C, G) = sup{R(t, G): t is in C}.

The procedure with the smallest maximum expected loss (when it exists) is the minimax procedure for that constraint set and loss functional.

Example: Estimating a Bounded Normal Mean

Suppose we observe X~N(a,1), and we know a priori that -b <= a <= b.

This is a one-dimensional version of a canonical inverse problem, for example, estimating the magnetic field using the constraint that the rest mass of the energy stored in the field is less than Earth's mass.

It can be shown (See Donoho, 1994, and references therein) that the minimax risk for squared-error loss for this problem is

R^** >= 4/5×b²/(b² + 1).

For a uniform prior on [-b, b], the posterior density of a given the observation x is

f(x-a)1_-b<=a<=b/(F(x+b)-F(x-b)),

where f(x) is the Gaussian density, and F(x) is the Gaussian cumulative distribution function (cdf).

Using Monte-Carlo simulation (1000 repetitions) to calculate the Bayes risk, I found

b	Bayes risk (simulated)	Minimax lower bound
0.5	0.08	0.16
1	0.25	0.40
2	0.55	0.64

The Bayes risk for the uniform prior is considerably smaller than the minimax risk. This can be interpreted to say that the impression of small uncertainty (as measured by mean squared error) in the Bayes estimate arises from the prior, rather than simply the "hard" constraint that -b <= a <= b.

Minimax estimators are Bayes estimators for the least favorable prior

In a wide range of problems, the minimax estimator is the Bayes estimator for a particular prior, the least favorable prior.

That is, there is a prior for which the Bayes risk is the same as the frequentist minimax computation would give; furthermore, that risk is at least as large as the risk under any other prior.

This says that if the minimax risk is small, so is the Bayes risk for every possible prior, while if the minimax risk is large, the choice of priors matters a great deal.

Rarely is the least favorable prior a flat, "uninformative" prior.

Thus the frequentist calculation of a minimax risk tells us the extent to which in a Bayesian analysis the choice of prior determines the apparent uncertainty.

Discretization and its discontents

You can't always get what you want.

It is often asserted in Bayesian and frequentist inversions that it's OK to truncate the model space to a finite-dimensional one, because data sets have limited resolution.

In both Bayesian and frequentist inversions, truncating the model space can give the erroneous impression that one can estimate properties of the model that are actually not at all constrained by the data---the finite apparent uncertainty is entirely an artifact of the parametrization, which is typically ad hoc.

No matter how much we would like to be able to estimate something, the wish does not make it possible

The heuristics applied are

"if the data don't resolve it, you don't need to model it" (the ostrich hypothesis)
"if you use smooth [simple, coarse-grained, blurred, ...] models in an inversion, you get a smoothed [simple, coarse-grained, blurred, ...] image of the Earth"

There is a profound difference between what you can't see and what isn't there.

Smoothed models aren't necessarily "blurry" versions of the truth. A "smoothed" (bandpassed) Dirac delta function has structure everywhere. Similarly the truncated spherical harmonic expansion of a "bump" on the core-mantle boundary has topography all over the core.

Backus's work on inverse problems (see especially 1989, 1996) is all geared to avoiding or accounting for the effects of discretization; see also Stark (1992bc) and Genovese and Stark (1996). These papers also present frequentist joint (simultaneous) confidence intervals for multiple parameters.

Bibliography

Backus, G.E., 1996. Truncation and procrastination as inverse techniques, PEPI, 98.
Backus, G.E., 1987. Isotropic probability measures in infinite-dimensional spaces, PNAS, 84, 8755-8757.
Backus, G.E., 1988a. Bayesian inference in geomagnetism, Geophys. J., 92, 125-142.
Backus, G.E., 1988b. Comparing hard and soft prior bounds in geophysical inverse problems, Geophys. J., 94, 249-261.
Backus, G.E., 1989. Confidence set inference with a prior quadratic bound, Geophys. J., 97, 119-150.
Donoho, D.L., 1994. Statistical estimation and optimal recovery. Ann. Stat., 22, 238-270.
Genovese, C.R. and Stark, P.B., 1996. Data reduction and statistical consistency in linear inverse problems, PEPI, 98, 143-162.
Good, I.J., 1965. The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Research Monograph 30, MIT Press, Cambridge, MA.
Le Cam, L., 1986. Asymptotic methods in statistical decision theory, Springer-Verlag, New York.
Stark, P.B., 1992a. Affine minimax confidence intervals for a bounded Normal mean, Stat. & Prob. Lett., 13, 39-44.
Stark, P.B., 1992b. Minimax confidence intervals in geomagnetism, Geophys. J. Intl., 108, 329-338.
Stark, P.B., 1992c. Inference in infinite-dimensional inverse problems: Discretization and duality, J. Geophys. Res., 97, 14,055-14,082.

Does God play dice with the Earth? (And if so, are they loaded?)

Philip B. Stark Dept. of Statistics University of California Berkeley, CA 94720-3860 www.stat.berkeley.edu/~stark