(And if so, are they loaded?)

Dept. of Statistics

University of California

Berkeley, CA 94720-3860

**Fourth SIAM Conference on Mathematical and Computational Methods in the Geosciences**

**Albuquerque, NM, 16 June 1997**

- Apologies to Einstein
- What's a Bayesian?
- Pragmatic versus epistemic inversion
- Should scientific computations be personal?
- Are parameters in your ontology?
- Are "uninformative" priors uninformative? You don't always want what you get (apologies to the Rolling Stones).
- What's the "worst" prior? (Uninformative vs. least favorable priors)
- Minimax estimators are Bayes for the least favorable prior
- Discretization and its discontents (apologies to Freud).

According to I.J. Good (1965),

"...the essential defining property of a Bayesian is that he regards it as meaningful to talk about the probability *P*(H|E) of a hypothesis H, given evidence E."

Personally, I don't think such statements make
sense: *H* is either true or false; there's nothing in
between, and there's no probability involved. Not knowing whether
a thing is true does not make its truth random.

Note the contrast between this situation and one
where *statements* are produced "at random." It
does make sense to me to talk about the probability that a random
statement is true (before the particular statement is produced).
This is the basis of frequentist confidence intervals.

The question of whether I should *act as if H were true, given evidence E* is also different: that is the subject of statistical decision theory, including much of the frequentist literature on hypothesis testing.

In my mind, there's a big distinction between, for example, exploration seismology, where the goal is to find oil, and whole-Earth seismology, where the goal is "philosophical," to learn about the structure and functioning of Earth.

In exploration seismology, it doesn't matter whether the methods one uses have a sound philosophical or mathematical foundation---what matters is whether or not they help one find oil. If you drill a well, you find out whether or not your inversion scheme "worked." If a method helps you find oil, that's enough justification for using it.

In whole-Earth seismology, you cannot drill to determine whether your image of CMB topography is accurate. The only justification for studying the deep interior is "philosophical:" wanting to know "the truth" about the interior of Earth. In such a situation, I think that methods of inference need to meet a higher standard of rigor.

Similarly, when one is "exploring" a data set, any visualization method that helps one formulate hypotheses or suggests new experiments that can help one gain insight has value. This branch of Statistics is called "exploratory data analysis" (EDA).

In contrast, if one wants to draw a statistical inference, for example, to test an hypothesis or to make a confidence interval, I think the methodology needs to meet higher standards of rigor---one should be able to demonstrate that over a range of realistic sets of assumptions, the methods will give reliable answers.

In my opinion, Bayesian methods are appropriate for EDA, and are appropriate in exploration seismology to the extent that one can show that they help find oil.

I think Bayesian methods are categorically inappropriate for inference in physical sciences. I'll try to show you why in this talk.

To tie mathematical probability calculations to the "real world" requires a "theory" (definition) of probability---a way to assign meaning to statements like "the probability of A is 1/3."

Bayesian theory is predicated on the Subjective Theory of Probability.

According to the Subjective Theory, probability is a statement about strength of belief---probability is *defined* in terms of states of mind.

In contrast, in the Frequency Theory, probability is defined in terms of long-term limiting relative frequencies of outcomes of experiments that are, in principle, repeatable by others.

In the subjective theory, "the probability of A is 1/3" is not a statement about A, it's a statement about the state of mind of the person making the assertion. (Of course, our state of mind can be influenced by beliefs about physics, etc., but it need not be.) Furthermore, there is no reason (and no requirement under the subjective theory) that different individuals should assign the same probability to the same event. There is a large literature on "thought experiments" by which one can determine the subjective probability of various events.

We usually claim in Science (even in California) to be learning about "objective reality," using experiments and procedures that can be replicated by others.

If the "thought experiments" to elicit
subjective probabilities are repeated by others, the result is
meaningless: the probabilities they assign need have no
correspondence with ours---their assignments are supposed to
reflect *their* states of mind, not ours. All that
Bayesian theory requires is that an individual's subjective
probabilities be internally consistent.

Bayesian data analysis is not "reproducible" by others, because others don't necessarily share the same prior.

If one uses Bayesian methods to draw conclusions from a given set of data, his results need not be meaningful to other investigators. They are not obliged to accept his conclusions unless they share his prior probability distribution (and trust that the computations were performed correctly). This seems to obviate the need for scientific journals for Bayesians---they are really speaking to an audience of one.

Either Bayesians need to use a variety of priors so that they have a hope of convincing not only themselves, but others too, or they need to provide a way for others to "plug in" their own priors and draw their own conclusions.

The only way out of this problem seems to be if it happens that the choice of prior doesn't matter.

The choice of prior *always* matters, but
there is a way to measure how much it matters (more later).

Do you have a weight?

If you weigh yourself on a scale, are you trying to refine your posterior (posterior distribution, as opposed to the distribution of your posterior!), or are you trying to figure out about how much you weigh?

Is "the true mass of the Earth at 0:00 GMT on 16 June 1997" a meaningful expression? Or is it only meaningful to talk about probability distributions for the mass of Earth?

In Bayesian data analysis, the primary quantities are probability distributions.

In Frequentist data analysis, the primary quantities are parameters. Probability enters into the design of experiments and the characterization of errors of measurement, but not the thing being estimated.

I believe that physical parameters can be meaningful, and that it makes sense to try to estimate them. I believe that at a particular time, I have a true weight, which a scale measures, but with error.

This attitude does lead to some circumlocutions, however. For example, it makes no sense to me to say "there is a 99% chance that the average radius of Earth is between 6370km and 6375km." At any given time, the average radius is some fixed distance, and the interval [6370km, 6375km] is a fixed range. Either the average radius is between 6370 and 6375km, or it isn't. There's no probability present.

In contrast, a Bayesian would have no trouble making a statement of that kind (see the statement by I.J. Good, above).

"Flat" priors are informative.

Backus (1987, 1988) shows that if you try to "soften" a quadratic bound to a probability distribution, you invariably get increasingly precise information about the exact value of the functional as the dimension of the model space grows.

There is no way to capture a strict bound (such as a bound on the energy stored in Earth's magnetic field) using a prior probability distribution without injecting new information not present in the bound. I'll give a one-dimensional example below to show that uniform priors are informative.

Can you always express ignorance as a probability distribution, even an improper one?

There's a big difference between no knowledge, and a distribution that is uniform in some parametrization.

David Freedman will talk about this more.

Consider the problem of estimating a probability *p* from the number of successes *N* in *n* independent trials with the same probability *p* of success. The number of successes *N* has a Binomial distribution with parameters *n* and *p*. Assuming that *p* has a prior distribution that is uniform on the interval [0, 1] is quite different from assuming that *log(p/(1-p))* has a uniform distribution on the real line, even though the latter is seen as a quite natural parametrization in some applications.
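
The difference between these two "uninformative" parametrizations can be made concrete. A minimal sketch in Python: since a flat prior on the whole real line is improper, I truncate the uniform prior on *log(p/(1-p))* to [-5, 5] purely for illustration (the interval is my own choice). The two priors assign very different probabilities to the same event about *p*:

```python
import math

# Prior 1: p uniform on [0, 1].
# Prior 2: logit(p) = log(p/(1-p)) uniform.  A flat prior on the whole line
# is improper, so as a stand-in we truncate it to [-5, 5] (an assumption
# made purely for illustration).
lo, hi = -5.0, 5.0

def logit(p):
    return math.log(p / (1.0 - p))

# Probability of the event {p < 0.1} under each "uninformative" prior:
prob_uniform_p = 0.1                                # uniform on [0, 1]
prob_uniform_logit = (logit(0.1) - lo) / (hi - lo)  # uniform on logit scale

print(prob_uniform_p)                 # 0.1
print(round(prob_uniform_logit, 2))   # 0.28
```

The uniform-logit prior piles mass near 0 and 1 (its induced density on *p* is proportional to 1/(p(1-p))), so "flat" in one parametrization is far from flat in the other.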

I believe that we should use all available physical constraints, etc., in drawing conclusions from data. I just think we should avoid making things up. It's essentially impossible to use a prior without inadvertently making things up in the process.

Statistical Decision Theory gives a formalism for evaluating and comparing estimators.

Decision theoretical approach: grade estimators
according to some *loss functional*.

Mean-squared error (MSE) is common, but one can use, *e.g.*, the length of confidence intervals as a loss functional.

An estimator

*G*: **R**^{n} --> **Y**

**d** |--> *G*(**d**)

is a measurable mapping from the data space **R**^{n} to the parameter space **Y**. **Y** might be finite or infinite-dimensional.

A loss function

*L*: **Y**×**Y** --> **R**^{+}

(**t**, **s**) |--> *L*(**t**, **s**)

is a mapping from ordered pairs of members of the parameter space to the nonnegative real numbers. It tells us how much it "costs" to estimate that the true parameter is **s** when it is in fact **t**. Things are usually set up so that *L*(**t**, **t**) = 0; that is, if the estimate is correct, we pay nothing. The choice of *L* is fairly arbitrary, but the context of the problem can suggest useful functions. One of the most common (because it's tractable by calculus) is squared-error loss, where in the case that **Y** is one-dimensional, *L*(**t**, **s**) = (**t** - **s**)^{2}.

If **t** is the true value of the parameter and our estimator gives *G*(**d**), we lose

*L*(**t**, *G*(**d**)).

Because the data contain random noise, the value *L*
takes is random. Let **P** be the probability
distribution of the random data errors. We define

*R*(**t**, *G*) = **E**_{P}(*L*(**t**, *G*(**d**))),

the expected value of the loss as the data errors vary.
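
The risk defined above can be approximated by simulation. A minimal sketch, with assumptions not taken from the text: the estimator (clipping the datum to a constraint interval), the true value t = 0.5, and standard Gaussian noise are all made up for illustration:

```python
import random

random.seed(0)

def G(d, b=1.0):
    """A hypothetical estimator: clip the datum to the constraint interval [-b, b]."""
    return max(-b, min(b, d))

def risk(t, n_trials=200_000):
    """Monte Carlo estimate of R(t, G) = E_P[L(t, G(d))] for squared-error loss,
    with data d = t + (standard Gaussian error)."""
    total = 0.0
    for _ in range(n_trials):
        d = t + random.gauss(0.0, 1.0)   # datum = truth + random noise
        total += (t - G(d)) ** 2         # squared-error loss for this realization
    return total / n_trials

print(round(risk(0.5), 2))
```

Averaging the loss over many simulated noise draws approximates the expectation **E**_{P}; the truth t is held fixed throughout, which is exactly the frequentist picture.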

Let *T* be a probability distribution on the parameter (theory) space **Y**. The *Bayes risk* of *G* for the prior *T* is

*R*(*T*, *G*) = **E**_{P,T}(*L*(**t**, *G*(**d**))),

where **E**_{P,T} is expectation with respect to the joint distribution of the parameter and the data error.

The procedure *G*^{B} = *G*^{B}(*T*, *L*) that minimizes the Bayes risk (when it exists) is the *Bayes procedure* for that prior and loss.

It's a theorem that for squared-error loss (*L*(**t**, *G*(**d**)) = (**t** - *G*(**d**))^{2}), the Bayes procedure is the posterior mean.
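
That theorem is easy to check numerically in a toy case. A sketch with a made-up three-point posterior (the values and weights are arbitrary): grid-searching for the announcement that minimizes posterior expected squared-error loss recovers the posterior mean:

```python
# A made-up discrete posterior over three parameter values (illustration only).
values = [0.0, 1.0, 3.0]
weights = [0.5, 0.3, 0.2]   # posterior probabilities; they sum to 1

def expected_loss(s):
    """Posterior expected squared-error loss of announcing the estimate s."""
    return sum(w * (t - s) ** 2 for t, w in zip(values, weights))

# Grid search for the minimizer of the posterior expected loss...
grid = [i / 1000.0 for i in range(3001)]
best = min(grid, key=expected_loss)

# ...and compare with the posterior mean.
post_mean = sum(w * t for t, w in zip(values, weights))
print(round(best, 3), round(post_mean, 3))   # 0.9 0.9
```

The same argument works in general: the quadratic s |--> E[(t - s)^2 | d] is minimized at s = E[t | d].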

However, the risk obviously depends on the prior as well as the loss function and the estimator. How much does the risk vary for different priors? Is the risk largest for "uninformative" priors?

Clearly, we can define a prior that makes the Bayes risk zero: just make the prior a point mass at a particular theory. For other priors, the risk can be higher. How much higher? Does small Bayes risk in a particular problem imply that the parameter is constrained well by the data, or could it be an artifact of the prior?

Suppose we know *a priori *that the true
parameter **t** is in the constraint set **C**,
a subset of **Y**.

In the Bayesian world, our prior *T* would presumably assign probability one to **C**.

In the frequentist world, a standard approach is
to consider the largest loss a given estimator has over the set **C**
of *a priori *acceptable models; that is, for an estimator
*G* we consider

*R*^{*}(**C**,
*G*) = sup{*R*(**t**, *G*): **t**
is in **C**}.

The procedure with the smallest maximum expected loss (when it exists) is the *minimax procedure* for that constraint set and loss functional.

Suppose we observe X~N(a,1), and we know *a
priori* that -b <= a <= b.

This is a one-dimensional version of a canonical inverse problem, for example, estimating the magnetic field using the constraint that the rest mass of the energy stored in the field is less than Earth's mass.

It can be shown (See Donoho, 1994, and references therein) that the minimax risk for squared-error loss for this problem is

*R*^{**} >= (4/5)×b^{2}/(b^{2} + 1).

For a uniform prior on [-b, b], the posterior density of a given the observation x is

f(x-a)**1**_{-b<=a<=b}/(F(x+b)-F(x-b)),

where f(x) is the Gaussian density, and F(x) is the Gaussian cumulative distribution function (cdf).

Using Monte-Carlo simulation (1000 repetitions) to calculate the Bayes risk, I found

| b   | Bayes risk (simulated) | Minimax lower bound |
|-----|------------------------|---------------------|
| 0.5 | 0.08                   | 0.16                |
| 1   | 0.25                   | 0.40                |
| 2   | 0.55                   | 0.64                |

The Bayes risk for the uniform prior is considerably smaller than the minimax risk. This can be interpreted to say that the impression of small uncertainty (as measured by mean squared error) in the Bayes estimate arises from the prior, rather than simply the "hard" constraint that -b <= a <= b.
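
The simulation behind the table can be sketched as follows; the posterior mean under the uniform prior is a truncated-normal mean, computed from the Gaussian density and cdf via the standard formula. The function names and repetition count are my own choices, and small runs will fluctuate around the tabulated values:

```python
import math
import random

random.seed(1)

def phi(z):
    """Standard Gaussian density f(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard Gaussian cdf F(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_mean(x, b):
    """Mean of the posterior f(x-a) 1_{-b<=a<=b} / (F(x+b) - F(x-b)):
    the mean of a N(x, 1) distribution truncated to [-b, b]."""
    z = Phi(b - x) - Phi(-b - x)          # equals F(x+b) - F(x-b)
    return x + (phi(-b - x) - phi(b - x)) / z

def bayes_risk(b, n=100_000):
    """Monte Carlo Bayes risk of the posterior mean, uniform prior on [-b, b]."""
    total = 0.0
    for _ in range(n):
        a = random.uniform(-b, b)         # draw the truth from the prior
        x = a + random.gauss(0.0, 1.0)    # observe X ~ N(a, 1)
        total += (posterior_mean(x, b) - a) ** 2
    return total / n

for b in (0.5, 1.0, 2.0):
    lower = 0.8 * b * b / (b * b + 1.0)   # (4/5) b^2/(b^2+1) minimax lower bound
    print(b, round(bayes_risk(b), 2), round(lower, 2))
```

With enough repetitions the simulated Bayes risk lands below the minimax lower bound for each b, as in the table.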

**In a wide range of problems, the minimax
estimator is the Bayes estimator for a particular prior, the ****least
favorable prior****.**

That is, there is a prior for which the Bayes risk is the same as the frequentist minimax computation would give; furthermore, that risk is at least as large as the risk under any other prior.

This says that if the minimax risk is small, so is the Bayes risk for every possible prior, while if the minimax risk is large, the choice of priors matters a great deal.

**Rarely is the least favorable prior a
flat, "uninformative" prior.**

Thus the frequentist calculation of a minimax risk tells us the extent to which in a Bayesian analysis the choice of prior determines the apparent uncertainty.

It is often asserted in Bayesian and frequentist inversions that it's OK to truncate the model space to a finite-dimensional one, because data sets have limited resolution.

In both Bayesian and frequentist inversions,
truncating the model space can give the erroneous impression that
one can estimate properties of the model that are actually not at
all constrained by the data---the finite apparent uncertainty is
entirely an artifact of the parametrization, which is typically *ad
hoc*.

No matter how much we would like to be able to estimate something, the wish does not make it possible.

The heuristics applied are:

- "if the data don't resolve it, you don't need to model it" (the ostrich hypothesis)
- "if you use smooth [simple, coarse-grained, blurred, ...] models in an inversion, you get a smoothed [simple, coarse-grained, blurred, ...] image of the Earth"

There is a profound difference between what you can't see and what isn't there.

Smoothed models aren't necessarily "blurry" versions of the truth. A "smoothed" (bandpassed) Dirac delta function has structure everywhere. Similarly, the truncated spherical harmonic expansion of a "bump" on the core-mantle boundary has topography all over the core.
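
A quick check of the bandpassed-delta claim: summing the truncated Fourier series of a periodic delta at 0 and evaluating it halfway around the circle gives a value far from zero (the truncation order N = 20 is an arbitrary choice):

```python
import math

def truncated_delta(theta, N=20):
    """Partial Fourier sum of a periodic Dirac delta at theta = 0:
    (1/2pi) sum_{n=-N..N} e^{i n theta} = (1/2pi)(1 + 2 sum_{n=1..N} cos(n theta))."""
    return (1.0 + 2.0 * sum(math.cos(n * theta)
                            for n in range(1, N + 1))) / (2.0 * math.pi)

# Far from the spike at theta = 0, the "smoothed" delta is not zero:
val = truncated_delta(math.pi)   # halfway around the circle
print(round(val, 3))             # 0.159, i.e. 1/(2*pi) -- not zero
```

Truncating the expansion spreads the spike over the whole circle: the bandlimited image has "topography" everywhere, not just a blurred bump where the spike was.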

Backus's work on inverse problems (see especially 1989, 1996) is all geared to avoiding or accounting for the effects of discretization; see also Stark (1992b, c) and Genovese and Stark (1996). These papers also present frequentist joint (simultaneous) confidence intervals for multiple parameters.

**Bibliography**

- Backus, G.E., 1987. Isotropic probability measures in infinite-dimensional spaces, *PNAS*, **84**, 8755-8757.
- Backus, G.E., 1988a. Bayesian inference in geomagnetism, *Geophys. J.*, **92**, 125-142.
- Backus, G.E., 1988b. Comparing hard and soft prior bounds in geophysical inverse problems, *Geophys. J.*, **94**, 249-261.
- Backus, G.E., 1989. Confidence set inference with a prior quadratic bound, *Geophys. J.*, **97**, 119-150.
- Backus, G.E., 1996. Truncation and procrastination as inverse techniques, *PEPI*, **98**.
- Donoho, D.L., 1994. Statistical estimation and optimal recovery, *Ann. Stat.*, **22**, 238-270.
- Genovese, C.R. and Stark, P.B., 1996. Data reduction and statistical consistency in linear inverse problems, *PEPI*, **98**, 143-162.
- Good, I.J., 1965. *The Estimation of Probabilities: An Essay on Modern Bayesian Methods*, Research Monograph 30, MIT Press, Cambridge, MA.
- Le Cam, L., 1986. *Asymptotic Methods in Statistical Decision Theory*, Springer-Verlag, New York.
- Stark, P.B., 1992a. Affine minimax confidence intervals for a bounded Normal mean, *Stat. & Prob. Lett.*, **13**, 39-44.
- Stark, P.B., 1992b. Minimax confidence intervals in geomagnetism, *Geophys. J. Intl.*, **108**, 329-338.
- Stark, P.B., 1992c. Inference in infinite-dimensional inverse problems: Discretization and duality, *J. Geophys. Res.*, **97**, 14,055-14,082.

Revised: 17 June 1997.