Representing Certainties in Uncertainty Quantification: Constraints Versus Priors

P.B. Stark, Department of Statistics, UC Berkeley

JSM 2020, 5 August 2020

Somewhere (everywhere?) on the Internet

  • Is Statistics something you do to data? Is it procedural?
  • Ideally, it's a way of thinking that helps you avoid fooling yourself & others.
  • In many disciplines, "Statistics" is calculation, not thinking.
  • A consequence of how statistics is taught and of perverse incentives: Cargo-Cult Statistics.

Problem statement

  • Want to use data $Y \in \Re^n$ to learn about the (unknown) state of the world, $\theta$, in a mathematical model of a physical system.
  • Often $\theta$ is a function of position and/or time: infinite-dimensional
  • "Know" that $\theta \in \Theta$.
  • Measurement model: if $\theta = \eta$, then $Y \sim P_\eta$.
  • There is a known measure $\mu$ that dominates $P_\eta$ for every $\eta \in \Theta$.
  • Density of $P_\eta$ at $y$ w.r.t. $\mu$ is
$$ p_\eta(y) \equiv dP_\eta/d\mu |_y. $$
  • The likelihood of $\eta$ given $Y = y$ is $p_\eta(y)$, viewed as a function of $\eta$.
  • Typically impossible to estimate $\theta$ with any useful level of accuracy (maybe not even identifiable).
  • Generally possible and scientifically interesting to estimate some parameter $\lambda = \lambda[\theta]$.

The Bayesian Approach

  • Uses prior probability distribution $\pi$ on $\Theta$ and likelihood $p_\eta(y)$.
  • Requires $\Theta$ to be a measurable subset of a measurable space and $p_\eta(y)$ to be jointly measurable w.r.t. $\eta$ and $y$.
  • $\pi$ and $p_\eta$ imply a joint distribution of $\theta$ and $Y$.
  • Marginal distribution or predictive distribution of $Y$ is
$$ m(y) = \int_\Theta p_\eta(y) \, \pi(d\eta). $$

Updating

  • Posterior distribution of $\theta$ given $Y=y$:
$$ \pi (d\eta | Y = y) = \frac{p_\eta(y) \; \pi(d\eta) }{m(y)} . $$
  • All the information in the prior and the data is in the posterior distribution.
  • Posterior distribution $\pi_\lambda(d \ell | Y = y)$ of $\lambda[\theta]$ is induced by posterior distribution of $\theta$:
$$ P(\lambda[\theta] \in A | Y = y) = \int_{\ell \in A} \pi_\lambda(d \ell | Y = y) \equiv \int_{\eta: \lambda[\eta] \in A} \pi(d \eta | Y = y). $$
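Numerically, the update is straightforward once $\Theta$ is discretized. Below is a minimal sketch (not from the talk), assuming a scalar parameter with $Y \sim N(\theta, 1)$ and a prior supported on a grid; all names are hypothetical.

```python
# Minimal sketch of the updating step on a discretized parameter space.
# Assumed model (hypothetical): Y ~ N(eta, 1), prior pi on a grid for eta.
import numpy as np
from scipy.stats import norm

def posterior_on_grid(y, grid, prior_weights):
    """Posterior pi(eta | Y = y) on a finite grid of candidate parameter values."""
    likelihood = norm.pdf(y, loc=grid, scale=1.0)  # p_eta(y) at each grid point
    joint = likelihood * prior_weights             # p_eta(y) pi(d eta)
    m_y = joint.sum()                              # marginal (predictive) m(y)
    return joint / m_y                             # Bayes' rule

grid = np.linspace(-2.0, 2.0, 401)                 # Theta = [-2, 2], discretized
prior = np.full(grid.size, 1.0 / grid.size)        # "flat" prior on the grid
post = posterior_on_grid(0.5, grid, prior)
print(post @ grid)                                 # posterior mean of theta
```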

Why use Bayesian methods?

  • Descriptive: people are Bayesian.
  • Normative: people should be Bayesian.
  • Practical: "the data swamp the prior," so the prior doesn't matter.
  • My guess: popular because it gives a general recipe and smaller error bars than frequentist methods.
  • But error bars don't have the same meaning.

Priors

  • To use the Bayesian framework, you must quantify beliefs and constraints as a prior $\pi$.
  • Constraint $\theta \in \Theta$ captured as $\pi(\Theta) = 1$.
  • But infinitely many probability distributions assign probability 1 to $\Theta$.
  • I've never seen a Bayesian analysis of real data in which the data analyst made a serious attempt to quantify beliefs using a prior.

Priors are selected or justified in ~5 ways:

  1. to make the calculations simple
  2. because the particular prior is conventional
  3. so that the prior satisfies some invariance
  4. with the assertion that the prior is "uninformative" (e.g., Laplace's principle)
  5. because the prior roughly matches the relative frequencies of values in some population.

Frequentist v. Bayesian

Main difference:

  • Frequentists treat $\theta$ as an unknown element of $\Theta$.

  • Bayesians treat $\theta$ as if drawn at random from $\Theta$ using $\pi$.

Bayesian approach requires much stronger assumptions.

Summarizing uncertainty

Consider Bayesian and frequentist versions of 2 summaries:

  • mean squared error (frequentist) and posterior mean squared error (Bayesian)

  • confidence sets (frequentist) and credible regions (Bayesian)

Mean Squared Error and Posterior Mean Squared Error

$$ \mbox{MSE}(\widehat{\lambda}(Y), \eta) \equiv E_\eta \| \widehat{\lambda}(Y) - \lambda[\eta] \|^2. $$

$$ \mbox{PMSE}(\widehat{\lambda}(y), \pi) \equiv E_{\pi(d\eta | Y = y)} \| \widehat{\lambda}(y) - \lambda[\eta] \|^2. $$
  • MSE is an expectation with respect to the distribution of the data $Y$, holding the parameter $\theta = \eta$ fixed.

  • PMSE is an expectation with respect to the posterior distribution of $\theta$, holding the data $Y = y$ fixed.
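A minimal sketch of the contrast, assuming the bounded-normal-mean setup introduced later ($Y \sim N(\theta, 1)$, $\Theta = [-\tau, \tau]$, flat prior) and the posterior mean as the estimator; all names are hypothetical.

```python
# Hypothetical sketch: MSE averages over Y with the parameter fixed;
# PMSE averages over the posterior with the data fixed.
import numpy as np
rng = np.random.default_rng(0)
tau = 3.0
grid = np.linspace(-tau, tau, 601)

def posterior(y):
    w = np.exp(-0.5 * (y - grid) ** 2)        # likelihood x flat prior
    return w / w.sum()

def posterior_mean(y):                        # Bayes estimator of theta
    return posterior(y) @ grid

# MSE: expectation over Y ~ N(eta, 1), holding the parameter eta fixed
eta = 2.5
ys = rng.normal(eta, 1.0, size=5_000)
mse = np.mean([(posterior_mean(y) - eta) ** 2 for y in ys])

# PMSE: expectation over the posterior, holding the data y fixed
y = 2.5
pmse = posterior(y) @ (grid - posterior_mean(y)) ** 2
print(mse, pmse)
```

The two numbers generally differ, especially for $\eta$ near the boundary of $\Theta$: they answer different questions.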

Confidence Sets and Credible Regions

A random set $I(Y)$ of possible values of $\lambda$ is a $1-\alpha$ confidence set for $\lambda[\theta]$ if $$ P_\eta \{ I(Y) \ni \lambda[\eta] \} \ge 1 - \alpha, \;\; \forall \eta \in \Theta. $$

Probability w.r.t. distribution of the data $Y$, holding $\eta$ fixed.

A set $I(y)$ of possible values of $\lambda$ is a $1-\alpha$ posterior credible region for $\lambda[\theta]$ if $$ P_{\pi( d\theta | Y=y)} (\lambda[\theta] \in I(y)) \equiv \int_{I(y)} \pi_\lambda(d \ell | Y = y) \ge 1-\alpha. $$

Probability w.r.t. marginal posterior distribution of $\lambda[\theta]$, holding the data fixed.

  • credible level: the probability that nature, drawing $\theta$ from the prior, generates an element of the set, given the data

  • confidence level: probability that procedure gives a set that contains the truth

Uncertainties have completely different interpretations

  • Frequentist: hold parameter constant, characterize behavior under repeated measurement

  • Bayesian: hold measurement constant, characterize behavior under repeatedly drawing parameter at random from the prior

Duality between Bayes and minimax approaches

  • Formal Bayesian uncertainty can be made as small as desired by choosing prior appropriately.
  • Under suitable conditions, the minimax frequentist risk is equal to the Bayes risk for the "least-favorable" prior.
  • If the Bayes risk is less than the minimax risk, the prior is artificially reducing the (apparent) uncertainty. Regardless, it means something different.
  • The least-favorable prior can be approximated numerically even for "black-box" numerical models, à la Schafer & Stark (2009).
  • Posterior uncertainty measures are meaningful only if you believe the prior.

  • Changes the subject

  • Is the truth unknown? Is it a realization of a known probability distribution?

  • Where does prior come from?

    • Usually chosen for computational convenience or habit, not "physics"

    • Priors get their own literature

    • Eliciting priors is deeply problematic

    • Why should I care about your posterior, if I don't share your prior?

  • How much does prior matter?

  • Slogan: "the data swamp the prior." The underlying theorem has conditions that aren't always met.

  • Is all uncertainty random?
  • Aleatory

    • Canonical examples: coin toss, die roll, lotto, roulette
    • under some circumstances, behave "as if" random (but not perfectly)
  • Epistemic: stuff we don't know

  • The standard way to combine aleatory variability and epistemic uncertainty puts beliefs on a par with an unbiased physical measurement w/ known uncertainty.

  • It claims that, by introspection, one can estimate without bias and with known accuracy, just as if one's brain were an unbiased instrument with known accuracy.

  • Bacon put this to rest long ago, but empirically we also know:

    • people are bad at making even rough quantitative estimates
    • quantitative estimates are usually biased
    • bias can be manipulated by anchoring, priming, etc.
    • people are bad at judging weights in their hands: biased by shape & density
    • people are bad at judging when something is random
    • people are overconfident in their estimates and predictions
    • confidence is largely unconnected to actual accuracy.
    • anchoring affects entire disciplines (e.g., Millikan, c, Fe in spinach)
  • what if I don't trust your internal scale, or your assessment of its accuracy?

  • same observations that are factored in as "data" are also used to form beliefs: the "measurements" made by introspection are not independent of the data

  • Can grade Bayesian methods using frequentist criteria
  • E.g., what is the coverage probability of a credible region? (See the sketch below.)
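A hedged sketch, assuming the bounded-normal-mean toy problem introduced below ($Y \sim N(\theta, 1)$, $\Theta = [-\tau, \tau]$, flat prior); all names are hypothetical.

```python
# Hypothetical sketch: frequentist coverage of a nominal 95% credible interval
# for a bounded normal mean, estimated by simulation.
import numpy as np
rng = np.random.default_rng(1)
tau, alpha = 3.0, 0.05
grid = np.linspace(-tau, tau, 601)

def credible_interval(y):
    w = np.exp(-0.5 * (y - grid) ** 2)        # posterior on the grid (flat prior)
    cdf = np.cumsum(w / w.sum())
    lo = grid[np.searchsorted(cdf, alpha / 2)]
    hi = grid[np.searchsorted(cdf, 1 - alpha / 2)]
    return lo, hi

eta = 2.9                                      # true value near the boundary
ys = rng.normal(eta, 1.0, size=10_000)
covered = [lo <= eta <= hi for lo, hi in map(credible_interval, ys)]
print(np.mean(covered))                        # typically well below 0.95 here
```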

Two illustrations

  • Bounded normal mean

  • Election audits

Toy problem: bounded normal mean

  • Observe $Y \sim N(\theta, 1)$.

  • Know a priori that $\theta \in [-\tau, \tau]$

  • Bayes "uninformative" prior: $\theta \sim U[-\tau, \tau]$

Election audits

Check whether reported winner(s) really won by looking at random sample of ballots.

Absent convincing evidence that reported winners really won, keep looking.

Risk: the probability that the audit does not correct an incorrect reported outcome.

Constraint: vote shares are non-negative, sum of shares $\le 1$.

Risk-limiting audit (frequentist)

Keep auditing until the (sequential) $P$-value of the hypothesis that the outcome is wrong is sufficiently small.

Known maximum risk, regardless of correct result.
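A hedged sketch of a two-candidate ballot-polling audit in the spirit of BRAVO (Lindeman, Stark & Yates, 2012), not the exact procedure; names are hypothetical, and invalid ballots are ignored.

```python
# Hypothetical sketch: Wald SPRT of the null "true winner share = 1/2"
# against the reported share; stop when the sequential P-value (1/LR)
# drops to the risk limit.
import numpy as np

def audit(ballots, reported_share, risk_limit=0.05):
    lr = 1.0                                   # sequential likelihood ratio
    for n, vote_for_winner in enumerate(ballots, start=1):
        lr *= reported_share / 0.5 if vote_for_winner else (1 - reported_share) / 0.5
        if 1.0 / lr <= risk_limit:             # P-value small enough
            return n                           # stop: outcome confirmed
    return None                                # escalate to a full hand count

rng = np.random.default_rng(2)
sample = rng.binomial(1, 0.55, size=100_000)   # true winner share 55%
print(audit(sample, reported_share=0.55))      # ballots examined before stopping
```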

Bayesian audit

Keep auditing until the conditional probability that the outcome is wrong, given the data, is sufficiently small.

Requires a prior.

"Nonpartisan" prior is invariant under permutations of the candidate names: "fair."

Includes "flat" or "uninformative" prior.

Minimax expected size confidence sets

  • Among all procedures for constructing a valid $1-\alpha$ confidence set for a parameter, which has the smallest worst-case expected size?

  • Exploit duality between Bayesian and frequentist methods: least-favorable prior.

It is inappropriate to be concerned about mice when there are tigers abroad.

—George Box

Commonly ignored sources of uncertainty:

  • Coding errors (ex: Hubble)

  • Stability of optimization algorithms (ex: GONG)

  • "upstream" data reduction steps

  • Quality of PRNGs (RANDU, but still an issue)

References

  • Stark, P.B., R.L. Parker, G. Masters, and J.A. Orcutt, 1986. Strict bounds on seismic velocity in the spherical Earth, Journal of Geophysical Research, 91, 13,892–13,902.

  • Stark, P.B., 1992. Minimax confidence intervals in geomagnetism, Geophysical Journal International, 108, 329–338.

  • Stark, P.B., 1992. Inference in infinite-dimensional inverse problems: Discretization and duality, Journal of Geophysical Research, 97, 14,055–14,082. Reprint: http://onlinelibrary.wiley.com/doi/10.1029/92JB00739/epdf

  • Stark, P.B., 1993. Uncertainty of the COBE quadrupole detection, Astrophysical Journal Letters, 408, L73–L76.

  • Hengartner, N.W. and P.B. Stark, 1995. Finite-sample confidence envelopes for shape-restricted densities, The Annals of Statistics, 23, 525–550.

  • Genovese, C.R. and P.B. Stark, 1996. Data Reduction and Statistical Consistency in Linear Inverse Problems, Physics of the Earth and Planetary Interiors, 98, 143–162.

  • Tenorio, L., P.B. Stark, and C.H. Lineweaver, 1999. Bigger uncertainties and the Big Bang, Inverse Problems, 15, 329–341.
