# 11.1 Overview of Applied MCMC

## 11.1.1 Summary

We give a brisk summary here, and expand upon some main ideas (the boldface phrases) in section 11.1.2.

Given a probability distribution $\pi$ on a space $S$, and a numerical quantity associated with $\pi$ (for instance, the mean $\bar{g}:=\sum_{x}\pi(x)g(x)$ or $:=\int g\ d\pi$, for specified $g:S\to R$), how can one estimate the numerical quantity using Monte Carlo (i.e. randomized algorithm) methods?

Asking such a question implicitly assumes we do not have a solution using mathematical analysis or efficient deterministic numerical methods. Exact Monte Carlo sampling presumes the ability to sample exactly from the target distribution $\pi$, enabling one to simulate an i.i.d. sequence $(X_{i})$ and then use classical statistical estimation, e.g. estimate $\bar{g}$ by $n^{-1}\sum_{i=1}^{n}g(X_{i})$. Where implementable, such exact sampling will typically be the best randomized algorithm. For one-dimensional distributions and a host of special distributions on higher-dimensional space or combinatorial structures, exact sampling methods have been devised. But it is unrealistic to expect there to be any exact sampling method which is effective in all settings. Markov Chain Monte Carlo sampling is based on the following idea.

First devise a Markov chain on $S$ whose stationary distribution is $\pi$. Simulate $n$ steps $X_{1},\ldots,X_{n}$ of the chain. Treat $X_{\tau^{*}},X_{\tau^{*}+1},\ldots,X_{n}$ as dependent samples from $\pi$ (where $\tau^{*}$ is some estimate of some mixing time) and then use these samples in a statistical estimator of the desired numerical quantity, where the confidence interval takes the dependence into account.

Variations of this basic idea include running multiple chains and introducing auxiliary variables (i.e. defining a chain on some product space $S\times A$). The basic scheme and variations are what make up the field of MCMC. Though there is no a priori reason why one must use reversible chains, in practice the need to achieve a target distribution $\pi$ as stationary distribution makes general constructions using reversibility very useful.

MCMC originated in statistical physics, but mathematical analysis of its uses there are too sophisticated for this book, so let us think instead of Bayesian statistics with high-dimensional data as the prototype setting for MCMC. So imagine a point $x\in R^{d}$ as recording $d$ numerical characteristics of an individual. So data on $n$ individuals is represented as a $n\times d$ matrix ${\bf x}=(x_{ij})$. As a model, we first take a parametric family $\phi(\theta,x)$ of probability densities; that is, $\theta\in R^{p}$ is a $p$-dimensional parameter and for each $\theta$ the function $x\to\phi(\theta,x)$ is a probability density on $R^{d}$. Finally, to make a Bayes model we take $\theta$ to have some probability density $h(\theta)$ on $R^{p}$. So the probability model for the data is: first choose $\theta$ according to $h(\cdot)$, then choose $(x_{i\cdot})$ i.i.d. with density $\phi(\theta,x)$. So there is a posterior distribution on $\theta$ specified by

 $f_{{\bf x}}(\theta):=\frac{h(\theta)\ \prod_{i=1}^{n}\phi(\theta,x_{i\cdot})}{% z_{{\bf x}}}$ (11.1)

where $z_{{\bf x}}$ is the normalizing constant. Our goal is to sample from $f_{{\bf x}}(\cdot)$, for purposes of e.g. estimating posterior means of real-valued parameters. An explicit instance of (11.1) is the hierarchical Normal model, but the general form of (11.1) exhibits features that circumscribe the type of chains it is feasible to implement in MCMC, as follows.

(i) Though the underlying functions $\phi(\cdot,\cdot),h(\cdot)$ which define the model may be mathematically simple, our target distribution $f_{{\bf x}}(\cdot)$ depends on actual numerical data (the data matrix ${\bf x}$), so it is hard to predict, and dangerous to assume, global regularity properties of $f_{{\bf x}}(\cdot)$.

(ii) The normalizing constant $z_{{\bf x}}$ is hard to compute, so we want to define chains which can be implemented without calculating $z_{{\bf x}}$.

The wide range of issues arising in MCMC can loosely be classified as “design” or “analysis” issues. Here “design” refers to deciding which chain to simulate, and “analysis” involves the interpretation of results. Let us start by discussing design issues. The most famous general-purpose method is the Metropolis scheme, of which the following is a simple implementation in setting (11.1). Fix a length scale parameter $l$. Define a step $\theta\to\theta^{(1)}$ of a chain as follows.

Pick $i$ uniformly from $\{1,2,\ldots,p\}$.

Pick $U$ uniformly from $[\theta_{i}-l,\theta_{i}+l]$.

Let $\theta^{\prime}$ be the $p$-vector obtained from $\theta$ by changing the $i$’th coordinate to $U$.

With probability $\min(1,f_{{\bf x}}(\theta^{\prime})/f_{{\bf x}}(\theta))$ set $\theta^{(1)}=\theta^{\prime}$;  else set $\theta^{(1)}=\theta$.

The target density enters the definition only via the ratios $f_{{\bf x}}(\theta^{\prime})/f_{{\bf x}}(\theta)$, so the value of $z_{{\bf x}}$ is not needed. The essence of a Metropolis scheme is that there is a proposal chain which proposes a move $\theta\to\theta^{\prime}$, and then an acceptance/rejection step which accepts or rejects the proposed move. See section 11.2.1 for the general definition, and proof that the stationary distribution is indeed the target distribution. There is considerable flexibility in the choice of proposal chain. One might replace the uniform proposal step by a Normal or symmetrized exponential or Cauchy jump; one might instead choose a random (i.e. isotropic) direction and propose to step some random distance in that direction (to make an isotropic Normal step, or a step uniform within a ball, for instance). There is no convincing theory to say which of these choices is better in general. However, in each proposal chain there is some length scale parameter $l$: there is a trade-off between making $l$ too small (proposals mostly accepted, but small steps imply slow mixing) and making $l$ too large (proposals rarely accepted), and in section 11.5 we give some theory (admittedly in an artificial setting) which does give guidance on choice of $l$.

The other well-known general MCMC method is exemplified by the Gibbs sampler. In the setting of (11.1), for $\theta=(\theta_{1},\ldots,\theta_{p})$ and $1\leq j\leq p$ write

 $f_{{\bf x},j,\theta}(v)=f_{{\bf x}}(\theta_{1},\ldots,\theta_{j-1},v,\theta_{j% +1},\ldots,\theta_{p}).$

A step $\theta\to\theta^{(1)}$ of the Gibbs sampler is defined as follows.

Pick $j$ uniformly from $\{1,2,\ldots,p\}$.

Pick $V$ from the density on $R^{1}$ proportional to $f_{{\bf x},j,\theta}(v)$.

Let $\theta^{(1)}$ be $\theta$ with its $j$’th coordinate replaced by $V$.

The heuristic appeal of the Gibbs sampler, compared to a Metropolis scheme, is that in the latter one typically considers only small proposal moves (lest proposals be almost always rejected) whereas in the Gibbs sampler one samples over an infinite line, which may permit larger moves. The disadvantage is that sampling along the desired one-dimensional line may not be easy to implement (see section 11.1.2). Closely related to the Gibbs sampler is the hit-and-run sampler, where one takes a random (isotropic) direction line instead of a coordinate line; section 11.2.2 abstracts the properties of such line samplers, and section 11.3 continues this design topic to discuss more complex designs of chains which attain a specified target distribution as their stationary distribution.

We now turn to analysis issues, and focus on the simplest type of problem, obtaining an estimate for an expectation $\bar{g}=\sum g({\bf x})\pi({\bf x})$ using an irreducible chain $(X_{t})$ designed to have stationary distribution $\pi$. How do we obtain an estimate, and how accurate is it? The most straightforward approach is single-run estimation. The asymptotic variance rate is

 $\sigma^{2}:=\lim_{t\to\infty}t^{-1}{\rm var}\ \left(\sum_{s=1}^{t}g(X_{s})% \right)=\sum_{s=-\infty}^{\infty}\mbox{cov}_{\pi}(g(X_{0}),g(X_{s})).$ (11.2)

So simulate a single run of the chain, from some initial state, for some large number $t$ of steps. Estimate $\bar{g}$ by

 $\hat{g}=\frac{1}{t-t_{0}}\sum_{i=t_{0}+1}^{t}g(X_{i})$ (11.3)

and estimate the variance of $\hat{g}$ by $(t-t_{0})^{-1}\hat{\sigma}^{2}$, and report a confidence interval for $\hat{g}$ by assuming $\hat{g}$ has Normal distribution with mean $\bar{g}$ and the estimated variance. Here $\hat{\sigma}^{2}$ is an estimate of $\sigma^{2}$ obtained by treating the sample covariances $\hat{\gamma}_{s}$ (i.e. the covariance of the data-set $(g(X_{i}),g(X_{i+s}));0\leq i\leq t-s$) as estimators of $\gamma_{s}=\mbox{cov}_{\pi}(g(X_{0}),g(X_{s}))$. And the burn-in time $t_{0}$ is chosen as a time after which the $\hat{\gamma}_{s}$ become small.

Though the practical relevance of theoretical mixing time parameters is debatable, one can say loosely that single-run estimates based on $t$ steps will work fine if $t$ is large compared to the relaxation time $\tau_{2}$. The difficulty is that in practical MCMC problems we do not know, or have reasonable upper bounds on, $\tau_{2}$, nor can we estimate $\tau_{2}$ rigorously from simulations. The difficulty in diagnosing convergence from simulations is the possibility of metastability error caused by multimodality. Using statistical physics imagery, the region around each mode is a potential well, and the stationary distribution conditioned to a potential well is a metastable distribution. Believing that a simulation reaches the stationary distribution when in fact it only reaches a metastable distribution is the metastability error.

The simplest way to try to guard against metastability error is the multiple trials diagnostic. Here we run $k$ independent copies of the chain from different starting states, each for $t$ steps. One diagnostic is to calculate the $k$ sample averages $\hat{g}_{j}$, and check that the empirical s.d. of these $k$ averages is consistent with the estimated s.d. $(t-t_{0})^{-1/2}\hat{\sigma}$. Intuitively, one chooses the initial states to be “overdispersed”, i.e. more spread out than we expect the target distribution to be; passing the diagnostic test gives us some reassurance against metastability error (if there were different potential wells, we hope our runs would find more than one well, and that different behavior of $g$ on different wells would be manifest).

Of course, if one intends to perform such diagnostics it makes sense to start out doing the $k$ multiple runs. A more elaborate procedure is to divide $[0,t]$ into $L$ successive blocks, and seek to check whether the $kL$ blocks “look similar”. This can be treated as a classical topic in statistics (“analysis of variance”). In brief, we compute the sample mean $\hat{g}_{i,j}$ and sample variance $\hat{\sigma}^{2}_{i,j}$ for the $j$’th block of the $i$’th simulation, and see if this data (perhaps after deleting the first few blocks of each simulation) is consistent with the blocks being i.i.d.. If so, we use the overall average as an estimator of $\bar{g}$, and estimate the accuracy of this estimator by assuming the blocks were independent.

If a multiple-runs diagnostic fails, or if one lacks confidence in one’s ability to choose a small number of starting points which might be attracted to different nodes (if such existed), then one can seek schemes specially adapted to multimodal target densities. Because it is easy to find local maxima of a target density $f$, e.g. by a deterministic hill-climbing algorithm, one can find modes by repeating such an algorithm from many initial states, to try to find an exhaustive list of modes with relatively high $f$-values. This is mode-hunting; one can then design a chain tailored to jump between the wells with non-vanishing probabilities. Such methods are highly problem-specific; more general methods (such as the multi-level or multi-particle schemes of sections 11.3.3 and 11.3.4) seek to automate the search for relevant modes within MCMC instead of having a separate mode-hunting stage.

In seeking theoretical analysis of MCMC one faces an intrinsic difficulty: MCMC is only needed on “hard” problems, but such problems are difficult to study. In comparing effectiveness of different variants of MCMC it is natural to say “forget about theory – just see what works best on real examples”. But such experimental evaluation is itself conceptually difficult: pragmatism is easier in theory than in practice!

## 11.1.2 Further aspects of applied MCMC

Sampling from one-dimensional distributions. Consider a probability distribution $\mu$ on $R^{1}$ with density function $f$ and and distribution function $F$. In one sense, sampling from $\mu$ is easy, because of the elementary result that $F^{-1}(U)$ has distribution $\mu$, where $U$ is uniform on $[0,1]$ and $x=F^{-1}(u)$ is the inverse function of $u=F(x)$. In cases where we have an explicit formula for $F^{-1}$, we are done. Many other cases can be done using rejection sampling. Suppose there is some other density $g$ from which we can sample by the inverse distribution function method, and suppose we know a bound $c\geq\sup_{x}f(x)/g(x)$. Then the algorithm

propose a sample $x$ from $g(\cdot)$;

accept $x$ with probability $\frac{f(x)}{cg(x)}$; else propose a new sample from $g$

produces an output with density $f(\cdot)$ after mean $c$ steps. By combining these two methods, libraries of algorithms for often-encountered one-dimensional distributions can be built, and indeed exist in statistical software packages.

But what about a general density $f(x)$? If we need to sample many times from the same density, it is natural to use deterministic numerical methods. First probe $f$ at many values of $x$. Then either

(a) build up a numerical approximation to $F$ and thence to $F^{-1}$; or

(b) choose from a library a suitable density $g$ and use rejection sampling.

The remaining case, which is thus the only “hard” aspect of sampling from one-dimensional distributions, is where we only need one sample from a general distribution. In other words, where we want many samples which are all from different distributions. This is exactly the setting of the Gibbs sampler where the target multidimensional density is complicated, and thus motivates some of the variants we discuss in section 11.3.

Practical relevance of theoretical mixing time parameters. Standard theory from Chapter 4 (yyy cross-refs) relates $\tau_{2}$ to the asymptotic variance rate $\sigma^{2}(g)$ at (11.2) for the “worst-case” $g$:

 $\tau_{2}=\frac{1}{1-\lambda_{2}}\approx\frac{1+\lambda_{2}}{1-\lambda_{2}}=% \sup_{g}\frac{\sigma^{2}(g)}{{\rm var}\ _{\pi}g}.$ (11.4)

Moreover Proposition 29 of Chapter 4 (yyy 10/11/94 version) shows that $\sigma^{2}(g)$ also appears in an upper bound on variances of finite-time averages from the stationary chain. So in asking how long to run MCMC simulations, a natural principle (not practical, of course, because we typically don’t know $\tau_{2}$) is

base estimates on $t$ steps, where $t$ is a reasonable large multiple of $\tau_{2}$.

But this principle can be attacked from opposite directions. It is sometimes argued that worrying about $\tau_{2}$ (corresponding to the worst-case $g$) is overly pessimistic in the context of studying some specific $g$. For instance, Sokal [311] p. 8 remarks that in natural statistical physics models on the infinite lattice near a phase transition in a parameter $\theta$, as $\theta$ tends to the critical point the growth exponent of $\sigma^{2}(g)$ for “interesting” $g$ is typically different from the growth exponent of $\tau_{2}$. Madras and Slade [252] p. 326 make similar remarks in the context of the pivot algorithm for self-avoiding walk. But we do not know similar examples in the statistical $R^{d}$ setting. In particular, in the presence of multimodality such counterexamples would require that $g$ be essentially “orthogonal” to the differences between modes, which seems implausible.

Burn-in, the time $t_{0}$ excluded from the estimator (11.3) to avoid undue influence of initial state, is conceptually more problematic. Theory says that taking $t_{0}$ as a suitable multiple of $\tau_{1}$ would guarantee reliable estimates. The general fact $\tau_{1}\geq\tau_{2}$ then suggests that allowing sufficient burn-in time is a stronger requirement than allowing enough “mixing” for the stationary chain – so the principle above is overly optimistic. On the other hand, because it refers to worst-case initial state, requiring a burn-in time of $\tau_{1}$ seems far too conservative in practice. The bottom line is that one cannot eliminate the possibility of metastability error; in general, all one gets from multiple-runs and diagnostics is confidence that one is sampling from a single potential well, in the imagery below (though section 11.6.2 indicates a special setting where we can do better).

Statistical physics imagery. Any probability distribution $\pi$ can be written as

 $\pi(x)\propto\exp(-H(x)).$

One can call $H$ a potential function; note that a mode (local maximum) of $\pi$ is a local minimum of $H$. One can envisage a realization of a Markov chain as a particle moving under the influence of both a potential function (the particle responds to some “force” pushing it towards lower values of $H$) and random noise. Associated with each local minimum $y$ of $H$ is a potential well, which we envisage as the set of points which under the influence of the potential only (without noise) the particle would move to $y$ (in terms of $\pi$, states from which a “steepest ascent” path leads to $y$).

A fundamental intuitive picture is that the main reason why a reversible chain may relax slowly is that there is more than one potential well, and the chain takes a long time to move from one well to another. In such a case, $\pi$ conditioned to a single potential well will be a metastable (i.e. almost-stationary) distribution. One expects the chain’s distribution, from any initial state, to reach fairly quickly one (or a mixture) of these metastable distributions, and then the actual relaxation time to stationarity is dominated by the times taken to move between wells. In more detail, if there are $w$ wells then one can consider, as a coarse-grained approximation, a $w$-state continuous-time chain where the transition rates $w_{1}\to w_{2}$ are the rates of moving from well $w_{1}$ to well $w_{2}$. Then $\tau_{2}$ for the original chain should be closely approximated by $\tau_{2}$ for the coarse-grained chain.

The hierarchical Normal model. As a very simple instance of (11.1), take $d=1,p=2$ and $x\to\phi(\mu,\sigma^{2},x)$ the Normal$(\mu,\sigma^{2})$ density. Then let $(\mu,\sigma)$ be chosen independently for each individual from some joint density $h(\mu,\sigma)$ on $R\times R^{+}$. The data is an $n$-vector ${\bf x}=(x_{1},\ldots,x_{n})$ and the full posterior distribution is

 $f_{{\bf x}}(\mu_{1},\ldots,\mu_{n},\sigma_{1},\ldots,\sigma_{n})=z_{{\bf x}}^{% -1}\prod_{i=1}^{n}h(\mu_{i},\sigma_{i})\phi(\mu_{i},\sigma_{i}^{2},x_{i}).$

Typically we are interested in a posterior mean of $\mu_{i}$ for fixed $i$, that is $\bar{g}$ for

 $g(\mu_{1},\ldots,\mu_{n},\sigma_{1},\ldots,\sigma_{n}):=\mu_{i}.$

Pragmatism is easier in theory than in practice. In comparing MCMC methods experimentally, one obvious issue is the choice of example to study. Another issue is that, if we measure “time” as “number of steps”, then a step of one chain may not be comparable with a step of another chain. For instance, a Metropolis step is typically easier to implement than a Gibbs step. More subtlely, in combinatorial examples there may be different ways to set up a data structure to represent the current state in a way that permits easy computation of $\pi$-values. The alternative of measuring “time” as CPU time introduces different problems – details of coding matter.