11 Markov Chain Monte Carlo (January 8 2001)

11.1 Overview of Applied MCMC

11.1.1 Summary

We give a brisk summary here, and expand upon some main ideas (the boldface phrases) in section 11.1.2.

Abstractly, we start with the following type of problem.

Given a probability distribution $\pi$ on a space $S$, and a numerical quantity associated with $\pi$ (for instance, the mean $\bar{g} := \sum_x \pi(x) g(x)$ or $:= \int g \, d\pi$, for specified $g : S \to R$), how can one estimate the numerical quantity using Monte Carlo (i.e. randomized algorithm) methods?

Asking such a question implicitly assumes we do not have a solution using mathematical analysis or efficient deterministic numerical methods. Exact Monte Carlo sampling presumes the ability to sample exactly from the target distribution $\pi$, enabling one to simulate an i.i.d. sequence $(X_i)$ and then use classical statistical estimation, e.g. estimate $\bar{g}$ by $n^{-1} \sum_{i=1}^n g(X_i)$. Where implementable, such exact sampling will typically be the best randomized algorithm. For one-dimensional distributions and a host of special distributions on higher-dimensional spaces or combinatorial structures, exact sampling methods have been devised. But it is unrealistic to expect any exact sampling method to be effective in all settings. Markov Chain Monte Carlo sampling is based on the following idea.

First devise a Markov chain on $S$ whose stationary distribution is $\pi$. Simulate $n$ steps $X_1, \ldots, X_n$ of the chain. Treat $X_{\tau^*}, X_{\tau^*+1}, \ldots, X_n$ as dependent samples from $\pi$ (where $\tau^*$ is some estimate of a mixing time) and then use these samples in a statistical estimator of the desired numerical quantity, where the confidence interval takes the dependence into account.

Variations of this basic idea include running multiple chains and introducing auxiliary variables (i.e. defining a chain on some product space $S \times A$). The basic scheme and variations are what make up the field of MCMC. Though there is no a priori reason why one must use reversible chains, in practice the need to achieve a target distribution $\pi$ as stationary distribution makes general constructions using reversibility very useful.

MCMC originated in statistical physics, but the mathematical analysis of its uses there is too sophisticated for this book, so let us think instead of Bayesian statistics with high-dimensional data as the prototype setting for MCMC. So imagine a point $x \in R^d$ as recording $d$ numerical characteristics of an individual, so that data on $n$ individuals is represented as an $n \times d$ matrix ${\bf x} = (x_{ij})$. As a model, we first take a parametric family $\phi(\theta, x)$ of probability densities; that is, $\theta \in R^p$ is a $p$-dimensional parameter and for each $\theta$ the function $x \to \phi(\theta, x)$ is a probability density on $R^d$. Finally, to make a Bayes model we take $\theta$ to have some probability density $h(\theta)$ on $R^p$. So the probability model for the data is: first choose $\theta$ according to $h(\cdot)$, then choose $(x_{i\cdot})$ i.i.d. with density $\phi(\theta, x)$. So there is a posterior distribution on $\theta$ specified by

$$f_{\bf x}(\theta) := \frac{h(\theta)\ \prod_{i=1}^n \phi(\theta, x_{i\cdot})}{z_{\bf x}} \qquad (11.1)$$

where $z_{\bf x}$ is the normalizing constant. Our goal is to sample from $f_{\bf x}(\cdot)$, for purposes of e.g. estimating posterior means of real-valued parameters. An explicit instance of (11.1) is the hierarchical Normal model, but the general form of (11.1) exhibits features that circumscribe the type of chains it is feasible to implement in MCMC, as follows.

(i) Though the underlying functions $\phi(\cdot,\cdot), h(\cdot)$ which define the model may be mathematically simple, our target distribution $f_{\bf x}(\cdot)$ depends on actual numerical data (the data matrix ${\bf x}$), so it is hard to predict, and dangerous to assume, global regularity properties of $f_{\bf x}(\cdot)$.

(ii) The normalizing constant $z_{\bf x}$ is hard to compute, so we want to define chains which can be implemented without calculating $z_{\bf x}$.

The wide range of issues arising in MCMC can loosely be classified as “design” or “analysis” issues. Here “design” refers to deciding which chain to simulate, and “analysis” involves the interpretation of results. Let us start by discussing design issues. The most famous general-purpose method is the Metropolis scheme, of which the following is a simple implementation in setting (11.1). Fix a length scale parameter $l$. Define a step $\theta \to \theta^{(1)}$ of a chain as follows.

Pick $i$ uniformly from $\{1, 2, \ldots, p\}$.

Pick $U$ uniformly from $[\theta_i - l, \theta_i + l]$.

Let $\theta'$ be the $p$-vector obtained from $\theta$ by changing the $i$’th coordinate to $U$.

With probability $\min(1, f_{\bf x}(\theta')/f_{\bf x}(\theta))$ set $\theta^{(1)} = \theta'$; else set $\theta^{(1)} = \theta$.

The target density enters the definition only via the ratios $f_{\bf x}(\theta')/f_{\bf x}(\theta)$, so the value of $z_{\bf x}$ is not needed. The essence of a Metropolis scheme is that there is a proposal chain which proposes a move $\theta \to \theta'$, and then an acceptance/rejection step which accepts or rejects the proposed move. See section 11.2.1 for the general definition, and a proof that the stationary distribution is indeed the target distribution. There is considerable flexibility in the choice of proposal chain. One might replace the uniform proposal step by a Normal or symmetrized exponential or Cauchy jump; one might instead choose a random (i.e. isotropic) direction and propose to step some random distance in that direction (to make an isotropic Normal step, or a step uniform within a ball, for instance). There is no convincing theory to say which of these choices is better in general. However, in each proposal chain there is some length scale parameter $l$: there is a trade-off between making $l$ too small (proposals mostly accepted, but small steps imply slow mixing) and making $l$ too large (proposals rarely accepted). In section 11.5 we give some theory (admittedly in an artificial setting) which does give guidance on the choice of $l$.
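To make the scheme concrete, here is a minimal sketch of the step $\theta \to \theta^{(1)}$ in Python. The function `log_f` is a hypothetical user-supplied routine returning $\log f_{\bf x}(\theta)$ up to an additive constant; working on the log scale makes explicit that the normalizing constant $z_{\bf x}$ never enters.

```python
import numpy as np

def metropolis_step(theta, log_f, l, rng):
    """One step theta -> theta^(1) of the simple Metropolis scheme."""
    p = len(theta)
    i = rng.integers(p)                      # pick coordinate i uniformly from {0,...,p-1}
    proposal = theta.copy()
    proposal[i] = rng.uniform(theta[i] - l, theta[i] + l)   # uniform proposal U
    # accept with probability min(1, f(theta')/f(theta)), computed on the log scale
    if np.log(rng.uniform()) < log_f(proposal) - log_f(theta):
        return proposal                      # accept the proposed move
    return theta                             # reject: stay at theta
```

A simulation is then just `theta = metropolis_step(theta, log_f, l, rng)` iterated $n$ times, with `rng = np.random.default_rng()`.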

The other well-known general MCMC method is exemplified by the Gibbs sampler. In the setting of (11.1), for $\theta = (\theta_1, \ldots, \theta_p)$ and $1 \leq j \leq p$ write

$$f_{{\bf x},j,\theta}(v) = f_{\bf x}(\theta_1, \ldots, \theta_{j-1}, v, \theta_{j+1}, \ldots, \theta_p).$$

A step $\theta \to \theta^{(1)}$ of the Gibbs sampler is defined as follows.

Pick $j$ uniformly from $\{1, 2, \ldots, p\}$.

Pick $V$ from the density on $R^1$ proportional to $f_{{\bf x},j,\theta}(v)$.

Let $\theta^{(1)}$ be $\theta$ with its $j$’th coordinate replaced by $V$.

The heuristic appeal of the Gibbs sampler, compared to a Metropolis scheme, is that in the latter one typically considers only small proposal moves (lest proposals be almost always rejected) whereas in the Gibbs sampler one samples over an infinite line, which may permit larger moves. The disadvantage is that sampling along the desired one-dimensional line may not be easy to implement (see section 11.1.2). Closely related to the Gibbs sampler is the hit-and-run sampler, where one takes a random (isotropic) direction line instead of a coordinate line; section 11.2.2 abstracts the properties of such line samplers, and section 11.3 continues this design topic to discuss more complex designs of chains which attain a specified target distribution as their stationary distribution.
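As an illustration of the implementation issue just mentioned, here is a sketch of a Gibbs step in which the one-dimensional conditional is sampled by crude numerical inversion on a grid. The grid endpoints `lo, hi`, the resolution `m`, and `log_f` (the unnormalized log-density, as in the Metropolis sketch) are all assumptions of this sketch; the grid stage is exactly the part that may be expensive or inaccurate in practice.

```python
import numpy as np

def gibbs_step(theta, log_f, lo, hi, rng, m=1000):
    """One step theta -> theta^(1): resample a uniformly chosen coordinate."""
    j = rng.integers(len(theta))               # pick coordinate j uniformly
    grid = np.linspace(lo, hi, m)
    logs = np.empty(m)
    for k, v in enumerate(grid):               # probe f along the j'th coordinate line
        t = theta.copy()
        t[j] = v
        logs[k] = log_f(t)
    w = np.exp(logs - logs.max())              # unnormalized conditional density values
    cdf = np.cumsum(w) / w.sum()               # approximate distribution function
    out = theta.copy()
    out[j] = grid[np.searchsorted(cdf, rng.uniform())]   # inverse-CDF sample V
    return out
```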

We now turn to analysis issues, and focus on the simplest type of problem: obtaining an estimate for an expectation $\bar{g} = \sum g({\bf x}) \pi({\bf x})$ using an irreducible chain $(X_t)$ designed to have stationary distribution $\pi$. How do we obtain an estimate, and how accurate is it? The most straightforward approach is single-run estimation. The asymptotic variance rate is

$$\sigma^2 := \lim_{t\to\infty} t^{-1} \, {\rm var}\left(\sum_{s=1}^t g(X_s)\right) = \sum_{s=-\infty}^{\infty} {\rm cov}_\pi(g(X_0), g(X_s)). \qquad (11.2)$$

So simulate a single run of the chain, from some initial state, for some large number $t$ of steps. Estimate $\bar{g}$ by

$$\hat{g} = \frac{1}{t-t_0} \sum_{i=t_0+1}^{t} g(X_i) \qquad (11.3)$$

and estimate the variance of $\hat{g}$ by $(t-t_0)^{-1}\hat{\sigma}^2$, and report a confidence interval for $\hat{g}$ by assuming $\hat{g}$ has Normal distribution with mean $\bar{g}$ and the estimated variance. Here $\hat{\sigma}^2$ is an estimate of $\sigma^2$ obtained by treating the sample covariances $\hat{\gamma}_s$ (i.e. the covariances of the data-set $(g(X_i), g(X_{i+s})),\ 0 \leq i \leq t-s$) as estimators of $\gamma_s = {\rm cov}_\pi(g(X_0), g(X_s))$. And the burn-in time $t_0$ is chosen as a time after which the $\hat{\gamma}_s$ become small.
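Here is a minimal sketch of this estimator, assuming the autocovariance sum in (11.2) is truncated at a lag `s_max` beyond which the $\hat{\gamma}_s$ are treated as negligible; `gx` is the simulated sequence $g(X_1), \ldots, g(X_t)$.

```python
import numpy as np

def single_run_estimate(gx, t0, s_max):
    """Estimate g_bar by (11.3) and the variance of the estimate via (11.2)."""
    y = np.asarray(gx, dtype=float)[t0:]       # discard the burn-in segment
    n = len(y)
    g_hat = y.mean()                           # the estimator (11.3)
    yc = y - g_hat
    # sigma^2 ~ gamma_0 + 2 * sum_{s=1}^{s_max} gamma_s, by symmetry in (11.2)
    sigma2 = yc.var()
    for s in range(1, s_max + 1):
        sigma2 += 2.0 * np.dot(yc[:-s], yc[s:]) / n    # sample covariance gamma_hat_s
    return g_hat, sigma2 / n                   # estimate and its estimated variance
```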

Though the practical relevance of theoretical mixing time parameters is debatable, one can say loosely that single-run estimates based on $t$ steps will work fine if $t$ is large compared to the relaxation time $\tau_2$. The difficulty is that in practical MCMC problems we do not know, or have reasonable upper bounds on, $\tau_2$, nor can we estimate $\tau_2$ rigorously from simulations. The difficulty in diagnosing convergence from simulations is the possibility of metastability error caused by multimodality. Using statistical physics imagery, the region around each mode is a potential well, and the stationary distribution conditioned to a potential well is a metastable distribution. Believing that a simulation has reached the stationary distribution when in fact it has only reached a metastable distribution is the metastability error.

The simplest way to try to guard against metastability error is the multiple trials diagnostic. Here we run $k$ independent copies of the chain from different starting states, each for $t$ steps. One diagnostic is to calculate the $k$ sample averages $\hat{g}_j$, and check that the empirical s.d. of these $k$ averages is consistent with the estimated s.d. $(t-t_0)^{-1/2}\hat{\sigma}$. Intuitively, one chooses the initial states to be “overdispersed”, i.e. more spread out than we expect the target distribution to be; passing the diagnostic test gives us some reassurance against metastability error (if there were different potential wells, we hope our runs would find more than one well, and that different behavior of $g$ on different wells would be manifest).
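A minimal sketch of this check, reusing the hypothetical `single_run_estimate` sketch above on a list `runs` of $k$ trajectories of $g$-values:

```python
import numpy as np

def multiple_trials_diagnostic(runs, t0, s_max):
    """Compare the spread of the k run averages with the single-run s.d. estimate."""
    results = [single_run_estimate(gx, t0, s_max) for gx in runs]
    g_hats = np.array([g for g, _ in results])
    est_sd = np.sqrt(np.mean([v for _, v in results]))   # pooled (t-t0)^{-1/2} sigma_hat
    emp_sd = g_hats.std(ddof=1)                          # empirical s.d. across runs
    return emp_sd, est_sd    # emp_sd >> est_sd suggests runs stuck in different wells
```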

Of course, if one intends to perform such diagnostics it makes sense to start out doing the $k$ multiple runs. A more elaborate procedure is to divide $[0,t]$ into $L$ successive blocks, and to check whether the $kL$ blocks “look similar”. This can be treated as a classical topic in statistics (“analysis of variance”). In brief, we compute the sample mean $\hat{g}_{i,j}$ and sample variance $\hat{\sigma}^2_{i,j}$ for the $j$’th block of the $i$’th simulation, and see whether these data (perhaps after deleting the first few blocks of each simulation) are consistent with the blocks being i.i.d. If so, we use the overall average as an estimator of $\bar{g}$, and estimate the accuracy of this estimator by assuming the blocks were independent.
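A minimal sketch of the resulting “batch means” computation, assuming `runs` is a $(k, t)$ array of $g$-values and that the first `drop` blocks of each run are deleted:

```python
import numpy as np

def batch_means(runs, L, drop=1):
    """Estimate g_bar and its accuracy by treating block means as i.i.d."""
    k, t = runs.shape
    b = t // L                                         # block length
    means = runs[:, :b * L].reshape(k, L, b).mean(axis=2)[:, drop:]
    flat = means.ravel()                               # the k*(L-drop) block means
    g_hat = flat.mean()                                # overall average estimates g_bar
    var_hat = flat.var(ddof=1) / len(flat)             # accuracy, assuming blocks i.i.d.
    return g_hat, var_hat
```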

If a multiple-runs diagnostic fails, or if one lacks confidence in one’s ability to choose a small number of starting points which might be attracted to different modes (if such existed), then one can seek schemes specially adapted to multimodal target densities. Because it is easy to find local maxima of a target density $f$, e.g. by a deterministic hill-climbing algorithm, one can find modes by repeating such an algorithm from many initial states, to try to find an exhaustive list of modes with relatively high $f$-values. This is mode-hunting; one can then design a chain tailored to jump between the wells with non-vanishing probabilities. Such methods are highly problem-specific; more general methods (such as the multi-level or multi-particle schemes of sections 11.3.3 and 11.3.4) seek to automate the search for relevant modes within MCMC instead of having a separate mode-hunting stage.

In seeking theoretical analysis of MCMC one faces an intrinsic difficulty: MCMC is only needed on “hard” problems, but such problems are difficult to study. In comparing effectiveness of different variants of MCMC it is natural to say “forget about theory – just see what works best on real examples”. But such experimental evaluation is itself conceptually difficult: pragmatism is easier in theory than in practice!

11.1.2 Further aspects of applied MCMC

Sampling from one-dimensional distributions. Consider a probability distribution $\mu$ on $R^1$ with density function $f$ and distribution function $F$. In one sense, sampling from $\mu$ is easy, because of the elementary result that $F^{-1}(U)$ has distribution $\mu$, where $U$ is uniform on $[0,1]$ and $x = F^{-1}(u)$ is the inverse function of $u = F(x)$. In cases where we have an explicit formula for $F^{-1}$, we are done. Many other cases can be handled using rejection sampling. Suppose there is some other density $g$ from which we can sample by the inverse distribution function method, and suppose we know a bound $c \geq \sup_x f(x)/g(x)$. Then the algorithm

propose a sample $x$ from $g(\cdot)$;

accept $x$ with probability $\frac{f(x)}{c g(x)}$; else propose a new sample from $g$

produces an output with density $f(\cdot)$ after a mean of $c$ proposals. By combining these two methods, libraries of algorithms for often-encountered one-dimensional distributions can be built, and indeed exist in statistical software packages.
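A minimal sketch of this loop; `f`, `g`, `g_inv_cdf` (the inverse distribution function of $g$), and the bound `c` are all user-supplied assumptions.

```python
import numpy as np

def rejection_sample(f, g, g_inv_cdf, c, rng):
    """Return one sample with density f, via proposals from g; mean cost c proposals."""
    while True:
        x = g_inv_cdf(rng.uniform())           # propose a sample x from g
        if rng.uniform() < f(x) / (c * g(x)):  # accept with probability f(x)/(c g(x))
            return x
```

For instance, with $f(x) = 2x$ on $[0,1]$, $g$ the uniform density (so `g_inv_cdf` is the identity) and $c = 2$, each proposal $x$ is accepted with probability $x$.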

But what about a general density $f(x)$? If we need to sample many times from the same density, it is natural to use deterministic numerical methods. First probe $f$ at many values of $x$. Then either

(a) build up a numerical approximation to $F$ and thence to $F^{-1}$; or

(b) choose from a library a suitable density $g$ and use rejection sampling.

The remaining case, which is thus the only “hard” aspect of sampling from one-dimensional distributions, is where we need only one sample from a general distribution; in other words, where we want many samples, each from a different distribution. This is exactly the setting of the Gibbs sampler where the target multidimensional density is complicated, and it motivates some of the variants we discuss in section 11.3.

Practical relevance of theoretical mixing time parameters. Standard theory from Chapter 4 (yyy cross-refs) relates $\tau_2$ to the asymptotic variance rate $\sigma^2(g)$ at (11.2) for the “worst-case” $g$:

$$\tau_2 = \frac{1}{1-\lambda_2} \approx \frac{1+\lambda_2}{1-\lambda_2} = \sup_g \frac{\sigma^2(g)}{{\rm var}_\pi\, g}. \qquad (11.4)$$
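As a small numeric illustration: for $g$ an eigenvector of a reversible transition matrix with eigenvalue $\lambda$, one has ${\rm cov}_\pi(g(X_0), g(X_s)) = \lambda^{|s|} \, {\rm var}_\pi\, g$, so summing the series in (11.2) gives $\sigma^2(g)/{\rm var}_\pi\, g = (1+\lambda)/(1-\lambda)$, maximized at $\lambda = \lambda_2$. The sketch below checks the two sides of (11.4) on a toy symmetric matrix (symmetric, hence reversible with uniform stationary distribution).

```python
import numpy as np

# A toy symmetric transition matrix: reversible, uniform stationary distribution.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
lam2 = np.sort(np.linalg.eigvalsh(P))[-2]          # second-largest eigenvalue
print("tau_2             =", 1 / (1 - lam2))       # relaxation time 1/(1 - lambda_2)
print("sup_g sigma^2/var =", (1 + lam2) / (1 - lam2))
```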

Moreover, Proposition 29 of Chapter 4 (yyy 10/11/94 version) shows that $\sigma^2(g)$ also appears in an upper bound on variances of finite-time averages from the stationary chain. So in asking how long to run MCMC simulations, a natural principle (not practical, of course, because we typically don’t know $\tau_2$) is

base estimates on $t$ steps, where $t$ is a reasonably large multiple of $\tau_2$.

But this principle can be attacked from opposite directions. It is sometimes argued that worrying about $\tau_2$ (corresponding to the worst-case $g$) is overly pessimistic in the context of studying some specific $g$. For instance, Sokal [311] p. 8 remarks that in natural statistical physics models on the infinite lattice near a phase transition in a parameter $\theta$, as $\theta$ tends to the critical point the growth exponent of $\sigma^2(g)$ for “interesting” $g$ is typically different from the growth exponent of $\tau_2$. Madras and Slade [252] p. 326 make similar remarks in the context of the pivot algorithm for self-avoiding walk. But we do not know similar examples in the statistical $R^d$ setting. In particular, in the presence of multimodality such counterexamples would require that $g$ be essentially “orthogonal” to the differences between modes, which seems implausible.

Burn-in, the time $t_0$ excluded from the estimator (11.3) to avoid undue influence of the initial state, is conceptually more problematic. Theory says that taking $t_0$ as a suitable multiple of $\tau_1$ would guarantee reliable estimates. The general fact $\tau_1 \geq \tau_2$ then suggests that allowing sufficient burn-in time is a stronger requirement than allowing enough “mixing” for the stationary chain – so the principle above is overly optimistic. On the other hand, because it refers to the worst-case initial state, requiring a burn-in time of $\tau_1$ seems far too conservative in practice. The bottom line is that one cannot eliminate the possibility of metastability error; in general, all one gets from multiple runs and diagnostics is confidence that one is sampling from a single potential well, in the imagery below (though section 11.6.2 indicates a special setting where we can do better).

Statistical physics imagery. Any probability distribution $\pi$ can be written as

$$\pi(x) \propto \exp(-H(x)).$$

One can call $H$ a potential function; note that a mode (local maximum) of $\pi$ is a local minimum of $H$. One can envisage a realization of a Markov chain as a particle moving under the influence of both a potential function (the particle responds to some “force” pushing it towards lower values of $H$) and random noise. Associated with each local minimum $y$ of $H$ is a potential well, which we envisage as the set of points from which, under the influence of the potential only (without noise), the particle would move to $y$ (in terms of $\pi$, states from which a “steepest ascent” path leads to $y$).

A fundamental intuitive picture is that the main reason why a reversible chain may relax slowly is that there is more than one potential well, and the chain takes a long time to move from one well to another. In such a case, $\pi$ conditioned to a single potential well will be a metastable (i.e. almost-stationary) distribution. One expects the chain’s distribution, from any initial state, to reach fairly quickly one (or a mixture) of these metastable distributions, and then the actual relaxation time to stationarity is dominated by the times taken to move between wells. In more detail, if there are $w$ wells then one can consider, as a coarse-grained approximation, a $w$-state continuous-time chain whose transition rates $w_1 \to w_2$ are the rates of moving from well $w_1$ to well $w_2$. Then $\tau_2$ for the original chain should be closely approximated by $\tau_2$ for the coarse-grained chain.

The hierarchical Normal model. As a very simple instance of (11.1), take $d = 1$, $p = 2$ and $x \to \phi(\mu, \sigma^2, x)$ the Normal$(\mu, \sigma^2)$ density. Then let $(\mu, \sigma)$ be chosen independently for each individual from some joint density $h(\mu, \sigma)$ on $R \times R^+$. The data is an $n$-vector ${\bf x} = (x_1, \ldots, x_n)$ and the full posterior distribution is

$$f_{\bf x}(\mu_1, \ldots, \mu_n, \sigma_1, \ldots, \sigma_n) = z_{\bf x}^{-1} \prod_{i=1}^n h(\mu_i, \sigma_i)\, \phi(\mu_i, \sigma_i^2, x_i).$$

Typically we are interested in the posterior mean of $\mu_i$ for fixed $i$, that is, $\bar{g}$ for

$$g(\mu_1, \ldots, \mu_n, \sigma_1, \ldots, \sigma_n) := \mu_i.$$
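A minimal sketch of the corresponding unnormalized log-posterior, usable as the hypothetical `log_f` input to the sampler sketches earlier; `log_h` is an assumed (vectorized) log-density for the prior $h(\mu, \sigma)$, and `theta` packs $(\mu_1, \ldots, \mu_n, \sigma_1, \ldots, \sigma_n)$.

```python
import numpy as np
from scipy.stats import norm

def make_log_posterior(x, log_h):
    """Return log f_x(theta) up to the additive constant -log z_x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    def log_f(theta):
        mu, sigma = theta[:n], theta[n:]
        if np.any(sigma <= 0):
            return -np.inf                 # the posterior lives on R^n x (R^+)^n
        return np.sum(log_h(mu, sigma) + norm.logpdf(x, loc=mu, scale=sigma))
    return log_f
```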

Pragmatism is easier in theory than in practice. In comparing MCMC methods experimentally, one obvious issue is the choice of example to study. Another issue is that, if we measure “time” as “number of steps”, then a step of one chain may not be comparable with a step of another chain. For instance, a Metropolis step is typically easier to implement than a Gibbs step. More subtly, in combinatorial examples there may be different ways to set up a data structure to represent the current state in a way that permits easy computation of $\pi$-values. The alternative of measuring “time” as CPU time introduces different problems – details of coding matter.