11 Markov Chain Monte Carlo (January 8 2001)

11.3 Variants of basic MCMC

11.3.1 Metropolized line sampling

Within the Gibbs or hit-and-run scheme, at each step one needs to sample from a one-dimensional distribution, but a different one-dimensional distribution each time. As mentioned in section 11.1.2, this is in general not easy to implement efficiently. An alternative is Metropolized line sampling, where one instead takes a single step of a Metropolis (i.e. propose/accept) chain with the correct stationary distribution. To say the idea abstractly, in the general “line sampling” setting of section 11.2.2, assume also:

for each $i$ we have an irreducible transition matrix $K^{i}$ on $S_{i}$ whose stationary distribution is $\pi^{[i]}$.

Then define a step $x \to y$ of the Metropolized line sampler as

  • choose $i$ from $w(\cdot, x)$;

  • then choose $y$ from $k^{i}(x, \cdot)$.

It is easy to check that the chain has stationary distribution $\pi$, and is reversible if the $K^{i}$ are reversible, so in particular if the $K^{i}$ are defined by a Metropolis-type propose/accept scheme. In the simplest setting, where the line sampler is the Gibbs sampler and we use the same one-dimensional proposal step distribution each time, this scheme is Metropolis-within-Gibbs. In that context it seems intuitively natural to use a long-tailed proposal distribution such as the Cauchy distribution, because we might encounter wildly different one-dimensional target densities, e.g. one density with s.d. $1/10$ and another with two modes separated by $10$. A $U(-L,L)$ step proposal would be inefficient in the latter case if $L$ is small, and inefficient in the former case if $L$ is large. Intuitively, a long-tailed distribution avoids these worst cases, at the cost of a smaller acceptance rate in good cases.
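As a minimal sketch of Metropolis-within-Gibbs with a Cauchy proposal, consider the following Python fragment. The two-coordinate target (one coordinate with s.d. $1/10$, one bimodal with modes $10$ apart, echoing the worst cases above), the proposal scale, and all function names are illustrative assumptions, not from the text.

```python
import math
import random

def log_pi(x):
    # Hypothetical 2-d target illustrating the two worst cases in the text:
    # coordinate 0 has s.d. 1/10; coordinate 1 has two modes separated by 10.
    a = -0.5 * (x[0] / 0.1) ** 2
    m1 = -0.5 * (x[1] - 5.0) ** 2
    m2 = -0.5 * (x[1] + 5.0) ** 2
    hi = max(m1, m2)                       # log-sum-exp, numerically safe
    return a + hi + math.log(math.exp(m1 - hi) + math.exp(m2 - hi))

def mwg_sweep(x, log_pi, scale=1.0, rng=random):
    """One sweep of Metropolis-within-Gibbs: for each coordinate in turn,
    propose a Cauchy-distributed step and accept/reject; each such step
    leaves the one-dimensional conditional (and hence pi) invariant."""
    x = list(x)
    for i in range(len(x)):
        prop = list(x)
        # standard Cauchy variate by inverse CDF, then scaled
        prop[i] += scale * math.tan(math.pi * (rng.random() - 0.5))
        if rng.random() < math.exp(min(0.0, log_pi(prop) - log_pi(x))):
            x = prop
    return x
```

With a single proposal scale of $1$, the long Cauchy tails let coordinate 1 jump between modes occasionally, while small steps still get accepted for the narrow coordinate 0.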

11.3.2 Multiple-try Metropolis

In the setting (section 11.2.1) of the Metropolis scheme, one might consider making several draws from the proposal distribution and choosing one of them to be the proposed move. Here is one way, suggested by Liu et al [235], to implement this idea. It turns out that, to ensure the stationary distribution is the target distribution $\pi$, we need extra samples which are used only to adjust the acceptance probability of the proposed step.

For simplicity, we take the case of a symmetric proposal matrix $K$. Fix $m \geq 2$. Define a step from $x$ of the multiple-try Metropolis (MTM) chain as follows.

  • Choose $y_1, \ldots, y_m$ independently from $k(x, \cdot)$;

  • Choose $y_i$ with probability proportional to $\pi(y_i)$;

  • Choose $x_1, \ldots, x_{m-1}$ independently from $k(y_i, \cdot)$, and set $x_m = x$;

  • Accept the proposed move $x \to y_i$ with probability $\min\left(1, \frac{\sum_i \pi(y_i)}{\sum_i \pi(x_i)}\right)$.
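The four steps above can be sketched in Python as follows. The standard-normal target and Gaussian proposal are hypothetical choices for illustration; `pi` need only be known up to a normalizing constant, and the proposal must be symmetric, as in the text.

```python
import math
import random

def mtm_step(x, pi, propose, m=5, rng=random):
    """One multiple-try Metropolis step with a symmetric proposal kernel
    propose(x, rng); pi is the (unnormalized) target density."""
    # draw m trial points y_1, ..., y_m from k(x, .)
    ys = [propose(x, rng) for _ in range(m)]
    wy = [pi(y) for y in ys]
    total_y = sum(wy)
    if total_y == 0.0:
        return x
    # select y_i with probability proportional to pi(y_i)
    r = rng.random() * total_y
    cum = 0.0
    for y, w in zip(ys, wy):
        cum += w
        if r <= cum:
            break
    # reference points x_1, ..., x_{m-1} from k(y_i, .), plus x_m = x
    xs = [propose(y, rng) for _ in range(m - 1)] + [x]
    total_x = sum(pi(z) for z in xs)
    # accept with probability min(1, sum_i pi(y_i) / sum_i pi(x_i))
    if rng.random() * total_x < total_y:
        return y
    return x

pi = lambda x: math.exp(-0.5 * x * x)          # standard normal, unnormalized
propose = lambda x, rng: x + rng.gauss(0.0, 1.0)
```

Note that the $m-1$ draws from $k(y_i, \cdot)$ are used only in the acceptance ratio, exactly as the text says.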

Irreducibility follows from irreducibility of $K$. To check detailed balance, write the acceptance probability as $\min(1, q)$. Then

\[
p_{xy} = m\, k_{xy} \sum \prod_{i=1}^{m-1} k_{x, y_i} \prod_{i=1}^{m-1} k_{y, x_i}\; \frac{\pi_y}{\sum_i \pi_{y_i}} \min(1, q)
\]

where the first sum is over ordered $(2m-2)$-tuples $(y_1, \ldots, y_{m-1}, x_1, \ldots, x_{m-1})$. So we can write

\[
\pi_x p_{xy} = m\, k_{xy} \pi_x \pi_y \sum \prod_{i=1}^{m-1} k_{x, y_i} \prod_{i=1}^{m-1} k_{y, x_i} \min\left(\frac{1}{\sum_i \pi_{y_i}}, \frac{q}{\sum_i \pi_{y_i}}\right).
\]

The choice of $q$ makes the final term become $\min\left(\frac{1}{\sum_i \pi_{y_i}}, \frac{1}{\sum_i \pi_{x_i}}\right)$. One can now check $\pi_x p_{xy} = \pi_y p_{yx}$, by switching the roles of the $x_j$ and the $y_j$.

To compare MTM with single-try Metropolis, consider the $m \to \infty$ limit, in which the empirical distribution of $y_1, \ldots, y_m$ will approach $k(x, \cdot)$, and so the distribution of the chosen $y_i$ will approach $k(x, \cdot)\pi(\cdot)/a_x$ for $a_x := \sum_y k_{xy} \pi_y$. Thus for large $m$ the transition matrix of MTM will approximate

\[
p^{\infty}_{xy} = \frac{k_{xy} \pi_y}{a_x}\, \min(1, a_x/a_y), \quad y \neq x.
\]

To compare with the single-try Metropolis chain $P$, rewrite both as

\[
p^{\infty}_{xy} = k_{xy} \pi_y \min\left(\frac{1}{a_x}, \frac{1}{a_y}\right), \quad y \neq x
\]
\[
p_{xy} = k_{xy} \pi_y \min\left(\frac{1}{\pi_x}, \frac{1}{\pi_y}\right), \quad y \neq x.
\]

Thinking of a step of the proposal chain as being in a random direction unrelated to the behavior of $\pi$, from a $\pi$-typical state $x$ we expect a proposed move to tend to make $\pi$ decrease, so we expect $a_x < \pi_x$ for $\pi$-typical $x$. In this sense, the equations above show that MTM is an improvement. Of course, if we judge cost by the number of evaluations of $\pi_x$, then a step of MTM costs $2m-1$ times as much as a step of single-try Metropolis, and by this criterion it seems implausible that MTM would be cheaper. On the other hand, one can envisage settings where there is substantial cost in updating a data structure associated with the current state $x$, and in such a setting MTM may be more appealing.

11.3.3 Multilevel sampling

Writing $\pi(x) \propto \exp(-H(x))$, as in the statistical physics imagery (section 11.1.2), suggests defining a one-parameter family of probability distributions by

\[
\pi_{\theta}(x) \propto \exp(-\theta H(x)).
\]

(In the physics analogy, $\theta$ corresponds to inverse temperature.) If $\pi$ is multimodal, we picture $\pi_{\theta}$, as $\theta$ increases from $0$ to $1$, interpolating between the uniform distribution and $\pi$ by making the potential wells grow deeper. Fix a proposal matrix $K$, and let $P_{\theta}$ be the transition matrix for the Metropolized chain (11.5) associated with $K$ and $\pi_{\theta}$. Now fix $L$ and values $0 = \theta_1 < \theta_2 < \ldots < \theta_L = 1$. The idea is that for small $\theta$ the $P_{\theta}$-chain should have less difficulty moving between wells; for $\theta = 1$ we get the correct distribution within each well; so by varying $\theta$ we can somehow sample accurately from all wells. There are several ways to implement this idea. Simulated tempering [254] defines a chain on state space $S \times \{1, \ldots, L\}$, where state $(x, i)$ represents configuration $x$ and parameter $\theta_i$, and where each step is either of the form

  • $(x, i) \to (x', i)$;  $x \to x'$ a step of $P_{\theta_i}$

or of the form

  • $(x, i) \to (x, i')$;  where $i \to i'$ is a proposed step of simple random walk on $\{1, 2, \ldots, L\}$.

However, implementing this idea is slightly intricate, because normalizing constants $z_{\theta}$ enter into the desired acceptance probabilities. A more elegant variation is the multilevel exchange chain suggested by Geyer [163] and implemented in statistical physics by Hukushima and Nemoto [185]. First consider $L$ independent chains, where the $i$'th chain $X^{(i)}_t$ has transition matrix $P_{\theta_i}$. Then introduce an interaction: propose to switch configurations $X^{(i)}$ and $X^{(i+1)}$, and accept with the appropriate probability. Precisely, take state space $S^L$ with states ${\bf x} = (x_1, \ldots, x_L)$. Fix a (small) number $0 < \alpha < 1$.

  • With probability $1 - \alpha$, pick $i$ uniformly from $\{1, \ldots, L\}$, pick $x'_i$ according to $P_{\theta_i}(x_i, \cdot)$ and update ${\bf x}$ by changing $x_i$ to $x'_i$.

  • With probability $\alpha$, pick uniformly an adjacent pair $(i, i+1)$, and propose to update ${\bf x}$ by replacing $(x_i, x_{i+1})$ by $(x_{i+1}, x_i)$. Accept this proposed move with probability

\[
\min\left(1, \frac{\pi_{\theta_i}(x_{i+1})\, \pi_{\theta_{i+1}}(x_i)}{\pi_{\theta_i}(x_i)\, \pi_{\theta_{i+1}}(x_{i+1})}\right).
\]
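The two kinds of move can be sketched in Python as follows. The double-well potential, the $\theta$ ladder, the Gaussian within-level proposal, and the function names are all illustrative assumptions; for $\pi_{\theta} \propto \exp(-\theta H)$ the swap acceptance ratio above simplifies to $\exp\big((\theta_{i+1} - \theta_i)(H(x_{i+1}) - H(x_i))\big)$.

```python
import math
import random

def H(x):
    # hypothetical double-well potential with modes near x = +/- 2
    return (x * x - 4.0) ** 2 / 4.0

def exchange_sweep(xs, thetas, alpha=0.2, scale=1.0, rng=random):
    """One step of the multilevel exchange chain: with probability 1-alpha,
    a within-level Metropolis move for pi_theta(x) prop. to exp(-theta*H(x));
    with probability alpha, a proposed swap of a uniformly chosen adjacent
    pair of levels."""
    xs = list(xs)
    L = len(thetas)
    if rng.random() > alpha:
        i = rng.randrange(L)                    # within-level move
        prop = xs[i] + rng.gauss(0.0, scale)
        # acceptance ratio pi_theta(prop)/pi_theta(x) = exp(theta*(H(x)-H(prop)))
        if rng.random() < math.exp(min(0.0, thetas[i] * (H(xs[i]) - H(prop)))):
            xs[i] = prop
    else:
        i = rng.randrange(L - 1)                # adjacent pair (i, i+1)
        # log of the swap acceptance ratio from the text, for this pi_theta
        log_q = (thetas[i + 1] - thetas[i]) * (H(xs[i + 1]) - H(xs[i]))
        if rng.random() < math.exp(min(0.0, log_q)):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs
```

In a continuous-state example like this one, the ladder should start at some $\theta_1 > 0$ rather than $0$, since $\exp(-0 \cdot H)$ is not normalizable on the real line; on a finite state space $\theta_1 = 0$ (the uniform distribution) is fine.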

To check that the product $\pi = \pi_{\theta_1} \times \ldots \times \pi_{\theta_L}$ is indeed a stationary distribution, write the acceptance probability as $\min(1, q)$. If ${\bf x}$ and ${\bf x}'$ differ only by interchange of $(x_i, x_{i+1})$ then

\[
\frac{\pi({\bf x})}{\pi({\bf x}')} \frac{p({\bf x}, {\bf x}')}{p({\bf x}', {\bf x})} = \frac{\pi({\bf x})}{\pi({\bf x}')} \frac{\frac{\alpha}{L-1} \min(1, q)}{\frac{\alpha}{L-1} \min(1, q^{-1})} = \frac{\pi({\bf x})}{\pi({\bf x}')}\, q
\]

and the definition of $q$ makes this expression equal $1$. The case of steps where only one component changes is easier to check.

11.3.4 Multiparticle MCMC

Consider the setting of section 11.2.2. There is a target distribution $\pi$ on $S$ and a collection of subsets $(S_i)$. Write $\pi^{[i]} = \pi(\cdot \mid S_i)$ and $I(x) = \{i : x \in S_i\}$. Now fix $m \geq 2$. We can use the line-sampling scheme of section 11.2.2 to define (recall Chapter 4 section 6.2) (yyy 10/11/94 version) a product chain on $S^m$ with stationary distribution $\pi \times \pi \times \ldots \times \pi = \pi^m$. For this product chain, picture $m$ particles; at each step, pick a random particle and make it move as a step from the line-sampling chain. Now let us introduce an interaction: the line along which a particle moves may depend on the positions of the other particles.

Here is a precise construction. Suppose that for each $(x, \hat{\bf x}) \in S \times S^{m-1}$ we are given a probability distribution $w(\cdot, x, \hat{\bf x})$ on $I(x)$ satisfying the following analog of (11.6):

if $x, y \in S_i$ then $w(i, x, \hat{\bf x}) = w(i, y, \hat{\bf x})$.   (11.8)

A step of the chain from $(x_i)$ is defined by

  • Pick $k$ uniformly from $\{1, 2, \ldots, m\}$;

  • Pick $i$ from $w(\cdot, x_k, (x_j, j \neq k))$;

  • Pick $x'_k$ from $\pi^{[i]}(\cdot)$;

  • Update $(x_i)$ by replacing $x_k$ by $x'_k$.

It is easy to check that $\pi^m$ is indeed a stationary distribution, and the chain is irreducible under condition (11.7). Of course we could, as in section 11.3.1, use a Metropolis step instead of sampling exactly from $\pi^{[i]}$.

Constructions of this type in statistical applications on $R^d$ go back to Gilks et al [166], under the name adaptive directional sampling. In particular they suggested picking a distinct pair $(j, k)$ of the "particles" and taking the straight line through $x_j$ and $x_k$ as the line to sample $x'_k$ from. Liu et al [235] suggest combining this idea with mode-hunting. Again pick a distinct pair $(j, k)$ of particles; but now use some algorithm to find a local maximum $m(x_j)$ of the target density starting from $x_j$, and sample $x'_k$ from the line through $x_k$ and $m(x_j)$.
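To make the adaptive directional sampling idea concrete, here is a hypothetical Python sketch. The two-mode target, the particle count, and the use of a Metropolized Cauchy step along the chosen line (in the spirit of section 11.3.1, rather than Gilks et al's exact sampling along the line) are all illustrative assumptions; parametrizing the line by arclength with a unit direction vector keeps the one-dimensional proposal symmetric.

```python
import math
import random

def log_pi(x):
    # hypothetical 2-d target: mixture of unit Gaussians at (4,4) and (-4,-4)
    m1 = -0.5 * sum((xi - 4.0) ** 2 for xi in x)
    m2 = -0.5 * sum((xi + 4.0) ** 2 for xi in x)
    hi = max(m1, m2)                       # log-sum-exp, numerically safe
    return hi + math.log(math.exp(m1 - hi) + math.exp(m2 - hi))

def ads_step(particles, log_pi, rng=random):
    """One adaptive-directional-sampling move: pick a distinct pair (j, k),
    take the straight line through x_j and x_k, and move x_k along that line
    by a symmetric Metropolized Cauchy step."""
    j, k = rng.sample(range(len(particles)), 2)
    xj, xk = particles[j], particles[k]
    d = [a - b for a, b in zip(xj, xk)]
    norm = math.sqrt(sum(v * v for v in d))
    if norm == 0.0:
        return particles                   # coincident particles: no direction
    u = [v / norm for v in d]              # unit direction along the line
    t = math.tan(math.pi * (rng.random() - 0.5))   # Cauchy step length
    prop = [a + t * b for a, b in zip(xk, u)]
    if rng.random() < math.exp(min(0.0, log_pi(prop) - log_pi(xk))):
        particles[k] = prop
    return particles
```

When particles occupy different modes, the line through a pair of them passes near both modes, so even a one-dimensional step can carry a particle across the low-density gap between them.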