11.3 Variants of basic MCMC‣ Chapter 11 Markov Chain Monte Carlo (January 8 2001) ‣ Reversible Markov Chains and Random Walks on Graphs

11.3.1 Metropolized line sampling

Within the Gibbs or hit-and-run scheme, at each step one needs to sample from a one-dimensional distribution, but a different one-dimensional distribution each time. As mentioned in section 11.1.2, this is in general not easy to implement efficiently. An alternative is Metropolized line sampling, where one instead takes a single step of a Metropolis (i.e. propose/accept) chain with the correct stationary distribution. To say the idea abstractly, in the general “line sampling” setting of section 11.2.2, assume also:

for each $i$ we have an irreducible transition matrix $K^{i}$ on $S_{i}$ whose stationary distribution is $\pi^{[i]}$ .

Then define a step $x\to y$ of the Metropolized line sampler as

•

choose $i$ from $w(\cdot,x)$ ;
•

then choose $y$ from $k^{i}(x,\cdot)$ .

It is easy to check that the chain has stationary distribution $\pi$ , and is reversible if the $K^{i}$ are reversible, so in particular if the $K^{i}$ are defined by a Metropolis-type propose-accept scheme. In the simplest setting where the line sampler is the Gibbs sampler and we use the same one-dimensional proposal step distribution each time, this scheme is Metropolis-within-Gibbs. In that context is seems intuitively natural to use a long-tailed proposal distribution such as the Cauchy distribution. Because we might encounter wildly different one-dimensional target densities, e.g. one density with s.d. $1/10$ and another with two modes separated by $10$ , and using a $U(-L,L)$ step proposal would be inefficient in the latter case if $L$ is small, and inefficient in the former case if $L$ is large. Intuitively, a long-tailed distribution avoids these worst cases, at the cost of having the acceptance rate be smaller in good cases.

11.3.2 Multiple-try Metropolis

In the setting (section 11.2.1) of the Metropolis scheme, one might consider making several draws from the proposal distribution and choosing one of them to be the proposed move. Here is one way, suggested by Liu et al [235], to implement this idea. It turns out that to ensure the stationary distribution is the target distribution $\pi$ , we need extra samples which are used only to adjust the acceptance probability of the proposed step.

For simplicity, we take the case of a symmetric proposal matrix $K$ . Fix $m\geq 2$ . Define a step from $x$ of the multiple-try Metropolis (MTM) chain as follows.

•

Choose $y_{1},\ldots,y_{m}$ independently from $k(x,\cdot)$ ;
•

Choose $y_{i}$ with probability proportional to $\pi(y_{i})$ ;
•

Choose $x_{1},\ldots,x_{m-1}$ independently from $k(y_{i},\cdot)$ , and set $x_{m}=x$ ;
•

Accept the proposed move $x\to y_{i}$ with probability $\min\left(1,\frac{\sum_{i}\pi(y_{i})}{\sum_{i}\pi(x_{i})}\right)$ .

Irreducibility follows from irreducibility of $K$ . To check detailed balance, write the acceptance probability as $\min(1,q)$ . Then

p_{xy}=mk_{xy}\sum\ \prod_{i=1}^{m-1}k_{x,y_{i}}\prod_{i=1}^{m-1}k_{y,x_{i}}\ % \frac{\pi_{y}}{\sum_{i}\pi_{y_{i}}}\min(1,q)

where the first sum is over ordered $(2m-2)$ -tuples $(y_{1},\ldots,y_{m-1},x_{1},\ldots,x_{m-1})$ . So we can write

\pi_{x}p_{xy}=mk_{xy}\pi_{x}\pi_{y}\sum\ \prod_{i=1}^{m-1}k_{x,y_{i}}\prod_{i=% 1}^{m-1}k_{y,x_{i}}\min\left(\frac{1}{\sum_{i}\pi_{y_{i}}},\frac{q}{\sum_{i}% \pi_{y_{i}}}\right).

The choice of $q$ makes the final term become $\min(\frac{1}{\sum_{i}\pi_{y_{i}}},\frac{1}{\sum_{i}\pi_{x_{i}}})$ . One can now check $\pi_{x}p_{xy}=\pi_{y}p_{yx}$ , by switching the roles of $x_{j}$ and $y_{j}$ .

To compare MTM with single-try Metropolis, consider the $m\to\infty$ limit, in which the empirical distribution of $y_{1},\ldots,y_{m}$ will approach $k(x,\cdot)$ , and so the distribution of the chosen $y_{i}$ will approach $k(x,\cdot)\pi(\cdot)/a_{x}$ for $a_{x}:=\sum_{y}k_{xy}\pi_{y}$ . Thus for large $m$ the transition matrix of MTM will approximate

p^{\infty}_{xy}=\frac{k_{xy}\pi_{y}}{a_{x}}\ \min(1,a_{x}/a_{y}),\quad y\neq x.

To compare with single-try Metropolis $P$ , rewrite both as

	$\displaystyle p^{\infty}_{xy}$	$\displaystyle=$	$\displaystyle k_{xy}\pi_{y}\min\left({\textstyle\frac{1}{a_{x}}},{\textstyle% \frac{1}{a_{y}}}\right),\quad y\neq x$
	$\displaystyle p_{xy}$	$\displaystyle=$	$\displaystyle k_{xy}\pi_{y}\min\left({\textstyle\frac{1}{\pi_{x}}},{\textstyle% \frac{1}{\pi_{y}}}\right),\quad y\neq x.$

Thinking of a step of the proposal chain as being in a random direction unrelated to the behavior of $\pi$ , from a $\pi$ -typical state $x$ we expect a proposed move to tend to make $\pi$ decrease, so we expect $a_{x}<\pi_{x}$ for $\pi$ -typical $x$ . In this sense, the equations above show that MTM is an improvement. Of course, if we judge “cost” in terms of the number of evaluations of $\pi_{x}$ , then a step of MTM costs $2m-1$ times the cost of single-step Metropolis. By this criterion it seems implausible that MTM would be cheaper than single-step. On the other hand one can envisage settings where there is substantial cost in updating a data structure associated with the current state $x$ , and in such a setting MTM may be more appealing.

11.3.3 Multilevel sampling

Writing $\pi(x)\propto\exp(-H(x))$ , as in the statistical physics imagery (section 11.1.2), suggests defining a one-parameter family of probability distributions by

\pi_{\theta}(x)\propto\exp(-\theta H(x)).

(In the physics analogy, $\theta$ corresponds to 1/temperature). If $\pi$ is multimodal we picture $\pi_{\theta}$ , as $\theta$ increases from $0$ to $1$ , interpolating between the uniform distribution and $\pi$ by making the potential wells grow deeper. Fix a proposal matrix $K$ , and let $P_{\theta}$ be the transition matrix for the Metropolized chain (11.5) associated with $K$ and $\pi_{\theta}$ . Now fix $L$ and values $0=\theta_{1}<\theta_{2}<\ldots<\theta_{L}=1$ . The idea is that for small $\theta$ the $P_{\theta}$ -chain should have less difficulty moving between wells; for $\theta=1$ we get the correct distribution within each well; so by varying $\theta$ we can somehow sample accurately from all wells. There are several ways to implement this idea. Simulated tempering [254] defines a chain on state space $S\times\{1,\ldots,L\}$ , where state $(x,i)$ represents configuration $x$ and parameter $\theta_{i}$ , and where each step is either of the form

•

$(x,i)\to(x^{\prime},i)$ ; $x\to x^{\prime}$ a step of $P_{\theta_{i}}$

or of the form

•

$(x,i)\to(x,i^{\prime})$ ; where $i\to i^{\prime}$ is a proposed step of simple random walk on $\{1,2,\ldots,L\}$ .

However, implementing this idea is slightly intricate, because normalizing constants $z_{\theta}$ enter into the desired acceptance probabilities. A more elegant variation is the multilevel exchange chain suggested by Geyer [163] and implemented in statistical physics by Hukushima and Nemoto [185]. First consider $L$ independent chains, where the $i$ ’th chain $X^{(i)}_{t}$ has transition matrix $P_{\theta_{i}}$ . Then introduce an interaction; propose to switch configurations $X^{(i)}$ and $X^{(i+1)}$ , and accept with the appropriate probability. Precisely, take state space $S^{L}$ with states ${\bf x}=(x_{1},\ldots,x_{L})$ . Fix a (small) number $0<\alpha<1$ .

•

With probability $1-\alpha$ pick $i$ uniformly from $\{1,\ldots,L\}$ , pick $x^{\prime}_{i}$ according to $P_{\theta_{i}}(x_{i},\cdot)$ and update ${\bf x}$ by changing $x_{i}$ to $x^{\prime}_{i}$ .

•

With probability $\alpha$ , pick uniformly an adjacent pair $(i,i+1)$ , and propose to update ${\bf x}$ by replacing $(x_{i},x_{i+1})$ by $(x_{i+1},x_{i})$ . Accept this proposed move with probability

\min\left(1,\frac{\pi_{\theta_{i}}(x_{i+1})\pi_{\theta_{i+1}}(x_{i})}{\pi_{% \theta_{i}}(x_{i})\pi_{\theta_{i+1}}(x_{i+1})}\right).

To check that the product $\pi=\pi_{\theta_{1}}\times\ldots\times\pi_{\theta_{L}}$ is indeed a stationary distribution, write the acceptance probability as $\min(1,q)$ . If ${\bf x}$ and ${\bf x}^{\prime}$ differ only by interchange of $(x_{i},x_{i+1})$ then

\frac{\pi({\bf x})}{\pi({\bf x}^{\prime})}\frac{p({\bf x},{\bf x}^{\prime})}{p% ({\bf x}^{\prime},{\bf x})}=\frac{\pi({\bf x})}{\pi({\bf x}^{\prime})}\frac{{% \textstyle\frac{\alpha}{L-1}}\min(1,q)}{{\textstyle\frac{\alpha}{L-1}}\min(1,q% ^{-1})}=\frac{\pi({\bf x})}{\pi({\bf x}^{\prime})}q

and the definition of $q$ makes the expression $=1$ . The case of steps where only one component changes is easier to check.

11.3.4 Multiparticle MCMC

Consider the setting of section 11.2.2. There is a target distribution $\pi$ on $S$ and a collection of subsets $(S_{i})$ . Write $\pi^{[i]}=\pi(\cdot|S_{i})$ and $I(x)=\{i:x\in S_{i}\}$ . Now fix $m\geq 2$ . We can use the line-sampling scheme of section 11.2.2 to define (recall Chapter 4 section 6.2) (yyy 10/11/94 version) a product chain on $S^{m}$ with stationary distribution $\pi\times\pi\times\ldots\times\pi=\pi^{k}$ . For this product chain, picture $m$ particles, at each step picking a random particle and making it move as a step from the line-sampling chain. Now let us introduce an interaction: the line along which a particle moves may depend on the positions of the other particles.

Here is a precise construction. Suppose that for each $(x,\hat{{\bf x}})\in S\times S^{m-1}$ we are given a probability distribution $w(\cdot,x,\hat{{\bf x}})$ on $I(x)$ satisfying the following analog of (11.6):

x,y\in S_{i}

(11.8)

A step of the chain from $(x_{i})$ is defined by

•

Pick $k$ uniformly from $\{1,2,\ldots,m\}$
•

Pick $i$ from $w(\cdot,x_{k},(x_{i},i\neq k))$
•

Pick $x^{\prime}_{k}$ from $\pi^{[i]}(\cdot)$
•

Update $(X_{i})$ by replacing $x_{k}$ by $x^{\prime}_{k}$ .

It is easy to check that $\pi^{m}$ is indeed a stationary distribution; and the chain is irreducible under condition (11.7). Of course we could, as in section 11.3.1, use a Metropolis step instead of sampling from $\pi^{[K]}$ .

Constructions of this type in statistical applications on $R^{d}$ go back to Gilks et al [166], under the name adaptive directional sampling. In particular they suggested picking a distinct pair $(j,k)$ of the “particles” and taking the straight line through $x_{j}$ and $x_{k}$ as the line to sample $x^{\prime}_{k}$ from. Liu et al [235] suggest combining this idea with mode-hunting. Again pick a distinct pair $(j,k)$ of “particles”; but now use some algorithm to find a local maximum $m(x_{j})$ of the target density starting from $x_{j}$ , and sample $x^{\prime}_{k}$ from the line through $x_{k}$ and $m(x_{j})$ .