11 Markov Chain Monte Carlo (January 8 2001)

11.4 A little theory

The chains designed for MCMC in previous sections are reversible, and therefore the theory of reversible chains developed in this book is available. Unfortunately there is very little extra to say; in that sense, there is no “theory of MCMC”. What follows is a collection of rather fragmentary observations.

11.4.1 Comparison methods

Consider the Metropolis chain

\[
p^{\rm Metro}_{xy}=k_{xy}\min\left(1,\frac{\pi_y k_{yx}}{\pi_x k_{xy}}\right),\qquad y\neq x.
\]
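To make the construction concrete, here is a minimal sketch in Python (not from the text; the finite state space, the NumPy proposal matrix K and target vector pi, and the name metropolis_step are all illustrative assumptions) of a single proposal/accept-reject step with these rates:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(x, K, pi):
    # Propose y from the row k_{x,.}, then accept with probability
    # min(1, pi_y k_yx / (pi_x k_xy)); otherwise hold at x.
    y = rng.choice(len(pi), p=K[x])
    accept = min(1.0, (pi[y] * K[y, x]) / (pi[x] * K[x, y]))
    return y if rng.random() < accept else x
\end{verbatim}

Note that K[x, y] > 0 whenever y is actually proposed, so the ratio is well defined.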

The requirement that a step of a chain be constructible as a proposal from $K$ followed by acceptance/rejection is the requirement that $p_{xy}\leq k_{xy}$, $y\neq x$. Recall the asymptotic variance rate

\[
\sigma^2(P,f):=\lim_t t^{-1}\,{\rm var}\sum_{s=1}^t f(X_s).
\]
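As an illustration of this definition (again a sketch; the two-state example is arbitrary and not from the text), one can estimate $\sigma^2(P,f)$ by simulating many independent stationary runs and computing ${\rm var}\sum_{s=1}^t f(X_s)/t$ for moderately large $t$:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def estimate_sigma2(P, pi, f, t=1000, reps=200):
    # Empirical version of sigma^2(P,f): variance of the partial sum over
    # independent replicate runs started from pi, divided by t.
    sums = np.empty(reps)
    for r in range(reps):
        x = rng.choice(len(pi), p=pi)        # stationary start
        total = 0.0
        for _ in range(t):
            x = rng.choice(len(pi), p=P[x])  # one step of the chain
            total += f[x]
        sums[r] = total
    return sums.var() / t

# Two-state example: here the exact value works out to
# 0.25 * (1 + 0.8)/(1 - 0.8) = 2.25, and the estimate should be close.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
print(estimate_sigma2(P, np.array([0.5, 0.5]), np.array([0.0, 1.0])))
\end{verbatim}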
Lemma 11.1 (Peskun’s Theorem [280])

Given $K$ and $\pi$, let $P$ be a reversible chain with $p_{xy}\leq k_{xy}$, $y\neq x$, and with stationary distribution $\pi$. Then $\sigma^2(P,f)\geq\sigma^2(P^{\rm Metro},f)$ for all $f$.

Proof. Reversibility of $P$ implies

\[
p_{xy}=\frac{\pi_y p_{yx}}{\pi_x}\leq\frac{\pi_y k_{yx}}{\pi_x}=k_{xy}\,\frac{\pi_y k_{yx}}{\pi_x k_{xy}}
\]

and hence

\[
p_{xy}=p^{\rm Metro}_{xy}\,\beta_{xy}
\]

where $\beta_{xy}=\beta_{yx}\leq 1$, $y\neq x$. So the result follows directly from Peskun’s lemma (yyy Lemma 11.5, to be moved elsewhere). $\Box$

This result can be interpreted as saying that the Metropolis rates (11.5) are the optimal way of implementing a proposal-rejection scheme. Loosely speaking, a similar result holds in any natural Metropolis-like construction of a reversible chain using a $\min(1,\cdot)$ acceptance probability.
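As a concrete check on Lemma 11.1 (a sketch; the randomly generated K, pi, and f are arbitrary and not from the text), one can compare the Metropolis chain with the chain obtained from Barker's acceptance rule $w/(1+w)$, whose off-diagonal rates also satisfy $p_{xy}\leq k_{xy}$; Peskun's ordering predicts the Metropolis chain has the smaller asymptotic variance rate.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def build_chain(K, pi, accept):
    # Off-diagonal rates k_xy * accept(w_xy), w_xy = pi_y k_yx / (pi_x k_xy);
    # the diagonal holds the leftover probability.  K has zero diagonal here.
    num = pi[None, :] * K.T                    # pi_y k_yx
    den = pi[:, None] * K                      # pi_x k_xy
    np.fill_diagonal(num, 1.0)                 # dummy values on the diagonal
    np.fill_diagonal(den, 1.0)
    P = K * accept(num / den)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))
    return P

def sigma2(P, pi, f):
    # Exact asymptotic variance rate via the fundamental matrix
    # Z = (I - P + 1 pi^T)^{-1}: sigma^2 = <fbar, (2Z - I) fbar>_pi.
    n = len(pi)
    fbar = f - pi @ f
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))
    return fbar @ np.diag(pi) @ (2 * Z - np.eye(n)) @ fbar

n = 6
K = rng.random((n, n)); np.fill_diagonal(K, 0.0); K /= K.sum(axis=1, keepdims=True)
pi = rng.random(n); pi /= pi.sum()
f = rng.random(n)

s_metro  = sigma2(build_chain(K, pi, lambda w: np.minimum(1.0, w)), pi, f)
s_barker = sigma2(build_chain(K, pi, lambda w: w / (1.0 + w)), pi, f)
print(s_metro <= s_barker)                     # expected: True
\end{verbatim}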

It is important to notice that Lemma 11.1 does not answer the following question, which (except for highly symmetric graphs) seems intractable.

Question. Given a connected graph and a probability distribution $\pi$ on its vertices, consider the class of reversible chains with stationary distribution $\pi$ and with transitions only across edges of the graph. Within that class, which chain has smallest relaxation time?

Unfortunately, standard comparison theorems don’t take us much further in comparing MCMC methods. To see why, consider Metropolis on $R^d$ with isotropic Normal$(0,\sigma^2{\bf I}_d)$ proposal steps. This has some relaxation time $\tau_2(f,\sigma)$, where $f$ is the target density. For $\sigma_1<\sigma_2$, the normal densities $g_\sigma(x)$ satisfy $g_{\sigma_2}(x)/g_{\sigma_1}(x)\geq(\sigma_1/\sigma_2)^d$. So the comparison theorem (Chapter 3 Lemma 29) (yyy 9/2/94 version) shows

\[
\tau_2(f,\sigma_1)\geq(\sigma_1/\sigma_2)^d\,\tau_2(f,\sigma_2),\qquad \sigma_1<\sigma_2.
\]

But this is no help in determining the optimal $\sigma$.
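For the record, the density-ratio bound invoked above is just the elementary calculation (not spelled out in the text): writing $g_\sigma(x)=(2\pi\sigma^2)^{-d/2}\exp(-|x|^2/2\sigma^2)$,
\[
\frac{g_{\sigma_2}(x)}{g_{\sigma_1}(x)}
=\left(\frac{\sigma_1}{\sigma_2}\right)^d
\exp\!\left(\frac{|x|^2}{2}\left(\frac{1}{\sigma_1^2}-\frac{1}{\sigma_2^2}\right)\right)
\geq\left(\frac{\sigma_1}{\sigma_2}\right)^d,
\qquad \sigma_1<\sigma_2,
\]
since the exponent is nonnegative.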

11.4.2 Metropolis with independent proposals

Though unrealistic in practical settings, the specialization of the Metropolis chain to the case where the proposal chain is i.i.d., that is where $k_{xy}=k_y$, is mathematically a natural object of study. In this setting the transition matrix (11.5) becomes

\[
p_{xy}=k_y\min\left(1,w_y/w_x\right),\qquad y\neq x
\]

where $w_x:=\pi_x/k_x$. It turns out there is a simple and sharp coupling analysis, based on the trick of labeling states as $1,2,\ldots,n$ so that $w_1\geq w_2\geq\ldots\geq w_n$ (Liu [236] used this trick to give an eigenvalue analysis, extending part (b) below). Let $\rho$ be the chance that a proposed step from state $1$ is rejected (count a proposed step from state $1$ to $1$ as always accepted). So

\[
\rho=\sum_{i=1}^n k_i\left(1-\frac{w_i}{w_1}\right)<1.
\]
Proposition 11.2

For the Metropolis chain over independent proposals, with states ordered as above,

(a) $\bar{d}(t)\leq\rho^t$.

(b) The relaxation time $\tau_2=(1-\rho)^{-1}$.

Proof. For the chain started at state $1$, the time $T$ of the first acceptance of a proposed step satisfies

\[
P(T>t)=\rho^t.
\]

Recall from (yyy Chapter 4-3 section 1; 10/11/99 version) the notion of coupling. For this chain a natural coupling is obtained by using the same $U(0,1)$ random variable to implement the accept/reject step (accept if $U<P(\mbox{accept})$) in two versions of the chain. It is easy to check that this coupling $(X_t,X^\prime_t)$ respects the ordering: if $X_0\leq X^\prime_0$ then $X_t\leq X^\prime_t$. At time $T$ the fact that a proposed jump from $1$ is accepted implies that a jump from any other state must be accepted. So $T$ is a coupling time, and the coupling inequality (yyy Chapter 4-3 section 1.1; 10/11/99 version) implies $\bar{d}(t)\leq P(T>t)$. This establishes (a), and the general inequality $\bar{d}(t)=\Omega(\lambda_2^t)$ implies $\lambda_2\leq\rho$. On the other hand, for the chain started at state $1$, on $\{T=1\}$ the time-$1$ distribution is $\pi$; in other words

\[
P_1(X_1\in\cdot)=\rho\,\delta_1(\cdot)+(1-\rho)\,\pi(\cdot).
\]

But this says that $\rho$ is an eigenvalue of $P$: since $\pi P=\pi$, we get $(\delta_1-\pi)P=\rho\delta_1+(1-\rho)\pi-\pi=\rho(\delta_1-\pi)$, so $\delta_1-\pi$ is a (left) eigenvector with eigenvalue $\rho$. Combined with $\lambda_2\leq\rho$ this gives $\lambda_2=\rho$, establishing (b). $\Box$
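Proposition 11.2(b) is easy to check numerically; here is a sketch (the random choice of k and pi is arbitrary, not from the text) that builds the transition matrix, computes $\rho$, and compares it with the second-largest eigenvalue:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 8
k = rng.random(n); k /= k.sum()          # i.i.d. proposal distribution k
pi = rng.random(n); pi /= pi.sum()       # target distribution pi
w = pi / k

# Off-diagonal rates k_y min(1, w_y/w_x); diagonal holds the leftover probability.
P = k[None, :] * np.minimum(1.0, w[None, :] / w[:, None])
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))

rho = np.sum(k * (1.0 - w / w.max()))
lam = np.sort(np.linalg.eigvals(P).real)[::-1]
print(np.isclose(lam[1], rho))           # expected: True, so tau_2 = 1/(1 - rho)
\end{verbatim}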

In the continuous-space setting, with a proposal distribution uniform on $[0,1]$ and target density $f$ with $f^*:=\max_x f(x)$, part (b) implies the relaxation time $\tau_2$ equals $f^*$. So (unsurprisingly) Metropolis-over-independent is comparable to the basic rejection sampling scheme (section 11.1.2), which gives an exact sample in mean $f^*$ steps.
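For comparison, here is a minimal sketch of that rejection sampling scheme (the density $f(x)=3x^2$ on $[0,1]$ is an arbitrary example, not from the text): proposals are uniform on $[0,1]$ and accepted with probability $f(x)/f^*$, so the mean number of proposals per exact sample is $f^*$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 3.0 * x ** 2        # example target density on [0,1]
f_star = 3.0                      # its maximum f*

def rejection_sample():
    # Repeat uniform proposals until one is accepted; return the sample
    # together with the number of proposals used.
    trials = 0
    while True:
        trials += 1
        x = rng.random()
        if rng.random() < f(x) / f_star:
            return x, trials

counts = [rejection_sample()[1] for _ in range(20000)]
print(np.mean(counts))            # roughly f_star = 3
\end{verbatim}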