11 Markov Chain Monte Carlo (January 8 2001)

11.5 The diffusion heuristic for optimal scaling of high dimensional Metropolis

In any Metropolis scheme for sampling from a target distribution on $R^d$, there arises the question of how large to take the steps of the proposal chain. One can answer this for isotropic-proposal schemes in high dimensions, in the setting where the target is a product distribution, and the result in this (very artificial) setting provides a heuristic for more realistic settings exemplified by (11.1).

11.5.1 Optimal scaling for high-dimensional product distribution sampling

Fix a probability density function $f$ on $R^1$. For large $d$ consider the i.i.d. product distribution $\pi_f(d{\bf x}) = \prod_{i=1}^d f(x_i)\,dx_i$ for ${\bf x} = (x_i) \in R^d$. Suppose we want to sample from $\pi_f$ using Metropolis or Gibbs; what is the optimal scaling (as a function of $d$) for the step size of the proposal chain, and how does the relaxation time scale?

For the Gibbs sampler this question is straightforward. Consider the one-dimensional case, and take the proposal step increments to be Normal$(0,\sigma^2)$. Then (under technical conditions on $f$, omitted here and in Theorem 11.3) the Gibbs chain will have some finite relaxation time depending on $f$ and $\sigma$, and choosing the optimal $\sigma^*$ gives a relaxation time $\tau_2(f)$, say. The Gibbs sampler chain in which we choose a random coordinate and propose changing only that coordinate (using the optimal $\sigma^*$ above) is a product chain in the sense of Chapter 4 section 6.2 (yyy 10/11/94 version), and so the relaxation time of this product chain is $\tau_2^{\rm Gibbs}(f) = \tau_2(f)\,d$.

Though the argument above is very simple, it is unsatisfactory because there is no simple expression for relaxation time as a function of $\sigma$ or for the optimal $\sigma^*$. It turns out that this difficulty is eliminated in the isotropic-proposal Metropolis chain. In the Gibbs sampler above, the variance of the length of a proposed step is $\sigma^2$, so we retain this property by specifying the steps of the proposal chain to have Normal$(0,\sigma^2 d^{-1}{\bf I}_d)$ distribution. One expects the relaxation time to grow linearly in $d$ in this setting also. The following result of Roberts et al [292] almost proves this, and has other useful corollaries.

Theorem 11.3

Fix $\sigma > 0$. Let $({\bf X}(t),\ t = 0,1,2,\ldots)$ be the Metropolis chain for sampling from product measure $\pi_f$ on $R^d$ based on a proposal random walk with step distribution Normal$(0,\sigma^2 d^{-1}{\bf I}_d)$. Write $X^{(1)}(t)$ for the first coordinate of ${\bf X}(t)$, and let $Y_d(t) := X^{(1)}(\lfloor td \rfloor)$ be this coordinate process speeded up by a factor $d$, for continuous time $0 \le t < \infty$. Suppose ${\bf X}(0)$ has the stationary distribution $\pi_f$. Then

$$(Y_d(t),\ 0\le t<\infty)\ \stackrel{d}{\rightarrow}\ (Y(t),\ 0\le t<\infty)\ \mbox{ as } d\to\infty \qquad (11.9)$$

where the limit process is the stationary one-dimensional diffusion

$$dY_t = \theta^{1/2}\,dW_t + \theta\,\mu(Y_t)\,dt \qquad (11.10)$$

for standard Brownian motion $W_t$, where

\begin{eqnarray*}
\mu(y) &:=& \frac{f'(y)}{2f(y)} \\
\theta &:=& 2\sigma^2\,\Phi(-\sigma\kappa/2) \mbox{ where $\Phi$ is the Normal distribution function} \\
\kappa &:=& \left(\int \frac{(f'(x))^2}{f(x)}\,dx\right)^{1/2}.
\end{eqnarray*}

Moreover, as $d\to\infty$ the proportion of accepted proposals in the stationary chain tends to $2\Phi(-\sigma\kappa/2)$.
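The acceptance claim is easy to probe by simulation. Here is a minimal sketch (ours, not part of the original text; the function and parameter names are our own choices) for the standard Normal target, for which $\kappa = 1$, comparing the observed acceptance fraction with $2\Phi(-\sigma\kappa/2)$.

```python
# Minimal sketch: random-walk Metropolis for the product target
# pi_f = N(0,1)^d with proposal steps Normal(0, sigma^2/d I_d).
# For this f, kappa = 1, so the predicted stationary acceptance
# rate is 2*Phi(-sigma/2).
import numpy as np
from scipy.stats import norm

def acceptance_rate(sigma, d, n_steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)          # X(0) ~ pi_f, the stationary law
    accepts = 0
    for _ in range(n_steps):
        xi = rng.normal(0.0, sigma / np.sqrt(d), size=d)
        # log pi_f(x + xi) - log pi_f(x) for the standard Normal density
        log_ratio = 0.5 * np.sum(x**2 - (x + xi)**2)
        if np.log(rng.random()) < log_ratio:
            x, accepts = x + xi, accepts + 1
    return accepts / n_steps

sigma = 2.38                            # the optimal value when kappa = 1
print("simulated:", acceptance_rate(sigma, d=200))
print("predicted:", 2 * norm.cdf(-sigma / 2))   # approximately 0.234
```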

We outline the proof in section 11.5.3. The result may look complicated, so one piece of background may be helpful. Given a probability distribution on the integers, there is a Metropolis chain for sampling from it based on the simple random walk proposal chain. As a continuous-space analog, given a density $f$ on $R^1$ there is a “Metropolis diffusion” with stationary density $f$ based on $\theta^{1/2}W_t$ (for arbitrary constant $\theta$) as “proposal diffusion”, and this Metropolis diffusion is exactly the diffusion (11.10): see Notes to (yyy final Chapter).

Thus the appearance of the limit diffusion $Y$ is not unexpected; what is important is the explicit formula for $\theta$ in terms of $\sigma$ and $f$. Note that the parameter $\theta$ affects the process $(Y_t)$ only as a speed parameter. That is, if $Y^*_t$ is the process (11.10) with $\theta = 1$ then the general process can be represented as $Y_t = Y^*_{\theta t}$. In particular, the relaxation time scales as $\tau_2(Y) = \theta^{-1}\tau_2(Y^*)$. Thus we seek to maximize $\theta$ as a function of the underlying step scale $\sigma$, and a simple numerical calculation shows this is maximized by taking $\sigma = 2.38/\kappa$, giving $\theta = 1.3/\kappa^2$.
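That numerical calculation is easy to reproduce. The sketch below (ours) maximizes $\theta(\sigma) = 2\sigma^2\Phi(-\sigma\kappa/2)$; by the substitution $u = \sigma\kappa$ it suffices to treat $\kappa = 1$.

```python
# Maximize theta(sigma) = 2 sigma^2 Phi(-sigma kappa / 2);
# by rescaling u = sigma * kappa we may take kappa = 1.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

theta = lambda s: 2 * s**2 * norm.cdf(-s / 2)          # kappa = 1
res = minimize_scalar(lambda s: -theta(s), bounds=(0.1, 10.0), method="bounded")
print("optimal sigma * kappa  :", res.x)               # about 2.38
print("maximal theta * kappa^2:", theta(res.x))        # about 1.3
print("acceptance at optimum  :", 2 * norm.cdf(-res.x / 2))  # about 0.234
```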

Thus Theorem 11.3 suggests that for the Metropolis chain ${\bf X}$, the optimal proposal-step variance is $2.38^2\,\kappa^{-2}\,d^{-1}{\bf I}_d$, and suggests that the relaxation time $\tau_2(f,d)$ scales as

$$\tau_2(f,d) \sim \frac{d\,\kappa^2}{1.3}\ \tau_2(Y^*). \qquad (11.11)$$

In writing (11.11) we are pretending that the Metropolis chain is a product chain (so that its relaxation time is the relaxation time of its individual components) and that the relaxation time can be passed to the limit in (11.9). Giving a rigorous proof of (11.11) seems hard.

11.5.2 The diffusion heuristic

Continuing the discussion above, Theorem 11.3 says that the long-run proportion of proposed moves which are accepted is $2\Phi(-\kappa\sigma/2)$. At the optimal value $\sigma = 2.38/\kappa$ we find this proportion is a “pure number” $0.23$, which does not depend on $f$. To quote [292]:

This result gives rise to the useful heuristic for random walk Metropolis in practice:

Tune the proposal variance so that the average acceptance rate is roughly $1/4$.

We call this the diffusion heuristic for proposal-step scaling. Intuitively one might hope that the heuristic would be effective for fairly general unimodal target densities on $R^d$, though it clearly has nothing to say about the problem of passage between modes in a multimodal target. Note also that to invoke the diffusion heuristic in a combinatorial setting, where the proposal chain is random walk on a graph, one needs to assume that the target distribution is “smooth” in the sense that $\pi(v)/\pi(w) \approx 1$ for a typical edge $(v,w)$. In this case one can make a Metropolis chain in which the proposal chain jumps $\sigma$ edges in one step, and seek to optimize $\sigma$. See Roberts [294] for some analysis in the context of smooth distributions on the $d$-cube. However, such smoothness assumptions seem inapplicable to most practical combinatorial MCMC problems.
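In the continuous setting the heuristic is often implemented adaptively. Here is one minimal sketch (our illustration; the Robbins–Monro tuning rule and all names below are our assumptions, not from [292]) which nudges $\log\sigma$ until the running acceptance rate is near $0.234$.

```python
# Sketch: adaptively scale the proposal so the acceptance rate
# approaches the diffusion-heuristic target of roughly 1/4.
import numpy as np

def tune_sigma(log_target, x0, target_rate=0.234, n_steps=20000, seed=0):
    """Random-walk Metropolis with Robbins-Monro adaptation of sigma."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d, log_sigma, lp = len(x), 0.0, log_target(x)
    for t in range(1, n_steps + 1):
        y = x + rng.normal(0.0, np.exp(log_sigma) / np.sqrt(d), size=d)
        lq = log_target(y)
        accept = np.log(rng.random()) < lq - lp
        if accept:
            x, lp = y, lq
        # raise sigma when accepting too often, lower it when too rarely
        log_sigma += (float(accept) - target_rate) / t**0.6
    return np.exp(log_sigma)

# For the standard Normal product target (kappa = 1) the tuned value
# should come out near 2.38:
print(tune_sigma(lambda z: -0.5 * np.sum(z**2), np.zeros(50)))
```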

11.5.3 Sketch proof of Theorem 11.3

Write a typical step of the proposal chain as

$$(x_1, x_2, \ldots, x_d) \to (x_1+\xi_1,\ x_2+\xi_2,\ \ldots,\ x_d+\xi_d).$$

Write

$$J = \log\frac{f(x_1+\xi_1)}{f(x_1)}; \qquad S = \log\prod_{i=2}^{d}\frac{f(x_i+\xi_i)}{f(x_i)}.$$

The step is accepted with probability $\min\left(1, \prod_{i=1}^d \frac{f(x_i+\xi_i)}{f(x_i)}\right) = \min(1, e^{J+S})$. So the increment of the first coordinate of the Metropolis chain has mean $E\,\xi_1 \min(1,e^{J+S})$ and mean-square $E\,\xi_1^2 \min(1,e^{J+S})$. The essential issue in the proof is to show that, for “typical” values of $(x_2,\ldots,x_d)$,

$$E\,\xi_1 \min(1,e^{J+S}) \sim \theta\,\mu(x_1)/d \qquad (11.12)$$
$$E\,\xi_1^2 \min(1,e^{J+S}) \sim \theta/d. \qquad (11.13)$$

This identifies the asymptotic drift and variance rates of $Y_d(t)$ with those of $Y(t)$.
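For a concrete target these two moment asymptotics can be probed numerically. The sketch below (ours) takes $f$ standard Normal, so $\mu(x) = -x/2$, $\kappa = 1$ and $\theta = 2\sigma^2\Phi(-\sigma/2)$; it fixes $x_1$, draws the remaining coordinates from $\pi_f$, and averages over fresh proposals. Agreement is only up to Monte Carlo error and the finite-$d$ approximation.

```python
# Monte Carlo check of (11.12)-(11.13) for f = standard Normal:
# mu(x1) = -x1/2, kappa = 1, theta = 2 sigma^2 Phi(-sigma/2).
import numpy as np
from scipy.stats import norm

d, sigma, x1, n = 100, 2.0, 1.5, 100000
rng = np.random.default_rng(1)
theta = 2 * sigma**2 * norm.cdf(-sigma / 2)

x = np.concatenate([np.full((n, 1), x1),
                    rng.standard_normal((n, d - 1))], axis=1)  # (x2..xd) ~ pi_f
xi = rng.normal(0.0, sigma / np.sqrt(d), size=(n, d))
log_ratio = 0.5 * np.sum(x**2 - (x + xi)**2, axis=1)           # J + S
acc = np.minimum(1.0, np.exp(log_ratio))

print("E xi1   min(1,e^{J+S}):", np.mean(xi[:, 0] * acc))
print("theta mu(x1)/d        :", theta * (-x1 / 2) / d)
print("E xi1^2 min(1,e^{J+S}):", np.mean(xi[:, 0]**2 * acc))
print("theta/d               :", theta / d)
```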

Write $h(u) := E\min(1, e^{u+S})$. Since

$$J \approx \log\left(1 + \frac{f'(x_1)}{f(x_1)}\,\xi_1\right) \approx \frac{f'(x_1)}{f(x_1)}\,\xi_1 = 2\mu(x_1)\,\xi_1,$$

the desired estimates (11.12,11.13) can be rewritten as

$$E\,J\,h(J) \sim 2\theta\,\mu^2(x_1)/d \qquad (11.14)$$
$$E\,J^2 h(J) \sim 4\theta\,\mu^2(x_1)/d. \qquad (11.15)$$

Now if $J$ has Normal$(0,\beta^2)$ distribution then for sufficiently regular $h(\cdot)$ we have

$$E\,J\,h(J) \sim \beta^2 h'(0); \qquad E\,J^2 h(J) \sim \beta^2 h(0) \ \mbox{ as } \beta\to 0.$$

Since $J$ has approximately Normal$(0,\beta^2)$ distribution with $\beta^2 = 4\mu^2(x_1)\,{\rm var}\,\xi_1 = 4\mu^2(x_1)\sigma^2/d$, proving (11.14,11.15) reduces to proving

$$h'(0) \to \frac{\theta}{2\sigma^2} \qquad (11.16)$$
$$h(0) \to \frac{\theta}{\sigma^2}. \qquad (11.17)$$

We shall argue

$${\rm dist}(S) \mbox{ is approximately Normal}(-\kappa^2\sigma^2/2,\ \kappa^2\sigma^2). \qquad (11.18)$$

Taking the first two terms in the expansion of log(1+u)\log(1+u) gives

$$\log\frac{f(x_i+\xi_i)}{f(x_i)} \approx \frac{f'(x_i)}{f(x_i)}\,\xi_i - \frac{1}{2}\left(\frac{f'(x_i)}{f(x_i)}\right)^2 \xi_i^2.$$

Write $K({\bf x}) = d^{-1}\sum_{i=2}^d \left(\frac{f'(x_i)}{f(x_i)}\right)^2$. Summing the previous approximation over $i$, the first sum on the right has approximately Normal$(0,\sigma^2 K({\bf x}))$ distribution, and (using the weighted law of large numbers) the second sum is approximately $-\frac{1}{2}\sigma^2 K({\bf x})$. So the distribution of $S$ is approximately Normal$(-K({\bf x})\sigma^2/2,\ K({\bf x})\sigma^2)$. But by the law of large numbers, for a typical ${\bf x}$ drawn from the product distribution $\pi_f$ we have $K({\bf x}) \approx \kappa^2$, giving (11.18).
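The normal approximation (11.18) can also be checked by direct simulation. A small sketch (ours), again for $f$ standard Normal so that $\kappa = 1$:

```python
# Simulate S = sum_{i>=2} log f(x_i + xi_i)/f(x_i) for f = N(0,1)
# and compare with the predicted Normal(-sigma^2/2, sigma^2) law.
import numpy as np

d, sigma, n = 500, 2.0, 20000
rng = np.random.default_rng(2)
x = rng.standard_normal((n, d - 1))                    # coordinates 2..d
xi = rng.normal(0.0, sigma / np.sqrt(d), size=(n, d - 1))
S = 0.5 * np.sum(x**2 - (x + xi)**2, axis=1)
print("mean of S:", S.mean(), " predicted:", -sigma**2 / 2)
print("var  of S:", S.var(),  " predicted:", sigma**2)
```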

To argue (11.17) we pretend $S$ has exactly the Normal distribution at (11.18). By a standard formula, if $S$ has Normal$(\alpha,\beta^2)$ distribution then

$$E\min(1, e^S) = \Phi(\alpha/\beta) + e^{\alpha+\beta^2/2}\,\Phi(-\beta - \alpha/\beta).$$

Substituting $\alpha = -\kappa^2\sigma^2/2$ and $\beta = \kappa\sigma$ (so that $\alpha/\beta = -\beta - \alpha/\beta = -\kappa\sigma/2$ and $\alpha + \beta^2/2 = 0$), this leads to

$$h(0) = 2\Phi(-\kappa\sigma/2)$$

which verifies (11.17), since $\theta/\sigma^2 = 2\Phi(-\sigma\kappa/2)$. From the definition of $h(u)$ we see

$$h'(0) = E\,e^S 1_{(S\le 0)} = h(0) - P(S\ge 0) = h(0) - \Phi(-\kappa\sigma/2) = \Phi(-\kappa\sigma/2)$$

which verifies (11.16).
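Both closing identities are easy to confirm numerically by pretending, as above, that $S$ is exactly Normal$(-\kappa^2\sigma^2/2,\ \kappa^2\sigma^2)$; the following check is our illustration.

```python
# With S ~ Normal(-(kappa sigma)^2/2, (kappa sigma)^2):
#   h(0)  = E min(1, e^S)   should equal 2 Phi(-kappa sigma / 2),
#   h'(0) = E e^S 1{S <= 0} should equal   Phi(-kappa sigma / 2).
import numpy as np
from scipy.stats import norm

kappa, sigma, n = 1.0, 2.38, 10**6
rng = np.random.default_rng(3)
S = rng.normal(-(kappa * sigma)**2 / 2, kappa * sigma, size=n)
print("h(0)  MC:", np.minimum(1.0, np.exp(S)).mean(),
      "  exact:", 2 * norm.cdf(-kappa * sigma / 2))
print("h'(0) MC:", (np.exp(S) * (S <= 0)).mean(),
      "  exact:", norm.cdf(-kappa * sigma / 2))
```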