11.5 The diffusion heuristic for optimal scaling of high dimensional Metropolis

In any Metropolis scheme for sampling from a target distribution on $R^{d}$, there arises the question of how large to take the steps of the proposal chain. One can answer this for isotropic-proposal schemes in high dimensions, in the setting where the target is a product distribution, and the result in this (very artificial) setting provides a heuristic for more realistic settings exemplified by (11.1).

11.5.1 Optimal scaling for high-dimensional product distribution sampling

Fix a probability density function $f$ on $R^{1}$. For large $d$ consider the i.i.d. product distribution $\pi_{f}(d{\bf x})=\prod_{i=1}^{d}f(x_{i})\ dx_{i}$ for ${\bf x}=(x_{i})\in R^{d}$. Suppose we want to sample from $\pi_{f}$ using Metropolis or Gibbs; what is the optimal scaling (as a function of $d$) for the step size of the proposal chain, and how does the relaxation time scale?

For the Gibbs sampler this question is straightforward. Consider the one-dimensional case, and take the proposal step increments to be Normal$(0,\sigma^{2})$. Then (under technical conditions on $f$ – we omit technical conditions here and in Theorem 11.3) the Gibbs chain will have some finite relaxation time depending on $f$ and $\sigma$, and choosing the optimal $\sigma^{*}$ gives a relaxation time $\tau_{2}(f)$, say. The Gibbs sampler chain in which we choose a random coordinate and propose changing only that coordinate (using the optimal $\sigma^{*}$ above) is a product chain in the sense of Chapter 4 section 6.2 (yyy 10/11/94 version), and so the relaxation time of this product chain is $\tau_{2}^{{\rm Gibbs}}(f)=\tau_{2}(f)\ d$.

Though the argument above is very simple, it is unsatisfactory because there is no simple expression for relaxation time as a function of $\sigma$ or for the optimal $\sigma^{*}$. It turns out that this difficulty is eliminated in the isotropic-proposal Metropolis chain. In the Gibbs sampler above, the variance of the length of a proposed step is $\sigma^{2}$, so we retain this property by specifying the steps of the proposal chain to have Normal$(0,\sigma^{2}d^{-1}{\bf I}_{d})$ distribution. One expects the relaxation time to grow linearly in $d$ in this setting also. The following result of Roberts et al [292] almost proves this, and has other useful corollaries.

Theorem 11.3

Fix $\sigma>0$. Let $({\bf X}(t),t=0,1,2,\ldots)$ be the Metropolis chain for sampling from product measure $\pi_{f}$ on $R^{d}$ based on a proposal random walk with step distribution Normal$(0,\sigma^{2}d^{-1}{\bf I}_{d})$. Write $X^{(1)}(t)$ for the first coordinate of ${\bf X}(t)$, and let $Y_{d}(t):=X^{(1)}(\lfloor td\rfloor)$ be this coordinate process speeded up by a factor $d$, for continuous $0\leq t<\infty$. Suppose ${\bf X}(0)$ has the stationary distribution $\pi_{f}$. Then

 $(Y_{d}(t),0\leq t<\infty)\ \stackrel{d}{\rightarrow}\ (Y(t),0\leq t<\infty)% \mbox{ as }d\to\infty$ (11.9)

where the limit process is the stationary one-dimensional diffusion

 $dY_{t}=\theta^{1/2}dW_{t}+\theta\mu(Y_{t})dt$ (11.10)

for standard Brownian motion $W_{t}$, where

 $\displaystyle\mu(y)$ $\displaystyle:=$ $\displaystyle{\textstyle\frac{f^{\prime}(y)}{2f(y)}}$ $\displaystyle\theta$ $\displaystyle:=$ $\Phi$ $\displaystyle\kappa$ $\displaystyle:=$ $\displaystyle\left(\int{\textstyle\frac{(f^{\prime}(x))^{2}}{f(x)}}\ dx\right)% ^{1/2}.$

Moreover, as $d\to\infty$ the proportion of accepted proposals in the stationary chain tends to $2\Phi(-\sigma\kappa/2)$.

We outline the proof in section 11.5.3. The result may look complicated, so one piece of background may be helpful. Given a probability distribution on the integers, there is a Metropolis chain for sampling from it based on the simple random walk proposal chain. As a continuous-space analog, given a density $f$ on $R^{1}$ there is a “Metropolis diffusion” with stationary density $f$ based on $\theta^{1/2}W_{t}$ (for arbitrary constant $\theta$) as “proposal diffusion”, and this Metropolis diffusion is exactly the diffusion (11.10): see Notes to (yyy final Chapter).

Thus the appearance of the limit diffusion $Y$ is not unexpected; what is important is the explicit formula for $\theta$ in terms of $\sigma$ and $f$. Note that the parameter $\theta$ affects the process $(Y_{t})$ only as a speed parameter. That is, if $Y^{*}_{t}$ is the process (11.10) with $\theta=1$ then the general process can be represented as $Y_{t}=Y^{*}_{\theta t}$. In particular, the relaxation time scales as $\tau_{2}(Y)=\theta^{-1}\tau_{2}(Y^{*})$. Thus we seek to maximize $\theta$ as a function of the underlying step variance $\sigma$, and a simple numerical calculation shows this is maximized by taking $\sigma=2.38/\kappa$, giving $\theta=1.3/\kappa^{2}$.

Thus Theorem 11.3 suggests that for the Metropolis chain ${\bf X}$, the optimal variance is $2.38^{2}\kappa^{-2}d^{-1}{\bf I}_{d}$, and suggests that the relaxation time $\tau_{2}(f,d)$ scales as

 $\tau_{2}(f,d)\sim\frac{d\kappa^{2}}{1.3}\ \tau_{2}(Y^{*}).$ (11.11)

In writing (11.11) we are pretending that the Metropolis chain is a product chain (so that its relaxation time is the relaxation time of its individual components) and that relaxation time can be passed to the limit in (11.9). Making a rigorous proof of (11.11) seems hard.

11.5.2 The diffusion heuristic.

Continuing the discussion above, Theorem 11.3 says that the long-run proportion of proposed moves which are accepted is $2\Phi(-\kappa\sigma/2)$. At the optimal value $\sigma=2.38/\kappa$ we find this proportion is a “pure number” $0.23$, which does not depend on $f$. To quote [292]

This result gives rise to the useful heuristic for random walk Metropolis in practice:

Tune the proposal variance so that the average acceptance rate is roughly $1/4$.

We call this the diffusion heuristic for proposal-step scaling. Intuitively one might hope that the heuristic would be effective for fairly general unimodel target densities on $R^{d}$, though it clearly has nothing to say about the problem of passage between modes in a multimodal target. Note also that to invoke the diffusion heuristic in a combinatorial setting, where the proposal chain is random walk on a graph, one needs to assume that the target distribution is “smooth” in the sense that $\pi(v)/\pi(w)\approx 1$ for a typical edge $(v,w)$. In this case one can make a Metropolis chain in which the proposal chain jumps $\sigma$ edges in one step, and seek to optimize $\sigma$. See Roberts [294] for some analysis in the context of smooth distributions on the $d$-cube. However, such smoothness assumptions seem inapplicable to most practical combinatorial MCMC problems.

11.5.3 Sketch proof of Theorem

Write a typical step of the proposal chain as

 $(x_{1},x_{2},\ldots,x_{d})\to(x_{1}+\xi_{1},x_{2}+\xi_{2},\ldots,x_{d}+\xi_{d}).$

Write

 $J=\log\frac{f(x_{i}+\xi_{1})}{f(x_{1})};\quad S=\log\prod_{i=2}^{d}\frac{f(x_{% i}+\xi_{i})}{f(x_{i})}.$

The step is accepted with probability $\min(1,\prod_{i=1}^{d}\frac{f(x_{i}+\xi_{i})}{f(x_{i})})=\min(1,e^{J+S})$. So the increment of the first coordinate of the Metropolis chain has mean and mean-square $E\xi_{1}\ \min(1,e^{J+S})$ and $E\xi_{1}^{2}\ \min(1,e^{J+S})$. The essential issue in the proof is to show that, for “typical” values of $(x_{2},\ldots,x_{n})$,

 $\displaystyle E\xi_{1}\ \min(1,e^{J+S})$ $\displaystyle\sim$ $\displaystyle\theta\mu(x_{1})/d$ (11.12) $\displaystyle E\xi_{1}^{2}\ \min(1,e^{J+S})$ $\displaystyle\sim$ $\displaystyle\theta/d.$ (11.13)

This identifies the asymptotic drift and variance rates of $Y_{d}(t)$ with those of $Y(t)$.

Write $h(u):=E\min(1,e^{u+S})$. Since

 $J\approx\log\left(1+{\textstyle\frac{f^{\prime}(x_{1})}{f(x_{1})}}\xi_{1}% \right)\approx{\textstyle\frac{f^{\prime}(x_{1})}{f(x_{1})}}\xi_{1}=2\mu(x_{1}% )\xi_{1},$

the desired estimates (11.12,11.13) can be rewritten as

 $\displaystyle EJh(J)$ $\displaystyle\sim$ $\displaystyle 2\theta\mu^{2}(x_{1})/d$ (11.14) $\displaystyle EJ^{2}h(J)$ $\displaystyle\sim$ $\displaystyle 4\theta\mu^{2}(x_{1})/d.$ (11.15)

Now if $J$ has Normal$(0,\beta^{2})$ distribution then for sufficiently regular $h(\cdot)$ we have

 $EJh(J)\sim\beta^{2}h^{\prime}(0);\quad EJ^{2}h(J)\sim\beta^{2}h(0)\mbox{ as }% \beta\to 0.$

Since $J$ has approximately Normal$(0,4\mu^{2}(x_{1}){\rm var}\ \xi_{1}=4\mu^{2}(x_{1})\sigma^{2}/d)$ distribution, proving (11.14,11.15) reduces to proving

 $\displaystyle h^{\prime}(0)$ $\displaystyle\to$ $\displaystyle\frac{\theta}{2\sigma^{2}}$ (11.16) $\displaystyle h(0)$ $\displaystyle\to$ $\displaystyle\frac{\theta}{\sigma^{2}}.$ (11.17)

We shall argue

 ${\rm dist}(S)\mbox{ is approximately Normal}(-\kappa^{2}\sigma^{2}/2,\kappa^{2% }\sigma^{2}).$ (11.18)

Taking the first two terms in the expansion of $\log(1+u)$ gives

 $\log{\textstyle\frac{f(x_{i}+\xi_{i})}{f(x_{i})}}\approx{\textstyle\frac{f^{% \prime}(x_{i})}{f(x_{i})}}\ \xi_{i}-{\textstyle\frac{1}{2}}\left({\textstyle% \frac{f^{\prime}(x_{i})}{f(x_{i})}}\right)^{2}\xi_{i}^{2}.$

Write $K({\bf x})=d^{-1}\sum_{i=2}^{d}({\textstyle\frac{f^{\prime}(x_{i})}{f(x_{i})}}% )^{2}$. Summing the previous approximation over $i$, the first sum on the right has approximately Normal$(0,\sigma^{2}K({\bf x}))$ distribution, and (using the weighted law of large numbers) the second term is approximately $-\frac{1}{2}\sigma^{2}K({\bf x})$. So the distribution of $S$ is approximately Normal$(-K({\bf x})\sigma^{2}/2,K({\bf x})\sigma^{2})$. But by the law of large numbers, for a typical ${\bf x}$ drawn from the product distribution $\pi_{f}$ we have $K({\bf x})\approx\kappa^{2}$, giving (11.18).

To argue (11.17) we pretend $S$ has exactly the Normal distribution at (11.18). By a standard formula, if $S$ has Normal$(\alpha,\beta^{2})$ distribution then

 $E\max(1,e^{S})=\Phi(\alpha/\beta)+e^{\alpha+\beta^{2}/2}\Phi(-\beta-\alpha/% \beta).$

 $h(0)=2\Phi(-\kappa\sigma/2)$
which verifies (11.17). From the definition of $h(u)$ we see
 $h^{\prime}(0)=Ee^{S}1_{(S\leq 0)}=h(0)-P(S\geq 0)=h(0)-\Phi(-\kappa\sigma/2)=% \Phi(-\kappa\sigma/2)$