11 Markov Chain Monte Carlo (January 8 2001)

11.5 The diffusion heuristic for optimal scaling of high dimensional Metropolis

In any Metropolis scheme for sampling from a target distribution on $R^d$, there arises the question of how large to take the steps of the proposal chain. One can answer this for isotropic-proposal schemes in high dimensions, in the setting where the target is a product distribution, and the result in this (very artificial) setting provides a heuristic for more realistic settings exemplified by (11.1).

11.5.1 Optimal scaling for high-dimensional product distribution sampling

Fix a probability density function $f$ on $R^1$. For large $d$ consider the i.i.d. product distribution $\pi_f(d{\bf x}) = \prod_{i=1}^d f(x_i)\,dx_i$ for ${\bf x} = (x_i) \in R^d$. Suppose we want to sample from $\pi_f$ using Metropolis or Gibbs; what is the optimal scaling (as a function of $d$) for the step size of the proposal chain, and how does the relaxation time scale?

For the Gibbs sampler this question is straightforward. Consider the one-dimensional case, and take the proposal step increments to be Normal$(0,\sigma^2)$. Then (under technical conditions on $f$, omitted here and in Theorem 11.3) the Gibbs chain will have some finite relaxation time depending on $f$ and $\sigma$, and choosing the optimal $\sigma^*$ gives a relaxation time $\tau_2(f)$, say. The Gibbs sampler chain in which we choose a random coordinate and propose changing only that coordinate (using the optimal $\sigma^*$ above) is a product chain in the sense of Chapter 4 section 6.2 (yyy 10/11/94 version), and so the relaxation time of this product chain is $\tau_2^{\rm Gibbs}(f) = \tau_2(f)\,d$.

Though the argument above is very simple, it is unsatisfactory because there is no simple expression for relaxation time as a function of $\sigma$ or for the optimal $\sigma^*$. It turns out that this difficulty is eliminated in the isotropic-proposal Metropolis chain. In the Gibbs sampler above, the variance of the length of a proposed step is $\sigma^2$, so we retain this property by specifying the steps of the proposal chain to have Normal$(0,\sigma^2 d^{-1}{\bf I}_d)$ distribution. One expects the relaxation time to grow linearly in $d$ in this setting also. The following result of Roberts et al [292] almost proves this, and has other useful corollaries.

Theorem 11.3

Fix $\sigma > 0$. Let $({\bf X}(t),\ t = 0,1,2,\ldots)$ be the Metropolis chain for sampling from product measure $\pi_f$ on $R^d$ based on a proposal random walk with step distribution Normal$(0,\sigma^2 d^{-1}{\bf I}_d)$. Write $X^{(1)}(t)$ for the first coordinate of ${\bf X}(t)$, and let $Y_d(t) := X^{(1)}(\lfloor td \rfloor)$ be this coordinate process speeded up by a factor $d$, for continuous time $0 \le t < \infty$. Suppose ${\bf X}(0)$ has the stationary distribution $\pi_f$. Then

$$(Y_d(t),\ 0\le t<\infty)\ \stackrel{d}{\rightarrow}\ (Y(t),\ 0\le t<\infty)\ \mbox{ as } d\to\infty \qquad (11.9)$$

where the limit process is the stationary one-dimensional diffusion

$$dY_t = \theta^{1/2}\,dW_t + \theta\,\mu(Y_t)\,dt \qquad (11.10)$$

for standard Brownian motion $W_t$, where

\begin{eqnarray*}
\mu(y) &:=& \frac{f'(y)}{2f(y)} \\
\theta &:=& 2\sigma^2\,\Phi(-\sigma\kappa/2) \mbox{ where $\Phi$ is the Normal distribution function} \\
\kappa &:=& \left(\int \frac{(f'(x))^2}{f(x)}\,dx\right)^{1/2}.
\end{eqnarray*}

Moreover, as $d\to\infty$ the proportion of accepted proposals in the stationary chain tends to $2\Phi(-\sigma\kappa/2)$.
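The acceptance claim is easy to probe by simulation. Here is a minimal sketch (ours, not part of the original text; the function and parameter names are our own choices) for the standard Normal target, for which $\kappa = 1$, comparing the observed acceptance fraction with $2\Phi(-\sigma\kappa/2)$.

```python
# Minimal sketch: random-walk Metropolis for the product target
# pi_f = N(0,1)^d with proposal steps Normal(0, sigma^2/d I_d).
# For this f, kappa = 1, so the predicted stationary acceptance
# rate is 2*Phi(-sigma/2).
import numpy as np
from scipy.stats import norm

def acceptance_rate(sigma, d, n_steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)          # X(0) ~ pi_f, the stationary law
    accepts = 0
    for _ in range(n_steps):
        xi = rng.normal(0.0, sigma / np.sqrt(d), size=d)
        # log pi_f(x + xi) - log pi_f(x) for the standard Normal density
        log_ratio = 0.5 * np.sum(x**2 - (x + xi)**2)
        if np.log(rng.random()) < log_ratio:
            x, accepts = x + xi, accepts + 1
    return accepts / n_steps

sigma = 2.38                            # the optimal value when kappa = 1
print("simulated:", acceptance_rate(sigma, d=200))
print("predicted:", 2 * norm.cdf(-sigma / 2))   # approximately 0.234
```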

We outline the proof in section 11.5.3. The result may look complicated, so one piece of background may be helpful. Given a probability distribution on the integers, there is a Metropolis chain for sampling from it based on the simple random walk proposal chain. As a continuous-space analog, given a density $f$ on $R^1$ there is a “Metropolis diffusion” with stationary density $f$ based on $\theta^{1/2}W_t$ (for arbitrary constant $\theta$) as “proposal diffusion”, and this Metropolis diffusion is exactly the diffusion (11.10): see Notes to (yyy final Chapter).

Thus the appearance of the limit diffusion $Y$ is not unexpected; what is important is the explicit formula for $\theta$ in terms of $\sigma$ and $f$. Note that the parameter $\theta$ affects the process $(Y_t)$ only as a speed parameter. That is, if $Y^*_t$ is the process (11.10) with $\theta = 1$ then the general process can be represented as $Y_t = Y^*_{\theta t}$. In particular, the relaxation time scales as $\tau_2(Y) = \theta^{-1}\tau_2(Y^*)$. Thus we seek to maximize $\theta$ as a function of the underlying step scale $\sigma$, and a simple numerical calculation shows this is maximized by taking $\sigma = 2.38/\kappa$, giving $\theta = 1.3/\kappa^2$.
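That numerical calculation is easy to reproduce. The sketch below (ours) maximizes $\theta(\sigma) = 2\sigma^2\Phi(-\sigma\kappa/2)$; by the substitution $u = \sigma\kappa$ it suffices to treat $\kappa = 1$.

```python
# Maximize theta(sigma) = 2 sigma^2 Phi(-sigma kappa / 2);
# by rescaling u = sigma * kappa we may take kappa = 1.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

theta = lambda s: 2 * s**2 * norm.cdf(-s / 2)          # kappa = 1
res = minimize_scalar(lambda s: -theta(s), bounds=(0.1, 10.0), method="bounded")
print("optimal sigma * kappa  :", res.x)               # about 2.38
print("maximal theta * kappa^2:", theta(res.x))        # about 1.3
print("acceptance at optimum  :", 2 * norm.cdf(-res.x / 2))  # about 0.234
```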

Thus Theorem 11.3 suggests that for the Metropolis chain ${\bf X}$, the optimal proposal-step variance is $2.38^2\,\kappa^{-2}\,d^{-1}{\bf I}_d$, and suggests that the relaxation time $\tau_2(f,d)$ scales as

$$\tau_2(f,d) \sim \frac{d\,\kappa^2}{1.3}\ \tau_2(Y^*). \qquad (11.11)$$

In writing (11.11) we are pretending that the Metropolis chain is a product chain (so that its relaxation time is the relaxation time of its individual components) and that the relaxation time can be passed to the limit in (11.9). Giving a rigorous proof of (11.11) seems hard.

11.5.2 The diffusion heuristic

Continuing the discussion above, Theorem 11.3 says that the long-run proportion of proposed moves which are accepted is $2\Phi(-\kappa\sigma/2)$. At the optimal value $\sigma = 2.38/\kappa$ we find this proportion is a “pure number” $0.23$, which does not depend on $f$. To quote [292]:

This result gives rise to the useful heuristic for random walk Metropolis in practice:

Tune the proposal variance so that the average acceptance rate is roughly $1/4$.

We call this the diffusion heuristic for proposal-step scaling. Intuitively one might hope that the heuristic would be effective for fairly general unimodal target densities on $R^d$, though it clearly has nothing to say about the problem of passage between modes in a multimodal target. Note also that to invoke the diffusion heuristic in a combinatorial setting, where the proposal chain is random walk on a graph, one needs to assume that the target distribution is “smooth” in the sense that $\pi(v)/\pi(w) \approx 1$ for a typical edge $(v,w)$. In this case one can make a Metropolis chain in which the proposal chain jumps $\sigma$ edges in one step, and seek to optimize $\sigma$. See Roberts [294] for some analysis in the context of smooth distributions on the $d$-cube. However, such smoothness assumptions seem inapplicable to most practical combinatorial MCMC problems.
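In the continuous setting the heuristic is often implemented adaptively. Here is one minimal sketch (our illustration; the Robbins–Monro tuning rule and all names below are our assumptions, not from [292]) which nudges $\log\sigma$ until the running acceptance rate is near $0.234$.

```python
# Sketch: adaptively scale the proposal so the acceptance rate
# approaches the diffusion-heuristic target of roughly 1/4.
import numpy as np

def tune_sigma(log_target, x0, target_rate=0.234, n_steps=20000, seed=0):
    """Random-walk Metropolis with Robbins-Monro adaptation of sigma."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d, log_sigma, lp = len(x), 0.0, log_target(x)
    for t in range(1, n_steps + 1):
        y = x + rng.normal(0.0, np.exp(log_sigma) / np.sqrt(d), size=d)
        lq = log_target(y)
        accept = np.log(rng.random()) < lq - lp
        if accept:
            x, lp = y, lq
        # raise sigma when accepting too often, lower it when too rarely
        log_sigma += (float(accept) - target_rate) / t**0.6
    return np.exp(log_sigma)

# For the standard Normal product target (kappa = 1) the tuned value
# should come out near 2.38:
print(tune_sigma(lambda z: -0.5 * np.sum(z**2), np.zeros(50)))
```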

11.5.3 Sketch proof of Theorem 11.3

Write a typical step of the proposal chain as

$$(x_1, x_2, \ldots, x_d) \to (x_1+\xi_1,\ x_2+\xi_2,\ \ldots,\ x_d+\xi_d).$$

Write

$$J = \log\frac{f(x_1+\xi_1)}{f(x_1)}; \qquad S = \log\prod_{i=2}^{d}\frac{f(x_i+\xi_i)}{f(x_i)}.$$

The step is accepted with probability $\min\left(1, \prod_{i=1}^d \frac{f(x_i+\xi_i)}{f(x_i)}\right) = \min(1, e^{J+S})$. So the increment of the first coordinate of the Metropolis chain has mean $E\,\xi_1 \min(1,e^{J+S})$ and mean-square $E\,\xi_1^2 \min(1,e^{J+S})$. The essential issue in the proof is to show that, for “typical” values of $(x_2,\ldots,x_d)$,

$$E\,\xi_1 \min(1,e^{J+S}) \sim \theta\,\mu(x_1)/d \qquad (11.12)$$
$$E\,\xi_1^2 \min(1,e^{J+S}) \sim \theta/d. \qquad (11.13)$$

This identifies the asymptotic drift and variance rates of $Y_d(t)$ with those of $Y(t)$.
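For a concrete target these two moment asymptotics can be probed numerically. The sketch below (ours) takes $f$ standard Normal, so $\mu(x) = -x/2$, $\kappa = 1$ and $\theta = 2\sigma^2\Phi(-\sigma/2)$; it fixes $x_1$, draws the remaining coordinates from $\pi_f$, and averages over fresh proposals. Agreement is only up to Monte Carlo error and the finite-$d$ approximation.

```python
# Monte Carlo check of (11.12)-(11.13) for f = standard Normal:
# mu(x1) = -x1/2, kappa = 1, theta = 2 sigma^2 Phi(-sigma/2).
import numpy as np
from scipy.stats import norm

d, sigma, x1, n = 100, 2.0, 1.5, 100000
rng = np.random.default_rng(1)
theta = 2 * sigma**2 * norm.cdf(-sigma / 2)

x = np.concatenate([np.full((n, 1), x1),
                    rng.standard_normal((n, d - 1))], axis=1)  # (x2..xd) ~ pi_f
xi = rng.normal(0.0, sigma / np.sqrt(d), size=(n, d))
log_ratio = 0.5 * np.sum(x**2 - (x + xi)**2, axis=1)           # J + S
acc = np.minimum(1.0, np.exp(log_ratio))

print("E xi1   min(1,e^{J+S}):", np.mean(xi[:, 0] * acc))
print("theta mu(x1)/d        :", theta * (-x1 / 2) / d)
print("E xi1^2 min(1,e^{J+S}):", np.mean(xi[:, 0]**2 * acc))
print("theta/d               :", theta / d)
```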

Write $h(u) := E\min(1, e^{u+S})$. Since

$$J \approx \log\left(1 + \frac{f'(x_1)}{f(x_1)}\,\xi_1\right) \approx \frac{f'(x_1)}{f(x_1)}\,\xi_1 = 2\mu(x_1)\,\xi_1,$$

the desired estimates (11.12,11.13) can be rewritten as

$$E\,J\,h(J) \sim 2\theta\,\mu^2(x_1)/d \qquad (11.14)$$
$$E\,J^2 h(J) \sim 4\theta\,\mu^2(x_1)/d. \qquad (11.15)$$

Now if $J$ has Normal$(0,\beta^2)$ distribution then for sufficiently regular $h(\cdot)$ we have

$$E\,J\,h(J) \sim \beta^2 h'(0); \qquad E\,J^2 h(J) \sim \beta^2 h(0) \ \mbox{ as } \beta\to 0.$$

Since $J$ has approximately Normal$(0,\beta^2)$ distribution with $\beta^2 = 4\mu^2(x_1)\,{\rm var}\,\xi_1 = 4\mu^2(x_1)\sigma^2/d$, proving (11.14,11.15) reduces to proving

$$h'(0) \to \frac{\theta}{2\sigma^2} \qquad (11.16)$$
$$h(0) \to \frac{\theta}{\sigma^2}. \qquad (11.17)$$

We shall argue

$${\rm dist}(S) \mbox{ is approximately Normal}(-\kappa^2\sigma^2/2,\ \kappa^2\sigma^2). \qquad (11.18)$$

Taking the first two terms in the expansion of log(1+u)\log(1+u) gives

$$\log\frac{f(x_i+\xi_i)}{f(x_i)} \approx \frac{f'(x_i)}{f(x_i)}\,\xi_i - \frac{1}{2}\left(\frac{f'(x_i)}{f(x_i)}\right)^2 \xi_i^2.$$

Write $K({\bf x}) = d^{-1}\sum_{i=2}^d \left(\frac{f'(x_i)}{f(x_i)}\right)^2$. Summing the previous approximation over $i$, the first sum on the right has approximately Normal$(0,\sigma^2 K({\bf x}))$ distribution, and (using the weighted law of large numbers) the second sum is approximately $-\frac{1}{2}\sigma^2 K({\bf x})$. So the distribution of $S$ is approximately Normal$(-K({\bf x})\sigma^2/2,\ K({\bf x})\sigma^2)$. But by the law of large numbers, for a typical ${\bf x}$ drawn from the product distribution $\pi_f$ we have $K({\bf x}) \approx \kappa^2$, giving (11.18).
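The normal approximation (11.18) can also be checked by direct simulation. A small sketch (ours), again for $f$ standard Normal so that $\kappa = 1$:

```python
# Simulate S = sum_{i>=2} log f(x_i + xi_i)/f(x_i) for f = N(0,1)
# and compare with the predicted Normal(-sigma^2/2, sigma^2) law.
import numpy as np

d, sigma, n = 500, 2.0, 20000
rng = np.random.default_rng(2)
x = rng.standard_normal((n, d - 1))                    # coordinates 2..d
xi = rng.normal(0.0, sigma / np.sqrt(d), size=(n, d - 1))
S = 0.5 * np.sum(x**2 - (x + xi)**2, axis=1)
print("mean of S:", S.mean(), " predicted:", -sigma**2 / 2)
print("var  of S:", S.var(),  " predicted:", sigma**2)
```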

To argue (11.17) we pretend $S$ has exactly the Normal distribution at (11.18). By a standard formula, if $S$ has Normal$(\alpha,\beta^2)$ distribution then

$$E\min(1, e^S) = \Phi(\alpha/\beta) + e^{\alpha+\beta^2/2}\,\Phi(-\beta - \alpha/\beta).$$

Substituting $\alpha = -\kappa^2\sigma^2/2$ and $\beta = \kappa\sigma$ (so that $\alpha/\beta = -\beta - \alpha/\beta = -\kappa\sigma/2$ and $\alpha + \beta^2/2 = 0$), this leads to

$$h(0) = 2\Phi(-\kappa\sigma/2)$$

which verifies (11.17), since $\theta/\sigma^2 = 2\Phi(-\sigma\kappa/2)$. From the definition of $h(u)$ we see

$$h'(0) = E\,e^S 1_{(S\le 0)} = h(0) - P(S\ge 0) = h(0) - \Phi(-\kappa\sigma/2) = \Phi(-\kappa\sigma/2)$$

which verifies (11.16).
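Both closing identities are easy to confirm numerically by pretending, as above, that $S$ is exactly Normal$(-\kappa^2\sigma^2/2,\ \kappa^2\sigma^2)$; the following check is our illustration.

```python
# With S ~ Normal(-(kappa sigma)^2/2, (kappa sigma)^2):
#   h(0)  = E min(1, e^S)   should equal 2 Phi(-kappa sigma / 2),
#   h'(0) = E e^S 1{S <= 0} should equal   Phi(-kappa sigma / 2).
import numpy as np
from scipy.stats import norm

kappa, sigma, n = 1.0, 2.38, 10**6
rng = np.random.default_rng(3)
S = rng.normal(-(kappa * sigma)**2 / 2, kappa * sigma, size=n)
print("h(0)  MC:", np.minimum(1.0, np.exp(S)).mean(),
      "  exact:", 2 * norm.cdf(-kappa * sigma / 2))
print("h'(0) MC:", (np.exp(S) * (S <= 0)).mean(),
      "  exact:", norm.cdf(-kappa * sigma / 2))
```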