Within the Gibbs or hit-and-run scheme, at each step one needs to sample from a one-dimensional distribution, but a different one-dimensional distribution each time. As mentioned in section 11.1.2, this is in general not easy to implement efficiently. An alternative is Metropolized line sampling, where one instead takes a single step of a Metropolis (i.e. propose/accept) chain with the correct stationary distribution. To say the idea abstractly, in the general “line sampling” setting of section 11.2.2, assume also:
for each line $i$ we have an irreducible transition matrix $K^i$ on $L_i$ whose stationary distribution is $\pi_i$, the distribution $\pi$ conditioned on $L_i$.
Then define a step of the Metropolized line sampler from $x$ as
choose a line $i$ from $w(x,\cdot)$;
then choose the next state $y$ from $K^i(x,\cdot)$.
It is easy to check that the chain has stationary distribution $\pi$, and is reversible if the $K^i$ are reversible, so in particular if the $K^i$ are defined by a Metropolis-type propose-accept scheme. In the simplest setting, where the line sampler is the Gibbs sampler and we use the same one-dimensional proposal step distribution each time, this scheme is Metropolis-within-Gibbs. In that context it seems intuitively natural to use a long-tailed proposal distribution such as the Cauchy distribution, because we might encounter wildly different one-dimensional target densities, e.g. one density with s.d. $\sigma$ and another with two modes separated by $\Delta \gg \sigma$; using a proposal step of typical size $\delta$ would be inefficient in the latter case if $\delta$ is small, and inefficient in the former case if $\delta$ is large. Intuitively, a long-tailed distribution avoids these worst cases, at the cost of a smaller acceptance rate in good cases.
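The Metropolis-within-Gibbs scheme with Cauchy proposals can be sketched as follows; this is a minimal illustration, not code from the text, and the function name, the scale parameter and the bivariate Normal example target are our own choices.

```python
import numpy as np

def metropolis_within_gibbs(log_pi, x0, n_steps, scale=1.0, seed=0):
    """Sweep the coordinates; update each by a single Metropolis step
    with a long-tailed (Cauchy) proposal of the same scale every time."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    samples = np.empty((n_steps, d))
    for t in range(n_steps):
        for j in range(d):
            y = x.copy()
            y[j] += scale * rng.standard_cauchy()  # symmetric, long-tailed proposal
            # Metropolis accept/reject; the joint-density ratio equals
            # the one-dimensional conditional-density ratio
            if np.log(rng.random()) < log_pi(y) - log_pi(x):
                x = y
        samples[t] = x
    return samples

# Illustrative target: standard bivariate Normal (log density up to a constant)
samples = metropolis_within_gibbs(lambda z: -0.5 * float(z @ z), [3.0, -3.0], 20000)
```

After discarding a burn-in period, the empirical mean and variance of the samples should roughly match the target's values $0$ and $1$.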
In the setting (section 11.2.1) of the Metropolis scheme, one might consider making several draws from the proposal distribution and choosing one of them to be the proposed move. Here is one way, suggested by Liu et al [235], to implement this idea. It turns out that, to ensure the stationary distribution is the target distribution $\pi$, we need extra samples which are used only to adjust the acceptance probability of the proposed step.
For simplicity, we take the case of a symmetric proposal matrix $K$. Fix $k \ge 1$. Define a step from $x$ of the multiple-try Metropolis (MTM) chain as follows.
Choose $y_1, \dots, y_k$ independently from $K(x, \cdot)$;
Choose $y = y_J$ with probability proportional to $\pi(y_J)$, $1 \le J \le k$;
Choose $x_1^*, \dots, x_{k-1}^*$ independently from $K(y, \cdot)$, and set $x_k^* = x$;
Accept the proposed move $x \to y$ with probability
$$\min\left(1,\ \frac{\pi(y_1) + \cdots + \pi(y_k)}{\pi(x_1^*) + \cdots + \pi(x_k^*)}\right).$$
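On a small discrete state space the four steps above can be enumerated exactly, giving the MTM transition matrix in closed form; this also lets one check detailed balance numerically. A sketch assuming a symmetric proposal matrix $K$ with zero diagonal; the function name and the three-state numerical example are ours.

```python
import itertools
import numpy as np

def mtm_kernel(pi, K, k):
    """Exact MTM transition matrix, by enumerating the proposal tuple
    y_1..y_k, the chosen proposal y, and the reference points x*_1..x*_{k-1}."""
    n = len(pi)
    P = np.zeros((n, n))
    for x in range(n):
        for ys in itertools.product(range(n), repeat=k):       # the k proposals
            w_ys = np.prod([K[x, yy] for yy in ys])
            if w_ys == 0.0:
                continue
            S_y = sum(pi[yy] for yy in ys)
            for y in ys:                                       # choose y = y_J
                p_choose = pi[y] / S_y
                for xs in itertools.product(range(n), repeat=k - 1):  # references
                    w_xs = np.prod([K[y, xr] for xr in xs])
                    if w_xs == 0.0:
                        continue
                    S_x = sum(pi[xr] for xr in xs) + pi[x]     # x*_k = x
                    acc = min(1.0, S_y / S_x)                  # accept/reject
                    p = w_ys * p_choose * w_xs
                    P[x, y] += p * acc
                    P[x, x] += p * (1.0 - acc)
    return P

pi = np.array([0.5, 0.3, 0.2])
K = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])  # symmetric proposal
P = mtm_kernel(pi, K, k=2)
```

The rows of the resulting matrix sum to $1$, and $\pi(x)P(x,y) = \pi(y)P(y,x)$ holds, in agreement with the detailed-balance verification below.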
Irreducibility follows from irreducibility of $K$. To check detailed balance, write the acceptance probability as $a = a(y_1,\dots,y_k;\ x_1^*,\dots,x_k^*)$. Then, for $y \ne x$,
$$\pi(x) P(x,y) = k\,\pi(x) K(x,y) \sum \prod_{i=1}^{k-1} K(x,y_i) \prod_{i=1}^{k-1} K(y,x_i^*)\ \frac{\pi(y)}{\pi(y_1)+\cdots+\pi(y_k)}\ a$$
where the first sum is over ordered $(k-1)$-tuples $(y_1,\dots,y_{k-1})$ and $(x_1^*,\dots,x_{k-1}^*)$, with $y_k = y$ and $x_k^* = x$. So we can write
$$\pi(x) P(x,y) = k\, K(x,y) \sum \prod_{i=1}^{k-1} K(x,y_i) \prod_{i=1}^{k-1} K(y,x_i^*)\ \pi(x)\,\pi(y)\,\frac{a}{\pi(y_1)+\cdots+\pi(y_k)}.$$
The choice of $a$ makes the final term become $\min\left(\frac{1}{\pi(y_1)+\cdots+\pi(y_k)},\ \frac{1}{\pi(x_1^*)+\cdots+\pi(x_k^*)}\right)$. One can now check $\pi(x)P(x,y) = \pi(y)P(y,x)$, by switching the roles of $x$ and $y$.
To compare MTM with single-try Metropolis, consider the $k \to \infty$ limit, in which the empirical distribution of $y_1,\dots,y_k$ will approach $K(x,\cdot)$, and so the distribution of the chosen $y$ will approach $\pi(y)K(x,y)/c(x)$ for $y \ne x$, where $c(x) := \sum_z \pi(z)K(x,z)$. Thus for large $k$ the transition matrix of MTM will approximate
$$P_\infty(x,y) = \frac{\pi(y)K(x,y)}{c(x)}\ \min\left(1,\ \frac{c(x)}{c(y)}\right),\qquad y \ne x.$$
To compare with single-try Metropolis $P_1(x,y) = K(x,y)\min\left(1,\ \pi(y)/\pi(x)\right)$, $y \ne x$, rewrite both as
$$P_1(x,y) = K(x,y)\,\pi(y)\,\min\left(\frac{1}{\pi(x)},\ \frac{1}{\pi(y)}\right),\qquad P_\infty(x,y) = K(x,y)\,\pi(y)\,\min\left(\frac{1}{c(x)},\ \frac{1}{c(y)}\right).$$
Thinking of a step of the proposal chain as being in a random direction unrelated to the behavior of $\pi$, from a $\pi$-typical state $x$ we expect a proposed move to tend to make $\pi$ decrease, so we expect $c(x) < \pi(x)$ for $\pi$-typical $x$. In this sense, the equations above show that MTM is an improvement. Of course, if we judge "cost" in terms of the number of evaluations of $\pi(\cdot)$, then a step of MTM costs $2k-1$ times the cost of single-step Metropolis. By this criterion it seems implausible that MTM would be cheaper than single-step Metropolis. On the other hand, one can envisage settings where there is substantial cost in updating a data structure associated with the current state $x$, and in such a setting MTM may be more appealing.
Writing $\pi(x) \propto \exp(-H(x))$, as in the statistical physics imagery (section 11.1.2), suggests defining a one-parameter family of probability distributions by
$$\pi^{(\beta)}(x) \propto \exp(-\beta H(x)), \qquad 0 \le \beta \le 1.$$
(In the physics analogy, $\beta$ corresponds to 1/temperature.) If $\pi$ is multimodal we picture $\pi^{(\beta)}$, as $\beta$ increases from $0$ to $1$, interpolating between the uniform distribution and $\pi$ by making the potential wells grow deeper. Fix a proposal matrix $K$, and let $P^{(\beta)}$ be the transition matrix for the Metropolized chain (11.5) associated with $K$ and $\pi^{(\beta)}$. Now fix $m$ and values $0 \le \beta_1 < \beta_2 < \cdots < \beta_m = 1$. The idea is that for small $\beta$ the $P^{(\beta)}$-chain should have less difficulty moving between wells; for $\beta = 1$ we get the correct distribution within each well; so by varying $\beta$ we can somehow sample accurately from all wells. There are several ways to implement this idea. Simulated tempering [254] defines a chain on state space $S \times \{1,\dots,m\}$, where state $(x,i)$ represents configuration $x$ and parameter $\beta_i$, and where each step is either of the form
$(x,i) \to (y,i)$; a step of $P^{(\beta_i)}$;
or of the form
$(x,i) \to (x,i')$; where $i' = i \pm 1$ is a proposed step of simple random walk on $\{1,\dots,m\}$.
However, implementing this idea is slightly intricate, because the unknown normalizing constants of the $\pi^{(\beta_i)}$ enter into the desired acceptance probabilities. A more elegant variation is the multilevel exchange chain suggested by Geyer [163] and implemented in statistical physics by Hukushima and Nemoto [185]. First consider $m$ independent chains, where the $i$'th chain has transition matrix $P^{(\beta_i)}$. Then introduce an interaction: propose to switch configurations $x_i$ and $x_{i+1}$, and accept with the appropriate probability. Precisely, take state space $S^m$ with states $\mathbf{x} = (x_1,\dots,x_m)$. Fix a (small) number $0 < \delta < 1$.
With probability $1-\delta$: pick $i$ uniformly from $\{1,\dots,m\}$, pick $y$ according to $P^{(\beta_i)}(x_i,\cdot)$ and update $\mathbf{x}$ by changing $x_i$ to $y$.
With probability $\delta$: pick uniformly an adjacent pair $(i,i+1)$, and propose to update $\mathbf{x}$ by replacing $(x_i,x_{i+1})$ by $(x_{i+1},x_i)$. Accept this proposed move with probability
$$\min\left(1,\ \frac{\pi^{(\beta_i)}(x_{i+1})\,\pi^{(\beta_{i+1})}(x_i)}{\pi^{(\beta_i)}(x_i)\,\pi^{(\beta_{i+1})}(x_{i+1})}\right).$$
(Here the normalizing constants cancel, so the ratio is computable directly from $H$.)
To check that the product $\pi^{(\beta_1)} \times \cdots \times \pi^{(\beta_m)}$ is indeed a stationary distribution, write the acceptance probability as $\min\left(1,\ \frac{\pi^{(\beta_i)}(x_{i+1})\,\pi^{(\beta_{i+1})}(x_i)}{\pi^{(\beta_i)}(x_i)\,\pi^{(\beta_{i+1})}(x_{i+1})}\right)$. If $\mathbf{x}$ and $\mathbf{x}'$ differ only by interchange of $(x_i, x_{i+1})$ then
$$\prod_j \pi^{(\beta_j)}(x_j)\ P(\mathbf{x}, \mathbf{x}') = \frac{\delta}{m-1} \prod_{j \ne i, i+1} \pi^{(\beta_j)}(x_j)\ \times\ \pi^{(\beta_i)}(x_i)\,\pi^{(\beta_{i+1})}(x_{i+1})\ \min\left(1,\ \frac{\pi^{(\beta_i)}(x_{i+1})\,\pi^{(\beta_{i+1})}(x_i)}{\pi^{(\beta_i)}(x_i)\,\pi^{(\beta_{i+1})}(x_{i+1})}\right)$$
and the definition of the acceptance probability makes the expression $\frac{\delta}{m-1} \prod_{j \ne i, i+1} \pi^{(\beta_j)}(x_j)\ \min\left(\pi^{(\beta_i)}(x_i)\,\pi^{(\beta_{i+1})}(x_{i+1}),\ \pi^{(\beta_i)}(x_{i+1})\,\pi^{(\beta_{i+1})}(x_i)\right)$, which is symmetric under interchanging $\mathbf{x}$ and $\mathbf{x}'$, giving detailed balance. The case of steps where only one component changes is easier to check.
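The two kinds of step can be sketched in a short simulation; this is an illustration under our own choices (Gaussian within-level proposals on the real line, a double-well energy $H(x) = 4(x^2-1)^2$, and three $\beta$ levels), not an implementation from the cited papers. The exchange acceptance uses the simplified ratio noted above.

```python
import numpy as np

def exchange_chain(H, betas, n_steps, step=0.5, delta=0.2, seed=0):
    """Multilevel exchange sketch: one configuration per level beta_i,
    targeting pi^{beta}(x) ∝ exp(-beta * H(x)) at level beta.
    Returns the trace of the beta_m = 1 level, which targets pi itself."""
    rng = np.random.default_rng(seed)
    m = len(betas)
    xs = np.zeros(m)
    trace = np.empty(n_steps)
    for t in range(n_steps):
        if rng.random() < 1.0 - delta:
            # within-level move: one Metropolis step for pi^{beta_i}
            i = rng.integers(m)
            y = xs[i] + step * rng.standard_normal()
            if np.log(rng.random()) < -betas[i] * (H(y) - H(xs[i])):
                xs[i] = y
        else:
            # exchange move: the acceptance ratio reduces to
            # exp((beta_{i+1} - beta_i) * (H(x_{i+1}) - H(x_i)))
            i = rng.integers(m - 1)
            if np.log(rng.random()) < (betas[i + 1] - betas[i]) * (H(xs[i + 1]) - H(xs[i])):
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
        trace[t] = xs[-1]
    return trace

# Bimodal target: wells near x = ±1, barrier at x = 0
trace = exchange_chain(lambda x: 4.0 * (x * x - 1.0) ** 2,
                       betas=[0.1, 0.4, 1.0], n_steps=30000)
```

With the hot ($\beta = 0.1$) level crossing the barrier freely and configurations propagating down through the exchanges, the $\beta = 1$ trace visits both wells, which a single Metropolis chain at $\beta = 1$ would do only rarely.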
Consider the setting of section 11.2.2. There is a target distribution $\pi$ on $S$ and a collection of subsets ("lines") $\{L_i,\ i \in I\}$. Write $\pi(L_i) = \sum_{x \in L_i} \pi(x)$ and $\pi_i(x) = \pi(x)/\pi(L_i)$ for $x \in L_i$. Now fix $m \ge 2$. We can use the line-sampling scheme of section 11.2.2 to define (recall Chapter 4 section 6.2) (yyy 10/11/94 version) a product chain on $S^m$ with stationary distribution $\pi \times \cdots \times \pi$. For this product chain, picture $m$ particles; at each step, pick a random particle and make it move as a step from the line-sampling chain. Now let us introduce an interaction: the line along which a particle moves may depend on the positions of the other particles.
Here is a precise construction. Suppose that for each position $j$ and each configuration $\mathbf{x} = (x_1,\dots,x_m)$ we are given a probability distribution $w_j(\mathbf{x}, \cdot)$ on $\{i : x_j \in L_i\}$ satisfying the following analog of (11.6):

(11.8) $\quad w_j(\mathbf{x}, i) = w_j(\mathbf{x}', i)$ whenever $\mathbf{x}$ and $\mathbf{x}'$ differ only in the $j$'th coordinate and $x_j, x'_j \in L_i$.
A step of the chain from $\mathbf{x} = (x_1,\dots,x_m)$ is defined by
Pick $j$ uniformly from $\{1,\dots,m\}$;
Pick $i$ from $w_j(\mathbf{x}, \cdot)$;
Pick $y$ from $\pi_i$;
Update $\mathbf{x}$ by replacing $x_j$ by $y$.
It is easy to check that $\pi \times \cdots \times \pi$ is indeed a stationary distribution, and the chain is irreducible under condition (11.7). Of course we could, as in section 11.3.1, use a Metropolis step instead of sampling from $\pi_i$.
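The stationarity claim can be checked exactly on a toy example. Below, $S$ is a $2 \times 2$ grid, the lines are the two rows and two columns, and there are $m = 2$ particles; the probability of choosing the row versus the column through the updated particle depends on where the other particle sits (the interaction), but not on where the updated particle sits along the line, as the weight condition requires. The target values and the $0.7/0.3$ weights are arbitrary illustrative choices of ours.

```python
import numpy as np

# States of S: a 2x2 grid, s = 2*r + c; lines are the two rows and two columns.
pi = np.array([0.4, 0.3, 0.2, 0.1])
rows = [[0, 1], [2, 3]]
cols = [[0, 2], [1, 3]]

def line_weight_row(other):
    """Probability of choosing the row through the updated particle.
    It depends on the OTHER particle (the interaction), but is constant
    along any line containing the updated particle, as (11.8) requires."""
    return 0.7 if other // 2 == 0 else 0.3

def multiparticle_kernel():
    """Exact transition matrix of the two-particle chain on S x S."""
    P = np.zeros((16, 16))
    for s1 in range(4):
        for s2 in range(4):
            a = 4 * s1 + s2
            for j in range(2):                       # pick particle j uniformly
                xj, other = (s1, s2) if j == 0 else (s2, s1)
                w_row = line_weight_row(other)
                for w, line in [(w_row, rows[xj // 2]), (1 - w_row, cols[xj % 2])]:
                    pl = sum(pi[z] for z in line)    # pi(L_i)
                    for y in line:                   # sample from pi_i
                        b = 4 * y + s2 if j == 0 else 4 * s1 + y
                        P[a, b] += 0.5 * w * (pi[y] / pl)
    return P

P = multiparticle_kernel()
pi2 = np.outer(pi, pi).ravel()   # product distribution pi x pi on S x S
```

One can verify numerically that the rows of the kernel sum to $1$ and that the product distribution is stationary (indeed the chain is reversible here, since each single-particle update is reversible).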
Constructions of this type in statistical applications on $R^d$ go back to Gilks et al [166], under the name adaptive directional sampling. In particular they suggested picking a distinct pair $(x_{j_1}, x_{j_2})$ of the "particles" and taking the straight line through $x_{j_1}$ and $x_{j_2}$ as the line to sample from. Liu et al [235] suggest combining this idea with mode-hunting. Again pick a distinct pair of "particles" $(x_{j_1}, x_{j_2})$; but now use some algorithm to find a local maximum $x^*$ of the target density, starting from $x_{j_1}$, and sample from the line through $x_{j_2}$ and $x^*$.