9 A Second Look at General Markov Chains (April 21, 1995)

9.3 Self-verifying algorithms for sampling from a stationary distribution

To start with an analogy, we can in principle compute a mean hitting time $E_iT_j$ from the transition matrix ${\bf P}$, but we could alternatively estimate $E_iT_j$ by "pure simulation": simulate $m$ times the chain started at $i$ and run until hitting $j$; then (roughly speaking) the empirical average of these $m$ hitting times will be $(1\pm O(m^{-1/2}))E_iT_j$. In particular, for fixed $\varepsilon$ we can (roughly speaking) estimate $E_iT_j$ to within a factor $(1\pm\varepsilon)$ in $O(E_iT_j)$ steps. Analogously, consider some notion of mixing time $\tau$ (say $\tau_1$ or $\tau_2$, in the reversible setting). The focus in this book has been on theoretical methods for bounding $\tau$ in terms of ${\bf P}$, and on theoretical consequences of such bounds. But can we bound $\tau$ by pure simulation? More importantly, in the practical context of Markov chain Monte Carlo, can we devise a "self-verifying" algorithm which produces an approximately-stationary sample from a chain in $O(\tau)$ steps without having prior knowledge of $\tau$?
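To make the idea concrete, here is a minimal Python sketch of this pure-simulation estimate, assuming an oracle `sample_step(state)` which returns one sample from the jump distribution $p(\mathrm{state},\cdot)$; both the oracle name and the small 3-state chain at the end are illustrative only.

```python
import random

def estimate_hitting_time(sample_step, i, j, m):
    """Estimate E_i T_j by averaging m independent simulated hitting times.

    `sample_step(state)` is assumed to return one sample from the jump
    distribution p(state, .); the chain itself is otherwise unknown.
    """
    total = 0
    for _ in range(m):
        state, steps = i, 0
        while state != j:
            state = sample_step(state)
            steps += 1
        total += steps
    return total / m   # roughly within a factor (1 +/- O(m^{-1/2})) of E_i T_j

# Illustrative 3-state chain used as the simulation oracle.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.3, 0.0, 0.7]]

def step(s):
    return random.choices(range(3), weights=P[s])[0]

print(estimate_hitting_time(step, 0, 2, m=10_000))
```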

xxx tie up with MCMC discussion.

To say things a little more carefully, a "pure simulation" algorithm is one in which the transition matrix ${\bf P}$ is unknown to the algorithm. Instead, there is a list of the states, and at each step the algorithm can obtain, for any state $i$, a sample from the jump distribution $p(i,\cdot)$, independent of previous samples.

In the MCMC context we typically have an exponentially large state space and seek polynomial-time estimates. The next lemma (which we leave to the reader to state and prove more precisely) shows that no pure simulation algorithm can guarantee to do this.

Lemma 9.16

Consider a pure simulation algorithm which, given any irreducible $n$-state chain, eventually outputs a random state whose distribution is guaranteed to be within $\varepsilon$ of the stationary distribution in variation distance. Then the algorithm must take $\Omega(n)$ steps for every ${\bf P}$.

Outline of proof. If there is a state $k$ with the property that $1-p(k,k)$ is extremely small, then the stationary distribution will be almost concentrated on $k$; an algorithm which has some chance of terminating without sampling a step from every state cannot possibly guarantee that no unvisited state $k$ has this property. $\Box$

9.3.1 Exact sampling via the Markov chain tree theorem

Lovász and Winkler [241] observed that the Markov chain tree theorem (Theorem 9.10) could be used to give a "pure simulation" algorithm for generating exactly from the stationary distribution of an arbitrary $n$-state chain. The algorithm takes

$$O(\tau_1^*\ n^2\log n) \qquad (9.14)$$

steps, where $\tau_1^*$ is the mixing time parameter defined as the smallest $t$ such that

$$P_i(X_{U_\sigma}=j)\geq\tfrac{1}{2}\pi_j\mbox{ for all }i,j\in I,\ \sigma\geq t \qquad (9.15)$$

where $U_\sigma$ denotes a random time uniform on $\{0,1,\ldots,\sigma-1\}$, independent of the chain.

xxx tie up with Chapter 4 discussion and [240].

The following two facts are the mathematical ingredients of the algorithm. We quote as Lemma 9.17(a) a result of Ross [299] (see also [53] Theorem XIV.37); part (b) is an immediate consequence.

Lemma 9.17

(a) Let $\pi$ be a probability distribution on $I$ and let $(F_i;i\in I)$ be independent with distribution $\pi$. Fix $j$, and consider the digraph with edges $\{(i,F_i):i\neq j\}$. Then with probability (exactly) $\pi_j$, the digraph is a tree with edges directed toward the root $j$.

(b) So if $j$ is first chosen uniformly at random from $I$, then the unconditional probability of obtaining a tree directed toward the root $j$ is exactly $1/n$.
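A quick Monte Carlo check of part (a), written as a Python sketch with an arbitrary small $\pi$ chosen purely for illustration: the empirical frequency with which the digraph is a tree rooted at $j$ should be close to $\pi_j$.

```python
import random

def is_tree_rooted_at(F, j, n):
    """Check whether the digraph with edges {(i, F[i]) : i != j} is a tree
    with all edges directed toward the root j: every i != j must lead to j
    by following F."""
    for i in range(n):
        if i == j:
            continue
        cur, hops = i, 0
        while cur != j:
            cur = F[cur]
            hops += 1
            if hops > n:          # trapped in a cycle avoiding j
                return False
    return True

# Illustrative choice of pi and root j.
pi = [0.5, 0.3, 0.2]
n, j, trials = len(pi), 1, 100_000
hits = 0
for _ in range(trials):
    F = [random.choices(range(n), weights=pi)[0] for _ in range(n)]
    if is_tree_rooted_at(F, j, n):
        hits += 1
print(hits / trials, "should be close to", pi[j])
```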

As the second ingredient, observe that the Markov chain tree formula (Corollary 9.11) can be rephrased as follows.

Corollary 9.18

Let $\pi$ be the stationary distribution for a transition matrix ${\bf P}$ on $I$. Let $J$ be random, uniform on $I$. Let $(\xi_i;i\in I)$ be independent, with $P(\xi_i=j)=p_{ij}$. Consider the digraph with edges $\{(i,\xi_i):i\neq J\}$. Then, conditional on the digraph being a tree with edges directed toward the root $J$, the probability that $J=j$ equals $\pi_j$.

So consider the special case of a chain with the property

$$p^*_{ij}\geq(1/2)^{1/n}\pi_j\ \ \forall i,j. \qquad (9.16)$$

The probability of getting any particular digraph under the procedure of Corollary 9.18 is then at least $1/2$ times the probability of getting that digraph under the procedure of Lemma 9.17 (each of the $n-1$ edges carries a factor at least $(1/2)^{1/n}$, and $((1/2)^{1/n})^{n-1}\geq 1/2$), and so by Lemma 9.17(b) the probability of getting some tree is at least $1/(2n)$. So if the procedure of Corollary 9.18 is repeated $r=\lceil 2n\log 4\rceil$ times, the chance that some repetition produces a tree is at least $1-(1-\tfrac{1}{2n})^{2n\log 4}\geq 1-e^{-\log 4}=3/4$, and then the root $J$ of the tree has distribution exactly $\pi$.

Now for any chain, fix $\sigma>\tau_1^*$. The submultiplicativity (yyy) property of separation, applied to the chain with transition probabilities $\tilde{p}_{ij}=P_i(X_{U_\sigma}=j)$, shows that if $V$ denotes the sum of $m$ independent copies of $U_\sigma$, and $\xi_i$ is the state reached after $V$ steps of the chain started at $i$, then

$$P(\xi_i=j)\equiv P_i(X_V=j)\geq(1-2^{-m})\pi_j\ \ \forall i,j.$$

So putting $m=\lceil-\log_2(1-(1/2)^{1/n})\rceil=\Theta(\log n)$, the probabilities $(P(\xi_i=j))$ satisfy (9.16).

Combining these procedures, we have (for fixed $\sigma>\tau_1^*$) an algorithm which, in a mean number $nm\sigma r=O(\sigma n^2\log n)$ of steps, has chance $\geq 3/4$ to produce an output, and (if so) the output has distribution exactly $\pi$. Of course we initially don't know the right $\sigma$ to use, but we simply try $n,2n,4n,8n,\ldots$ in turn until some output appears, and the mean total number of steps will satisfy the asserted bound (9.14).
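The whole procedure can be sketched in Python as follows, assuming only the pure-simulation oracle `sample_step(state)` (a hypothetical name) returning one sample from $p(\mathrm{state},\cdot)$, with the states relabelled $0,\ldots,n-1$; whenever the routine terminates, the output is exactly $\pi$-distributed by Corollary 9.18.

```python
import math
import random

def is_tree_rooted_at(xi, J, n):
    """True iff the digraph with edges {(i, xi[i]) : i != J} is a tree
    with all edges directed toward the root J."""
    for i in range(n):
        if i == J:
            continue
        cur, hops = i, 0
        while cur != J:
            cur = xi[cur]
            hops += 1
            if hops > n:          # trapped in a cycle avoiding J
                return False
    return True

def exact_sample(sample_step, n):
    """Sketch of the Lovasz-Winkler pure-simulation exact sampler."""
    # m chosen so that (9.16) holds once sigma exceeds tau_1^*; m = Theta(log n)
    m = math.ceil(-math.log2(1.0 - 0.5 ** (1.0 / n)))
    r = math.ceil(2 * n * math.log(4))        # tree attempts per value of sigma
    sigma = n
    while True:                                # try sigma = n, 2n, 4n, 8n, ...
        for _ in range(r):
            J = random.randrange(n)            # uniform random root candidate
            xi = {}
            for i in range(n):
                if i == J:
                    continue
                # xi[i] = state reached after V steps from i, V a sum of m copies of U_sigma
                V = sum(random.randrange(sigma) for _ in range(m))
                state = i
                for _ in range(V):
                    state = sample_step(state)
                xi[i] = state
            if is_tree_rooted_at(xi, J, n):
                return J                       # exactly pi-distributed (Corollary 9.18)
        sigma *= 2
```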

9.3.2 Approximate sampling via coalescing paths

A second approach involves the parameter $\tau_0=\sum_j\pi_j E_iT_j$ arising in the random target lemma (Chapter 2 yyy). Aldous [24] gives an algorithm which, given ${\bf P}$ and $\varepsilon>0$, outputs a random state $\xi$ for which $||P(\xi\in\cdot)-\pi||\leq\varepsilon$, and such that the mean number of steps is at most

$$81\,\tau_0/\varepsilon^2. \qquad (9.17)$$

The details are messy, so let us just outline the (simple) underlying idea. Suppose we can define a procedure which terminates in some random number $Y$ of steps, where $Y$ is an estimate of $\tau_0$: precisely, suppose that for any ${\bf P}$

$$P(Y\leq\tau_0)\leq\varepsilon;\ \ \ EY\leq K\tau_0 \qquad (9.18)$$

where $K$ is an absolute constant. We can then define an algorithm as follows.

Simulate $Y$; then run the chain for $U_{Y/\varepsilon}$ steps and output the final state $\xi$,

where as above $U_\sigma$ denotes a random time uniform on $\{0,1,\ldots,\sigma-1\}$, independent of the chain. This works because, arguing as at xxx,

$$||P(X_{U_\sigma}\in\cdot)-\pi||\leq\tau_0/\sigma$$

and so

$$||P(\xi\in\cdot)-\pi||\ \leq\ E\min\Big(1,\frac{\tau_0}{Y/\varepsilon}\Big)\leq 2\varepsilon,$$

the second inequality holding because the quantity inside the expectation is at most $\varepsilon$ on the event $\{Y>\tau_0\}$ and at most $1$ on its complement, which by (9.18) has probability at most $\varepsilon$.

And the mean number of steps is $(1+\frac{1}{2\varepsilon})EY$.

So the issue is to define a procedure terminating in $Y$ steps, where $Y$ satisfies (9.18). Label the states $\{1,2,\ldots,n\}$ and consider the following coalescing paths routine.

(i) Pick a uniform random state $J$.

(ii) Start the chain at state $1$, run until hitting state $J$, and write $A_1$ for the set of states visited along the path.

(iii) Restart the chain at state $\min\{j:j\not\in A_1\}$, run until hitting some state in $A_1$, and write $A_2$ for the union of $A_1$ and the set of states visited by this second path.

(iv) Restart the chain at state $\min\{j:j\not\in A_2\}$, and continue this procedure until every state has been visited. Let $Y$ be the total number of steps.

The random target lemma says that the mean number of steps in (ii) equals $\tau_0$, making this $Y$ a plausible candidate for a quantity satisfying (9.18). A slightly more complicated algorithm is in fact needed – see [24].
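A Python sketch of the routine, and of the simplified sampler built on it, assuming again the hypothetical oracle `sample_step(state)` returning a sample from $p(\mathrm{state},\cdot)$ and states relabelled $0,\ldots,n-1$; the actual algorithm of [24] is more elaborate.

```python
import math
import random

def coalescing_paths_Y(sample_step, n):
    """The coalescing paths routine: returns the total number of steps Y."""
    J = random.randrange(n)                # (i) uniform random target
    visited = {0}                          # (ii) first path starts at state 0
    state, Y = 0, 0
    while state != J:
        state = sample_step(state)
        visited.add(state)
        Y += 1
    while len(visited) < n:                # (iii)-(iv) restart from smallest unvisited state
        state = min(s for s in range(n) if s not in visited)
        path = {state}
        while state not in visited:        # run until hitting the previously visited set
            state = sample_step(state)
            path.add(state)
            Y += 1
        visited |= path
    return Y

def approx_sample(sample_step, n, eps):
    """Simplified sampler built on Y: run the chain for a
    Uniform{0, ..., ceil(Y/eps) - 1} number of steps from an arbitrary
    start and output the final state."""
    Y = coalescing_paths_Y(sample_step, n)
    T = random.randrange(max(1, math.ceil(Y / eps)))
    state = 0                              # arbitrary initial state
    for _ in range(T):
        state = sample_step(state)
    return state
```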

9.3.3 Exact sampling via backwards coupling

Write $U$ for a r.v. uniform on $[0,1]$, and $(U_t)$ for an independent sequence of copies of $U$. Given a probability distribution on $I$, we can find a (far from unique!) function $f:[0,1]\rightarrow I$ such that $f(U)$ has the prescribed distribution. So given a transition matrix ${\bf P}$ we can find a function $f:I\times[0,1]\rightarrow I$ such that $P(f(i,U)=j)=p_{ij}$. Fix such a function. Simultaneously for each state $i$, define

$$X^{(i)}_0=i;\ \ \ X^{(i)}_t=f(X^{(i)}_{t-1},U_t),\ t=1,2,\ldots.$$

xxx tie up with coupling treatment

Consider the (forwards) coupling time

$$C^*=\min\{t:X^{(i)}_t=X^{(j)}_t\ \forall i,j\}\leq\infty.$$

By considering an initial state $j$ chosen according to the stationary distribution $\pi$,

$$\max_i||P_i(X_t\in\cdot)-\pi||\leq P(C^*>t).$$

This can be used as the basis for an approximate sampling algorithm. As a simple implementation, repeat the procedure defining $C^*$ $k$ times; supposing we get finite values $C^*_1,\ldots,C^*_k$ each time, run the chain from an arbitrary initial state for $\max_{1\leq j\leq k}C^*_j$ steps and output the final state $\xi$. Then the error $||P(\xi\in\cdot)-\pi||$ is bounded by a function $\delta(k)$ such that $\delta(k)\rightarrow 0$ as $k\rightarrow\infty$.
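For illustration, here is a Python sketch of this simple implementation; `make_f` is one concrete (far from unique) choice of update function built from the transition matrix by inverse-CDF sampling, and we assume the coupling coalesces almost surely.

```python
import random

def make_f(P):
    """One choice of update function with P(f(i,U) = j) = p_ij:
    inverse-CDF sampling driven by the uniform variable u."""
    def f(i, u):
        acc = 0.0
        for j, p in enumerate(P[i]):
            acc += p
            if u <= acc:
                return j
        return len(P[i]) - 1               # guard against floating-point rounding
    return f

def forward_coupling_time(f, n):
    """Run copies from all n states with shared uniforms; return the first
    time C* at which they have all coalesced (assumed to happen a.s.)."""
    states, t = list(range(n)), 0
    while len(set(states)) > 1:
        u = random.random()
        states = [f(x, u) for x in states]
        t += 1
    return t

def approx_sample_forward(P, k):
    """Repeat the coupling k times, then run the chain from an arbitrary
    start for max_j C*_j steps and output the final state."""
    n, f = len(P), make_f(P)
    horizon = max(forward_coupling_time(f, n) for _ in range(k))
    state = 0                              # arbitrary initial state
    for _ in range(horizon):
        state = f(state, random.random())
    return state
```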

Propp and Wilson [286] observed that by using instead a backwards coupling method (which has been exploited in other contexts – see Notes) one could make an exact sampling algorithm. Regard our i.i.d. sequence $(U_t)$ as defined for $-\infty<t\leq 0$. For each state $i$ and each time $s<0$ define

$$X^{(i,s)}_s=i;\ \ \ X^{(i,s)}_t=f(X^{(i,s)}_{t-1},U_t),\ t=s+1,s+2,\ldots,0.$$

Consider the backwards coupling time

$$C=\max\{t:X^{(i,t)}_0=X^{(j,t)}_0\ \forall i,j\}\geq-\infty.$$
Lemma 9.19 (Backwards coupling lemma)

If $S$ is a random time such that $-\infty<S\leq C$ a.s. then the random variable $X^{(i,S)}_0$ does not depend on $i$ and has the stationary distribution $\pi$.
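Here is a hedged Python sketch of the exact sampler the lemma suggests: the starting depth is doubled until the runs from every state, driven by the same fixed uniforms $U_0,U_{-1},U_{-2},\ldots$, agree at time $0$. The update function $f$ may be taken as in the previous sketch (e.g. `make_f(P)`), and we assume coalescence eventually occurs.

```python
import random

def backwards_coupling_sample(f, n):
    """Sketch of exact sampling via backwards coupling: reuse the fixed
    uniforms U_0, U_{-1}, ... and double the starting depth until the runs
    X^{(i,-depth)} from every state i agree at time 0; by Lemma 9.19 the
    common value has distribution exactly pi."""
    U = {}                                  # U[k] stands for U_{-k}, k = 0, 1, 2, ...
    depth = 1
    while True:
        for k in range(depth):
            if k not in U:
                U[k] = random.random()      # fix U_{-k} once and for all
        states = list(range(n))             # X^{(i,-depth)}_{-depth} = i for each i
        for t in range(-depth + 1, 1):      # apply X_t = f(X_{t-1}, U_t), t = -depth+1, ..., 0
            states = [f(x, U[-t]) for x in states]
        if len(set(states)) == 1:
            return states[0]                # the common value X^{(i,-depth)}_0
        depth *= 2
```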

xxx describe algorithm

xxx poset story

xxx analysis in general setting and in poset setting.

xxx compare the 3 methods