9.3 Self-verifying algorithms for sampling from a stationary distribution

To start with an analogy, we can in principle compute a mean hitting time $E_{i}T_{j}$ from the transition matrix ${\bf P}$, but we could alternatively estimate $E_{i}T_{j}$ by “pure simulation”: simulate $m$ times the chain started at $i$ and run until hitting $j$, and then (roughly speaking) the empirical average of these $m$ hitting times will be $(1\pm O(m^{-1/2}))E_{i}T_{j}$. In particular, for fixed $\varepsilon$ we can (roughly speaking) estimate $E_{i}T_{j}$ to within a factor $(1\pm\varepsilon)$ in $O(E_{i}T_{j})$ steps. Analogously, consider some notion of mixing time $\tau$ (say $\tau_{1}$ or $\tau_{2}$, in the reversible setting). The focus in this book has been on theoretical methods for bounding $\tau$ in terms of ${\bf P}$, and of theoretical consequences of such bounds. But can we bound $\tau$ by pure simulation? More importantly, in the practical context of Markov chain Monte Carlo, can we devise a “self-verifying” algorithm which produces an approximately-stationary sample from a chain in $O(\tau)$ steps without having prior knowledge of $\tau$?

xxx tie up with MCMC discussion.

To say things a little more carefully, a “pure simulation” algorithm is one in which the transition matrix ${\bf P}$ is unknown to the algorithm. Instead, there is a list of the states, and at each step the algorithm can obtain, for any state $i$, a sample from the jump distribution $p(i,\cdot)$, independent of previous samples.
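To make this access model concrete, here is a minimal sketch in Python (the class and method names are illustrative, not from the text): the transition matrix is hidden inside an oracle object, and a pure simulation algorithm may only request one jump from a chosen state.

```python
import random

class JumpOracle:
    """Hides the transition matrix P; exposes only one-step sampling.

    A 'pure simulation' algorithm sees the list of states and this
    oracle, never P itself.  (Illustrative names, not from the text.)
    """

    def __init__(self, P, seed=None):
        self._P = P                      # hidden from the algorithm
        self._rng = random.Random(seed)
        self.states = list(range(len(P)))
        self.calls = 0                   # number of simulation steps used

    def sample_step(self, i):
        """Return one sample from the jump distribution p(i, .)."""
        self.calls += 1
        u, acc = self._rng.random(), 0.0
        for j, pij in enumerate(self._P[i]):
            acc += pij
            if u < acc:
                return j
        return len(self._P) - 1          # guard against rounding error

oracle = JumpOracle([[0.5, 0.5], [0.25, 0.75]], seed=0)
print(oracle.sample_step(0))             # one sample from p(0, .)
```

The `calls` counter is the natural cost measure for the step-count bounds discussed below.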

In the MCMC context we typically have an exponentially large state space and seek polynomial-time estimates. The next lemma (which we leave to the reader to state and prove more precisely) shows that no pure simulation algorithm can guarantee to do this.

Lemma 9.16

Consider a pure simulation algorithm which, given any irreducible $n$-state chain, eventually outputs a random state whose distribution is guaranteed to be within $\varepsilon$ of the stationary distribution in variation distance. Then the algorithm must take $\Omega(n)$ steps for every ${\bf P}$.

Outline of proof. If there is a state $k$ with the property that $1-p(k,k)$ is extremely small, then the stationary distribution will be almost concentrated on $k$; an algorithm which has some chance of terminating without sampling a step from every state cannot possibly guarantee that no unvisited state $k$ has this property. $\Box$

9.3.1 Exact sampling via the Markov chain tree theorem

Lovász and Winkler [241] observed that the Markov chain tree theorem (Theorem 9.10) could be used to give a “pure simulation” algorithm for generating exactly from the stationary distribution of an arbitrary $n$-state chain. The algorithm takes

 $O(\tau_{1}^{*}\ n^{2}\log n)$ (9.14)

steps, where $\tau_{1}^{*}$ is the mixing time parameter defined as the smallest $t$ such that

 $P_{i}(X_{U_{\sigma}}=j)\geq\frac{1}{2}\pi_{j}\mbox{ for all }i,j\in I,\ \sigma\geq t$ (9.15)

where $U_{\sigma}$ denotes a random time uniform on $\{0,1,\ldots,\sigma-1\}$, independent of the chain.

xxx tie up with Chapter 4 discussion and [240].

The following two facts are the mathematical ingredients of the algorithm. We quote as Lemma 9.17(a) a result of Ross [299] (see also [53] Theorem XIV.37); part (b) is an immediate consequence.

Lemma 9.17

(a) Let $\pi$ be a probability distribution on $I$ and let $(F_{i};i\in I)$ be independent with distribution $\pi$. Fix $j$, and consider the digraph with edges $\{(i,F_{i}):i\neq j\}$. Then with probability (exactly) $\pi_{j}$, the digraph is a tree with edges directed toward the root $j$.

(b) So if $j$ is first chosen uniformly at random from $I$, then the probability above is exactly $1/n$.
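Lemma 9.17(a) is easy to probe numerically. The following sketch (function names and the choice of $\pi$ are illustrative) draws the pointers $F_{i}$ i.i.d. from $\pi$ and estimates the probability that the resulting digraph is a tree rooted at a fixed $j$; by the lemma this should be close to $\pi_{j}$.

```python
import random

def is_tree_rooted_at(pointer, j, n):
    """True iff following i -> pointer[i] from every i != j reaches j."""
    for i in range(n):
        if i == j:
            continue
        seen, cur = set(), i
        while cur != j:
            if cur in seen:          # cycle avoiding j: not such a tree
                return False
            seen.add(cur)
            cur = pointer[cur]
    return True

def root_prob_estimate(pi, j, trials, seed=0):
    """Monte Carlo estimate of P(digraph is a tree rooted at j)."""
    rng = random.Random(seed)
    n, hits = len(pi), 0
    for _ in range(trials):
        # F_i i.i.d. with distribution pi (the value F_j is never used)
        F = [rng.choices(range(n), weights=pi)[0] for _ in range(n)]
        if is_tree_rooted_at(F, j, n):
            hits += 1
    return hits / trials

pi = [0.5, 0.3, 0.2]
est = root_prob_estimate(pi, j=0, trials=20000)
print(est)   # should be close to pi[0] = 0.5
```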

As the second ingredient, observe that the Markov chain tree formula (Corollary 9.11) can be rephrased as follows.

Corollary 9.18

Let $\pi$ be the stationary distribution for a transition matrix ${\bf P}$ on $I$. Let $J$ be random, uniform on $I$. Let $(\xi_{i};i\in I)$ be independent, with $P(\xi_{i}=j)=p_{ij}$. Consider the digraph with edges $\{(i,\xi_{i}):i\neq J\}$. Then, conditional on the digraph being a tree with edges directed toward the root $J$, the probability that $J=j$ equals $\pi_{j}$.

So consider the special case of a chain with the property

 $p^{*}_{ij}\geq(1/2)^{1/n}\pi_{j}\ \forall i,j.$ (9.16)

The probability of getting any particular digraph under the procedure of Corollary 9.18 is at least $1/2$ times the probability of getting that digraph under the procedure of Lemma 9.17, and so the probability of getting some tree is at least $1/(2n)$, by Lemma 9.17(b). So if the procedure of Corollary 9.18 is repeated $r=\lceil 2n\log 4\rceil$ times, the chance that some repetition produces a tree is at least $1-(1-1/(2n))^{r}\geq 1-e^{-\log 4}=3/4$, and then the root $J$ of the tree has distribution exactly $\pi$.
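In code, the repeat-until-tree procedure of Corollary 9.18 looks as follows (a sketch: the chain used for the spot-check is a toy doubly stochastic matrix, whose stationary distribution is uniform, so exactness of the output law is easy to see empirically; `is_tree_rooted_at` is the obvious pointer-following test).

```python
import random

def is_tree_rooted_at(pointer, j, n):
    """True iff following i -> pointer[i] from every i != j reaches j."""
    for i in range(n):
        if i == j:
            continue
        seen, cur = set(), i
        while cur != j:
            if cur in seen:
                return False
            seen.add(cur)
            cur = pointer[cur]
    return True

def tree_sample(P, rng):
    """Repeat the Corollary 9.18 procedure until a tree appears;
    the root J then has exactly the stationary distribution."""
    n = len(P)
    while True:
        J = rng.randrange(n)                       # uniform candidate root
        xi = [rng.choices(range(n), weights=P[i])[0] for i in range(n)]
        if is_tree_rooted_at(xi, J, n):
            return J

# doubly stochastic, so pi is uniform on the 3 states
P = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(6000):
    counts[tree_sample(P, rng)] += 1
print([c / 6000 for c in counts])   # each frequency close to 1/3
```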

Now for any chain, fix $\sigma>\tau_{1}^{*}$. The submultiplicativity (yyy) property of separation, applied to the chain with transition probabilities $\tilde{p}_{ij}=P_{i}(X_{U_{\sigma}}=j)$, shows that if $V$ denotes the sum of $m$ independent copies of $U_{\sigma}$, and $\xi_{i}$ is the state reached after $V$ steps of the chain started at $i$, then

 $P(\xi_{i}=j)\equiv P_{i}(X_{V}=j)\geq(1-2^{-m})\pi_{j}\ \forall i,j.$

So putting $m=\lceil-\log_{2}(1-(1/2)^{1/n})\rceil=\Theta(\log n)$, the probabilities $(P(\xi_{i}=j))$ satisfy (9.16).
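For concreteness, the required $m$ can be computed directly; a quick check (a sketch, with an illustrative function name) confirms the $\Theta(\log n)$ growth.

```python
import math

def copies_needed(n):
    """Smallest integer m with 1 - 2**(-m) >= (1/2)**(1/n), cf. (9.16)."""
    return math.ceil(-math.log2(1.0 - 0.5 ** (1.0 / n)))

# m grows like log2(n) plus a bounded constant
for n in (10, 100, 1000, 10000):
    print(n, copies_needed(n))
```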

Combining these procedures, we have (for fixed $\sigma>\tau_{1}^{*}$) an algorithm which, in a mean number $nm\sigma r=O(\sigma n^{2}\log n)$ of steps, has chance $\geq 3/4$ to produce an output, and (if so) the output has distribution exactly $\pi$. Of course we initially don’t know the right $\sigma$ to use, but we simply try $n,2n,4n,8n,\ldots$ in turn until some output appears, and the mean total number of steps will satisfy the asserted bound (9.14).

9.3.2 Approximate sampling via coalescing paths

A second approach involves the parameter $\tau_{0}=\sum_{j}\pi_{j}E_{i}T_{j}$ arising in the random target lemma (Chapter 2 yyy). Aldous [24] gives an algorithm which, given ${\bf P}$ and $\varepsilon>0$, outputs a random state $\xi$ for which $||P(\xi\in\cdot)-\pi||\leq\varepsilon$, and such that the mean number of steps is at most

 $81\tau_{0}/\varepsilon^{2}.$ (9.17)

The details are messy, so let us just outline the (simple) underlying idea. Suppose we can define a procedure which terminates in some random number $Y$ of steps, where $Y$ is an estimate of $\tau_{0}$: precisely, suppose that for any ${\bf P}$

 $P(Y\leq\tau_{0})\leq\varepsilon;\ \ \ EY\leq K\tau_{0}$ (9.18)

where $K$ is an absolute constant. We can then define an algorithm as follows.

Simulate $Y$; then run the chain for $U_{Y/\varepsilon}$ steps and output the final state $\xi$

where as above $U_{\sigma}$ denotes a random time uniform on $\{0,1,\ldots,\sigma-1\}$, independent of the chain. This works because, arguing as at xxx,

 $||P(X_{U_{\sigma}}\in\cdot)-\pi||\leq\tau_{0}/\sigma$

and so

 $||P(\xi\in\cdot)-\pi||\ \leq E\min\Big(1,\frac{\tau_{0}}{Y/\varepsilon}\Big)\leq 2\varepsilon,$

since the quantity inside the expectation is at most $\varepsilon$ on the event $\{Y>\tau_{0}\}$, while the complementary event $\{Y\leq\tau_{0}\}$ has probability at most $\varepsilon$ by (9.18).

And the mean number of steps is $(1+\frac{1}{2\varepsilon})EY$.

So the issue is to define a procedure terminating in $Y$ steps, where $Y$ satisfies (9.18). Label the states $\{1,2,\ldots,n\}$ and consider the following coalescing paths routine.

(i) Pick a uniform random state $J$.

(ii) Start the chain at state $1$, run until hitting state $J$, and write $A_{1}$ for the set of states visited along the path.

(iii) Restart the chain at state $\min\{j:j\not\in A_{1}\}$, run until hitting some state in $A_{1}$, and write $A_{2}$ for the union of $A_{1}$ and the set of states visited by this second path.

(iv) Restart the chain at state $\min\{j:j\not\in A_{2}\}$, and continue this procedure until every state has been visited. Let $Y$ be the total number of steps.

The random target lemma says that the mean number of steps in (ii) equals $\tau_{0}$, making this $Y$ a plausible candidate for a quantity satisfying (9.18). A slightly more complicated algorithm is in fact needed – see [24].
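The routine (i)–(iv) is straightforward to code; the following sketch (illustrative names; a toy 3-state chain for the demonstration) returns the total step count $Y$. Restarting at the smallest state not yet covered reproduces steps (ii)–(iv), including the degenerate case where the target $J$ is state $1$ itself.

```python
import random

def step(P, i, rng):
    """One jump from state i under transition matrix P."""
    return rng.choices(range(len(P)), weights=P[i])[0]

def coalescing_paths_Y(P, rng):
    """Total number of steps Y in the coalescing paths routine (i)-(iv)."""
    n = len(P)
    J = rng.randrange(n)                  # (i) uniform random target state
    A = {J}                               # states covered so far
    Y = 0
    while len(A) < n:
        # (ii)-(iv): restart at the smallest state not yet covered,
        # run until hitting the covered set, then absorb the path into A
        cur = min(set(range(n)) - A)
        path = {cur}
        while cur not in A:
            cur = step(P, cur, rng)
            Y += 1
            path.add(cur)
        A |= path
    return Y

P = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
print(coalescing_paths_Y(P, random.Random(0)))
```

Averaging many independent runs of `coalescing_paths_Y` gives the empirical handle on $\tau_{0}$ underlying (9.18).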

9.3.3 Exact sampling via backwards coupling

Write $U$ for a r.v. uniform on $[0,1]$, and $(U_{t})$ for an independent sequence of copies of $U$. Given a probability distribution on $I$, we can find a (far from unique!) function $f:[0,1]\rightarrow I$ such that $f(U)$ has the prescribed distribution. So given a transition matrix ${\bf P}$ we can find a function $f:I\times[0,1]\rightarrow I$ such that $P(f(i,U)=j)=p_{ij}$. Fix such a function. Simultaneously for each state $i$, define

 $X^{(i)}_{0}=i;\ \ \ X^{(i)}_{t}=f(X^{(i)}_{t-1},U_{t}),t=1,2,\ldots.$

xxx tie up with coupling treatment

Consider the (forwards) coupling time

 $C^{*}=\min\{t:X^{(i)}_{t}=X^{(j)}_{t}\ \forall i,j\}\leq\infty.$

By considering an initial state $j$ chosen according to the stationary distribution $\pi$,

 $\max_{i}||P_{i}(X_{t}\in\cdot)-\pi||\leq P(C^{*}>t).$

This can be used as the basis for an approximate sampling algorithm. As a simple implementation, repeat the procedure defining $C^{*}$ some $k$ times; supposing we get finite values $C^{*}_{1},\ldots,C^{*}_{k}$ each time, run the chain from an arbitrary initial state for $\max_{1\leq j\leq k}C^{*}_{j}$ steps and output the final state $\xi$. Then the error $||P(\xi\in\cdot)-\pi||$ is bounded by a function $\delta(k)$ such that $\delta(k)\rightarrow 0$ as $k\rightarrow\infty$.
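A forwards-coupling sketch, using the inverse-CDF construction of $f$ (one of many valid choices of $f$, as noted above; the chain and names are illustrative): all $n$ chains are driven by the same $U_{t}$ at each step, and $C^{*}$ is the first time they agree.

```python
import itertools
import random

def make_f(P):
    """An update rule f with P(f(i,U) = j) = p_ij, via the inverse CDF."""
    def f(i, u):
        acc = 0.0
        for j, pij in enumerate(P[i]):
            acc += pij
            if u < acc:
                return j
        return len(P) - 1
    return f

def forwards_coupling_time(P, rng, max_t=10**6):
    """C* = first time the n coupled chains (one per start state) agree."""
    f = make_f(P)
    X = list(range(len(P)))          # X_0^{(i)} = i for every state i
    for t in itertools.count(1):
        u = rng.random()             # the shared U_t drives every chain
        X = [f(x, u) for x in X]
        if len(set(X)) == 1:
            return t
        if t >= max_t:
            raise RuntimeError("no coalescence within max_t steps")

P = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
print(forwards_coupling_time(P, random.Random(0)))
```

For this toy chain any $u<0.25$ (or $u\geq 0.75$) sends all three chains to a common state, so $C^{*}$ is stochastically dominated by a geometric variable and the routine terminates quickly.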

Propp and Wilson [286] observed that by using instead a backwards coupling method (which has been exploited in other contexts – see Notes) one could make an exact sampling algorithm. Regard our i.i.d. sequence $(U_{t})$ as defined for $-\infty<t\leq 0$. For each state $i$ and each time $s<0$ define

 $X^{(i,s)}_{s}=i;\ \ \ X^{(i,s)}_{t}=f(X^{(i,s)}_{t-1},U_{t}),t=s+1,s+2,\ldots,0.$

Consider the backwards coupling time

 $C=\max\{t:X^{(i,t)}_{0}=X^{(j,t)}_{0}\ \forall i,j\}\geq-\infty.$
Lemma 9.19 (Backwards coupling lemma)

If $S$ is a random time such that $-\infty<S\leq C$ a.s., then the random variable $X^{(i,S)}_{0}$ does not depend on $i$ and has the stationary distribution $\pi$.
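The resulting Propp–Wilson scheme in code: run the coupled chains from times $-1,-2,-4,\ldots$, reusing the same stored $U_{t}$ for each time slot across restarts, until the maps coalesce by time $0$ (a sketch; the inverse-CDF $f$ and the toy doubly stochastic chain, with uniform $\pi$, are illustrative).

```python
import random

def make_f(P):
    """An update rule f with P(f(i,U) = j) = p_ij, via the inverse CDF."""
    def f(i, u):
        acc = 0.0
        for j, pij in enumerate(P[i]):
            acc += pij
            if u < acc:
                return j
        return len(P) - 1
    return f

def cftp_sample(P, rng):
    """Coupling from the past: an exact sample from the stationary pi."""
    f = make_f(P)
    n = len(P)
    U = {}                                   # U[t] for t = 0, -1, -2, ...
    T = 1
    while True:
        # run all n chains from time -T to time 0, reusing stored U's
        X = list(range(n))                   # X_{-T}^{(i,-T)} = i
        for t in range(-T + 1, 1):           # steps at times -T+1, ..., 0
            if t not in U:
                U[t] = rng.random()
            X = [f(x, U[t]) for x in X]
        if len(set(X)) == 1:
            return X[0]                      # coalesced: exact sample
        T *= 2                               # go further into the past

P = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(6000):
    counts[cftp_sample(P, rng)] += 1
print([c / 6000 for c in counts])   # pi is uniform here, each near 1/3
```

Note that reusing the stored $U_{t}$ across restarts is essential: drawing fresh randomness on each doubling would bias the output away from $\pi$.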

xxx describe algorithm

xxx poset story

xxx analysis in general setting and in poset setting.

xxx compare the 3 methods