# 9.3 Self-verifying algorithms for sampling from a stationary distribution

To start with an analogy, we can in principle compute a mean hitting time $E_{i}T_{j}$ from the transition matrix ${\bf P}$, but we could alternatively estimate $E_{i}T_{j}$ by “pure simulation”: simulate $m$ times the chain started at $i$ and run until hitting $j$, and then (roughly speaking) the empirical average of these $m$ hitting times will be $(1\pm O(m^{-1/2}))E_{i}T_{j}$. In particular, for fixed $\varepsilon$ we can (roughly speaking) estimate $E_{i}T_{j}$ to within a factor $(1\pm\varepsilon)$ in $O(E_{i}T_{j})$ steps. Analogously, consider some notion of mixing time $\tau$ (say $\tau_{1}$ or $\tau_{2}$, in the reversible setting). The focus in this book has been on theoretical methods for bounding $\tau$ in terms of ${\bf P}$, and on the theoretical consequences of such bounds. But can we bound $\tau$ by pure simulation? More importantly, in the practical context of Markov chain Monte Carlo, can we devise a “self-verifying” algorithm which produces an approximately-stationary sample from a chain in $O(\tau)$ steps without having prior knowledge of $\tau$?

xxx tie up with MCMC discussion.

To say things a little more carefully, a “pure simulation” algorithm is one in which the transition matrix ${\bf P}$ is unknown to the algorithm. Instead, there is a list of the states, and at each step the algorithm can obtain, for any state $i$, a sample from the jump distribution $p(i,\cdot)$, independent of previous samples.

In the MCMC context we typically have an exponentially large state space and seek polynomial-time estimates. The next lemma (which we leave to the reader to state and prove more precisely) shows that no pure simulation algorithm can guarantee to do this.

###### Lemma 9.16

Consider a pure simulation algorithm which, given any irreducible $n$-state chain, eventually outputs a random state whose distribution is guaranteed to be within $\varepsilon$ of the stationary distribution in variation distance. Then the algorithm must take $\Omega(n)$ steps for every ${\bf P}$.

Outline of proof. If there is a state $k$ for which $1-p(k,k)$ is extremely small, then the stationary distribution will be almost concentrated on $k$; an algorithm with some chance of terminating before it has sampled a step from every state cannot rule out the possibility that some unvisited state $k$ has this property. $\Box$

## 9.3.1 Exact sampling via the Markov chain tree theorem

Lovász and Winkler [241] observed that the Markov chain tree theorem (Theorem 9.10) could be used to give a “pure simulation” algorithm for generating exactly from the stationary distribution of an arbitrary $n$-state chain. The algorithm takes

 $O(\tau_{1}^{*}\ n^{2}\log n)$ (9.14)

steps, where $\tau_{1}^{*}$ is the mixing time parameter defined as the smallest $t$ such that

 $P_{i}(X_{U_{\sigma}}=j)\geq\frac{1}{2}\pi_{j}\mbox{ for all }i,j\in I,\ \sigma\geq t$ (9.15)

where $U_{\sigma}$ denotes a random time uniform on $\{0,1,\ldots,\sigma-1\}$, independent of the chain.

xxx tie up with Chapter 4 discussion and [240].

The following two facts are the mathematical ingredients of the algorithm. We quote as Lemma 9.17(a) a result of Ross [299] (see also [53] Theorem XIV.37); part (b) is an immediate consequence.

###### Lemma 9.17

(a) Let $\pi$ be a probability distribution on $I$ and let $(F_{i};i\in I)$ be independent with distribution $\pi$. Fix $j$, and consider the digraph with edges $\{(i,F_{i}):i\neq j\}$. Then with probability (exactly) $\pi_{j}$, the digraph is a tree with edges directed toward the root $j$.

(b) So if the root $j$ is first chosen uniformly at random from $I$, then the unconditional probability that the digraph is a tree directed toward $j$ is exactly $1/n$.
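Lemma 9.17(a) is easy to check by simulation. The following sketch (Python, with illustrative function names of our own) samples the $F_{i}$ and tests whether the resulting digraph is a tree directed toward a chosen root $j$; by the lemma the empirical frequency should approach $\pi_{j}$.

```python
import random

def is_tree_rooted_at(edges, root, n):
    # The digraph {(i, edges[i]) : i != root} is a tree directed toward
    # `root` iff following the edges from every vertex eventually reaches root.
    for start in range(n):
        i, seen = start, set()
        while i != root:
            if i in seen:          # cycle avoiding the root: not a tree
                return False
            seen.add(i)
            i = edges[i]
    return True

def tree_probability(pi, root, trials=200_000, seed=0):
    """Estimate P(digraph is a tree rooted at `root`) when the F_i are
    i.i.d. with distribution pi; Lemma 9.17(a) says this equals pi[root]."""
    rng = random.Random(seed)
    n = len(pi)
    hits = 0
    for _ in range(trials):
        F = {i: rng.choices(range(n), weights=pi)[0]
             for i in range(n) if i != root}
        if is_tree_rooted_at(F, root, n):
            hits += 1
    return hits / trials
```

For instance, with $\pi=(0.5,0.3,0.2)$ and root $0$ the estimate should be close to $0.5$.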

As the second ingredient, observe that the Markov chain tree formula (Corollary 9.11) can be rephrased as follows.

###### Corollary 9.18

Let $\pi$ be the stationary distribution for a transition matrix ${\bf P}$ on $I$. Let $J$ be random, uniform on $I$. Let $(\xi_{i};i\in I)$ be independent, with $P(\xi_{i}=j)=p_{ij}$. Consider the digraph with edges $\{(i,\xi_{i}):i\neq J\}$. Then, conditional on the digraph being a tree with edges directed toward the root $J$, the probability that $J=j$ equals $\pi_{j}$.

So consider the special case of a chain with the property

 $p_{ij}\geq(1/2)^{1/n}\pi_{j}\ \forall i,j.$ (9.16)

The probability of getting any particular digraph under the procedure of Corollary 9.18 is at least $((1/2)^{1/n})^{n-1}\geq 1/2$ times the probability of getting that digraph under the procedure of Lemma 9.17, and so by Lemma 9.17(b) the probability of getting some tree is at least $1/(2n)$. So if the procedure of Corollary 9.18 is repeated $r=\lceil 2n\log 4\rceil$ times, the chance that some repetition produces a tree is at least $1-(1-\tfrac{1}{2n})^{2n\log 4}\geq 1-e^{-\log 4}=3/4$, and then the root $J$ of the tree has distribution exactly $\pi$.

Now for any chain, fix $\sigma>\tau_{1}^{*}$. The submultiplicativity (yyy) property of separation, applied to the chain with transition probabilities $\tilde{p}_{ij}=P_{i}(X_{U_{\sigma}}=j)$, shows that if $V$ denotes the sum of $m$ independent copies of $U_{\sigma}$, and $\xi_{i}$ is the state reached after $V$ steps of the chain started at $i$, then

 $P(\xi_{i}=j)\equiv P_{i}(X_{V}=j)\geq(1-2^{-m})\pi_{j}\ \forall i,j.$

So putting $m=\lceil-\log_{2}(1-(1/2)^{1/n})\rceil=\Theta(\log n)$, the probabilities $(P(\xi_{i}=j))$ satisfy (9.16).

Combining these procedures, we have (for fixed $\sigma>\tau_{1}^{*}$) an algorithm which, in a mean number $nm\sigma r=O(\sigma n^{2}\log n)$ of steps, has chance $\geq 3/4$ to produce an output, and (if so) the output has distribution exactly $\pi$. Of course we initially don’t know the right $\sigma$ to use, but we simply try $n,2n,4n,8n,\ldots$ in turn until some output appears, and the mean total number of steps will satisfy the asserted bound (9.14).
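Under the stated access model (a one-step sampler returning a draw from $p(i,\cdot)$), the combined procedure can be sketched as follows. This is an illustrative rendering, not the authors' code: the names `step`, `lovasz_winkler_sample` and `_is_tree` are ours, and each $\xi_{i}$ is given its own independent copy of $V$ so that the $(\xi_{i})$ are independent, as Corollary 9.18 requires.

```python
import math
import random

def _is_tree(edges, root, n):
    # tree directed toward `root` iff every vertex's edge-path reaches it
    for start in range(n):
        i, seen = start, set()
        while i != root:
            if i in seen:
                return False
            seen.add(i)
            i = edges[i]
    return True

def lovasz_winkler_sample(n, step, rng):
    """Output a state distributed exactly as the stationary distribution,
    using only the one-step sampler `step(i)` ~ p(i, .)."""
    m = math.ceil(-math.log2(1.0 - 0.5 ** (1.0 / n)))   # m = Theta(log n)
    r = math.ceil(2 * n * math.log(4))                  # repetitions per sigma
    sigma = n
    while True:
        for _ in range(r):
            J = rng.randrange(n)                        # uniform random root
            xi = {}
            for i in range(n):
                if i == J:
                    continue
                # V = sum of m independent uniforms on {0,...,sigma-1},
                # drawn afresh for each i so the xi_i are independent
                V = sum(rng.randrange(sigma) for _ in range(m))
                x = i
                for _ in range(V):
                    x = step(x)
                xi[i] = x
            if _is_tree(xi, J, n):
                return J     # conditional on a tree, J has law pi exactly
        sigma *= 2           # sigma may have been below tau_1^*; retry
```

Note that the output is exact for every $\sigma$; enlarging $\sigma$ only increases the chance that a repetition produces a tree.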

## 9.3.2 Approximate sampling via coalescing paths

A second approach involves the parameter $\tau_{0}=\sum_{j}\pi_{j}E_{i}T_{j}$ arising in the random target lemma (Chapter 2 yyy). Aldous [24] gives an algorithm which, given ${\bf P}$ and $\varepsilon>0$, outputs a random state $\xi$ for which $||P(\xi\in\cdot)-\pi||\leq\varepsilon$, and such that the mean number of steps is at most

 $81\tau_{0}/\varepsilon^{2}.$ (9.17)

The details are messy, so let us just outline the (simple) underlying idea. Suppose we can define a procedure which terminates in some random number $Y$ of steps, where $Y$ is an estimate of $\tau_{0}$: precisely, suppose that for any ${\bf P}$

 $P(Y\leq\tau_{0})\leq\varepsilon;\ \ \ EY\leq K\tau_{0}$ (9.18)

where $K$ is an absolute constant. We can then define an algorithm as follows.

Simulate $Y$; then run the chain for $U_{Y/\varepsilon}$ steps and output the final state $\xi$

where as above $U_{\sigma}$ denotes a random time uniform on $\{0,1,\ldots,\sigma-1\}$, independent of the chain. This works because, arguing as at xxx,

 $||P(X_{U_{\sigma}}\in\cdot)-\pi||\leq\tau_{0}/\sigma$

and so, bounding the variation distance by $1$ on the event $\{Y\leq\tau_{0}\}$ (which has probability at most $\varepsilon$) and by $\tau_{0}/(Y/\varepsilon)\leq\varepsilon$ on its complement,

 $||P(\xi\in\cdot)-\pi||\ \leq E\min\left(1,\frac{\tau_{0}}{Y/\varepsilon}\right)\leq 2\varepsilon.$

And the mean number of steps is $(1+\frac{1}{2\varepsilon})EY$.

So the issue is to define a procedure terminating in $Y$ steps, where $Y$ satisfies (9.18). Label the states $\{1,2,\ldots,n\}$ and consider the following coalescing paths routine.

(i) Pick a uniform random state $J$.

(ii) Start the chain at state $1$, run until hitting state $J$, and write $A_{1}$ for the set of states visited along the path.

(iii) Restart the chain at state $\min\{j:j\not\in A_{1}\}$, run until hitting some state in $A_{1}$, and write $A_{2}$ for the union of $A_{1}$ and the set of states visited by this second path.

(iv) Restart the chain at state $\min\{j:j\not\in A_{2}\}$, and continue this procedure until every state has been visited. Let $Y$ be the total number of steps.

The random target lemma says that the mean number of steps in (ii) equals $\tau_{0}$, making this $Y$ a plausible candidate for a quantity satisfying (9.18). A slightly more complicated algorithm is in fact needed – see [24].
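A minimal rendering of the routine (Python; `step(i)` is an assumed one-step sampler for $p(i,\cdot)$, and the function name is ours):

```python
import random

def coalescing_paths_Y(n, step, rng):
    """One run of steps (i)-(iv): returns the total number of chain
    steps Y taken until every state has been visited."""
    J = rng.randrange(n)                  # (i) uniform random target
    visited = {J}
    Y = 0
    start = 0
    while len(visited) < n:
        while start in visited:           # smallest state not yet visited
            start += 1
        # (ii)/(iii): run from `start` until hitting the visited set;
        # revisits within the current path do NOT stop the walk
        x, path = start, {start}
        while x not in visited:
            x = step(x)
            Y += 1
            path.add(x)
        visited |= path
    return Y
```

For the symmetric two-state chain, for example, a single path covers everything and $EY=2$.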

## 9.3.3 Exact sampling via backwards coupling

Write $U$ for a r.v. uniform on $[0,1]$, and $(U_{t})$ for an independent sequence of copies of $U$. Given a probability distribution on $I$, we can find a (far from unique!) function $f:[0,1]\rightarrow I$ such that $f(U)$ has the prescribed distribution. So given a transition matrix ${\bf P}$ we can find a function $f:I\times[0,1]\rightarrow I$ such that $P(f(i,U)=j)=p_{ij}$. Fix such a function. Simultaneously for each state $i$, define

 $X^{(i)}_{0}=i;\ \ \ X^{(i)}_{t}=f(X^{(i)}_{t-1},U_{t}),t=1,2,\ldots.$

xxx tie up with coupling treatment

Consider the (forwards) coupling time

 $C^{*}=\min\{t:X^{(i)}_{t}=X^{(j)}_{t}\ \forall i,j\}\leq\infty.$

By considering an initial state $j$ chosen according to the stationary distribution $\pi$,

 $\max_{i}||P_{i}(X_{t}\in\cdot)-\pi||\leq P(C^{*}>t).$

This can be used as the basis for an approximate sampling algorithm. As a simple implementation, repeat the procedure defining $C^{*}$ some $k$ times; supposing we obtain finite values $C^{*}_{1},\ldots,C^{*}_{k}$ each time, run the chain from an arbitrary initial state for $\max_{1\leq j\leq k}C^{*}_{j}$ steps and output the final state $\xi$. Then the error $||P(\xi\in\cdot)-\pi||$ is bounded by a function $\delta(k)$ such that $\delta(k)\rightarrow 0$ as $k\rightarrow\infty$.
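The forward scheme is straightforward to simulate. Here is a sketch under the assumption that ${\bf P}$ is known and $f$ is realized by inverse-CDF lookup in a row of ${\bf P}$ (one valid choice among the many functions $f$ noted above); the function names are ours.

```python
import random

def f_update(i, u, P):
    """One choice of update function: f(i, u) = j, where j is the
    inverse-CDF index of u in row i, so that P(f(i, U) = j) = P[i][j]."""
    acc = 0.0
    for j, pij in enumerate(P[i]):
        acc += pij
        if u < acc:
            return j
    return len(P[i]) - 1                # guard against rounding

def forward_coupling_time(P, rng, cap=10_000):
    """Simulate C*: run a copy of the chain from every state, feeding
    all copies the SAME innovation U_t at each step, until they agree."""
    n = len(P)
    states = list(range(n))
    t = 0
    while any(x != states[0] for x in states):
        u = rng.random()
        states = [f_update(x, u, P) for x in states]
        t += 1
        if t >= cap:                    # C* may be infinite for a bad f
            return None
    return t
```

Repeating this $k$ times and running the chain for $\max_{j}C^{*}_{j}$ steps gives the approximate sampler just described.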

Propp and Wilson [286] observed that by using instead a backwards coupling method (which has been exploited in other contexts – see Notes) one could make an exact sampling algorithm. Regard our i.i.d. sequence $(U_{t})$ as defined for $-\infty<t\leq 0$. For each state $i$ and each time $s<0$ define

 $X^{(i,s)}_{s}=i;\ \ \ X^{(i,s)}_{t}=f(X^{(i,s)}_{t-1},U_{t}),t=s+1,s+2,\ldots,0.$

Consider the backwards coupling time

 $C=\max\{t:X^{(i,t)}_{0}=X^{(j,t)}_{0}\ \forall i,j\}\geq-\infty.$

###### Lemma 9.19 (Backwards coupling lemma)

If $S$ is a random time such that $-\infty<S\leq C$ a.s., then the random variable $X^{(i,S)}_{0}$ does not depend on $i$ and has the stationary distribution $\pi$.
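A minimal sketch of the resulting algorithm (coupling from the past), again with $f$ realized by inverse-CDF lookup and with names of our own choosing. The essential point is that the innovations $U_{0},U_{-1},U_{-2},\ldots$ are drawn once and reused as the starting time moves further into the past.

```python
import random

def f_update(i, u, P):
    # inverse-CDF update: P(f_update(i, U) = j) = P[i][j]
    acc = 0.0
    for j, pij in enumerate(P[i]):
        acc += pij
        if u < acc:
            return j
    return len(P[i]) - 1

def propp_wilson(P, rng):
    """Return one sample distributed exactly as the stationary
    distribution of P, via backwards coupling."""
    n = len(P)
    us = []                          # us[t] plays the role of U_{-t}
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())  # extend into the past, never redraw
        states = list(range(n))      # all chains started at time -T
        for t in range(T - 1, -1, -1):
            states = [f_update(x, us[t], P) for x in states]
        if all(x == states[0] for x in states):
            return states[0]         # common value at time 0, law exactly pi
        T *= 2                       # not yet coalesced: go further back
```

Doubling $T$ rather than incrementing it keeps the total work proportional to the final $T$.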

xxx describe algorithm

xxx poset story

xxx analysis in general setting and in poset setting.

xxx compare the 3 methods