# 12.1 Using coupling to bound variation distance

Recall from Chapter 2 section 4.1 (yyy 9/10/99 version) several ways in which variation distance is used to measure the deviation from stationarity of the time-$t$ distribution of a Markov chain:

 $d_{i}(t):=||P_{i}(X_{t}=\cdot)-\pi(\cdot)||;\qquad d(t):=\max_{i}d_{i}(t);\qquad\bar{d}(t):=\max_{ij}||P_{i}(X_{t}=\cdot)-P_{j}(X_{t}=\cdot)||.$

Recall also from yyy the definition of variation threshold time

 $\tau_{1}:=\min\{t:\bar{d}(t)\leq e^{-1}\}.$

Since $\tau_{1}$ is affected by continuization, when using the above definition with a discrete-time chain we write $\tau_{1}^{{\rm disc}}$ for emphasis.

Coupling provides a methodology for seeking to upper bound $\bar{d}(t)$ and hence $\tau_{1}$. After giving the (very simple) theory in section 12.1.1 and some discussion in section 12.1.2, we proceed to give a variety of examples of its use. We shall say more about theory and applications of coupling in Chapter 8 (where we discuss the related idea of coupling from the past) and in Chapter 10 (on interacting random walks).

## 12.1.1 The coupling inequality

Consider a finite-state chain in discrete or continuous time. Fix states $i,j$. Suppose we construct a coupling, that is a joint process $((X^{(i)}_{t},X^{(j)}_{t}),\ t\geq 0)$ such that

 $(X^{(i)}_{t},t\geq 0)$ is distributed as the chain started at $i$, and $(X^{(j)}_{t},t\geq 0)$ is distributed as the chain started at $j$. (12.1)

And suppose there is a random time $T^{ij}\leq\infty$ such that

 $X^{(i)}_{t}=X^{(j)}_{t},\ T^{ij}\leq t<\infty.$ (12.2)

Call such a $T^{ij}$ a coupling time. Then the coupling inequality is

 $||P_{i}(X_{t}\in\cdot)-P_{j}(X_{t}\in\cdot)||\leq P(T^{ij}>t),\ 0\leq t<\infty.$ (12.3)

The inequality holds because

 $||P_{i}(X_{t}\in\cdot)-P_{j}(X_{t}\in\cdot)||=||P(X^{(i)}_{t}\in\cdot)-P(X^{(j)}_{t}\in\cdot)||\leq P(X^{(i)}_{t}\neq X^{(j)}_{t})\leq P(T^{ij}>t).$
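As a sanity check, the inequality can be verified numerically on a small example. The sketch below uses an illustrative setup not taken from the text: a lazy random walk on the 6-cycle, coupled by running the two copies independently until they meet. Both sides of (12.3) are computed exactly via matrix powers; the diagonal of the product chain is made absorbing only to record the coupling time.

```python
import numpy as np

# Illustrative chain: lazy random walk on the 6-cycle (hold 1/2, step 1/4 each way).
n = 6
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i + 1) % n] = 0.25
    P[i, (i - 1) % n] = 0.25

# Independence coupling: the two copies move independently until they meet.
# The diagonal is made absorbing only to record the coupling time T; the
# off-diagonal mass of Q^t started from (i,j) is exactly P(T^{ij} > t).
Q = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        if i == j:
            Q[i * n + j, i * n + j] = 1.0
        else:
            for k in range(n):
                for l in range(n):
                    Q[i * n + j, k * n + l] = P[i, k] * P[j, l]

Pt, Qt = np.eye(n), np.eye(n * n)
for t in range(1, 21):
    Pt, Qt = Pt @ P, Qt @ Q
    for i in range(n):
        for j in range(n):
            tv = 0.5 * np.abs(Pt[i] - Pt[j]).sum()   # variation distance
            tail = 1.0 - sum(Qt[i * n + j, k * n + k] for k in range(n))
            assert tv <= tail + 1e-12                # coupling inequality (12.3)
```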

## 12.1.2 Comments on coupling methodology

The coupling inequality provides a method of bounding the variation distance $\bar{d}(t)$, because if we can construct a coupling for an arbitrary pair $(i,j)$ of initial states then

 $\bar{d}(t)\leq\max_{i,j}P(T^{ij}>t).$

The reader may wish to look at a few of the examples before reading this section in detail.

In applying coupling methodology there are two issues. First we need to specify the coupling, then we need to analyze the coupling time. The most common strategy for constructing couplings is via Markov couplings, as follows. Suppose the underlying chain has state space $I$ and (to take the continuous-time case) transition rate matrix ${\bf Q}=(q(i,k))$. Consider a transition rate matrix $\widetilde{{\bf Q}}$ on the product space $I\times I$. Write the entries of $\widetilde{{\bf Q}}$ as $\tilde{q}(i,j;k,l)$ instead of the logical-but-fussy $\tilde{q}((i,j),(k,l))$. Suppose that, for each pair $(i,j)$ with $j\neq i$,

 $\tilde{q}(i,j;\cdot,\cdot)\mbox{ has marginals }q(i,\cdot)\mbox{ and }q(j,\cdot)$ (12.4)

in other words $\sum_{l}\tilde{q}(i,j;k,l)=q(i,k)$ and $\sum_{k}\tilde{q}(i,j;k,l)=q(j,l)$. And suppose that

 $\tilde{q}(i,i;k,k)=q(i,k)\mbox{ for all }k;\qquad\tilde{q}(i,i;k,l)=0\mbox{ for }l\neq k.$

Take $(X^{(i)}_{t},X^{(j)}_{t})$ to be the chain on $I\times I$ with transition rate matrix $\widetilde{{\bf Q}}$ and initial position $(i,j)$. Then (12.1) must hold, and $T^{ij}:=\min\{t:X^{(i)}_{t}=X^{(j)}_{t}\}$ is a coupling time. This construction gives a natural Markov coupling, and all the examples where we use the coupling inequality will be of this form. In practice it is much more understandable to define the joint process in words, and we usually do so.

In constructing and analyzing couplings, we often exploit (explicitly or implicitly) some integer-valued metric $\rho(i,j)$ on the state space $I$. Then with a Markovian coupling,

 $P(T^{ij}>t)=P(\rho(X^{(i)}_{t},X^{(j)}_{t})\geq 1)$

and it is enough to study the integer-valued process $Z_{t}:=\rho(X^{(i)}_{t},X^{(j)}_{t})$. Typically $(Z_{t})$ is not Markov, but one can try to compare it with some integer-valued Markov process $(Z^{*}_{t})$. Indeed, in defining the coupling one has in mind trying to make such a comparison possible. Often one shows that for any initial $(i,j)$ the random time $T^{ij}$ is stochastically smaller than the hitting time $T^{*}_{a0}$ for the comparison chain $(Z^{*}_{t})$ to reach $0$ starting from $a:=\max_{i,j}\rho(i,j)$. This would imply

 $\bar{d}(t)\leq P(T^{*}_{a0}>t).$

Finally, one does calculations with the integer-valued chain $(Z^{*}_{t})$, either bounding the tail probability $P(T^{*}_{a0}>t)$ directly or (what is often simpler) just bounding the expectation $ET^{*}_{a0}$, so that by Markov’s inequality and the submultiplicativity property (Chapter 2 Lemma 20) (yyy 9/10/99 version) we have in continuous time

 $\tau_{1}\leq eET^{*}_{a0};\quad\bar{d}(t)\leq\exp(1-{\textstyle\frac{t}{\tau_{1}}}).$

Here is perhaps the simplest comparison lemma, whose proof is left to the reader.

###### Lemma 12.1 (Decreasing functional lemma)

Let $(Y_{t})$ be a Markov chain on $S$ and $f:S\to\{0,1,2,\ldots,\Delta\}$ a function. Suppose that for each $1\leq i\leq\Delta$ and each initial state $y$ with $f(y)=i$,

(i) $f(Y_{1})\leq i$;

(ii) $P(f(Y_{1})\leq i-1)\geq a_{i}>0$.

Then

 $\max_{y\in S}E_{y}T_{A}\leq\sum_{i=1}^{\Delta}1/a_{i}$

where $A:=\{y:f(y)=0\}$.

We now start a series of examples. Note that when presenting a coupling proof we don’t need to explicitly check irreducibility, because the conclusion of a bound on coupling time obviously implies irreducibility.

## 12.1.3 Random walk on a dense regular graph

(Chapter 5 Example 16).

Consider an $r$-regular $n$-vertex graph. Write $\mbox{{\cal N}}(v)$ for the set of neighbors of $v$. For any pair $v,w$ we can define a $1-1$ map $\theta_{v,w}:\mbox{{\cal N}}(v)\rightarrow\mbox{{\cal N}}(w)$ such that $\theta_{v,w}(x)=x$ for $x\in\mbox{{\cal N}}(v)\cap\mbox{{\cal N}}(w)$. Consider discrete-time random walk on the graph. We define a “greedy coupling” by specifying the joint transition matrix

 $\tilde{p}(v,w;x,\theta_{v,w}(x))=1/r,\ \ x\in\mbox{{\cal N}}(v).$

That is, from vertices $v$ and $w$, if the first chain jumps to $x$ then the second chain jumps to $\theta_{v,w}(x)$, and we maximize the chance of the two chains meeting after a single step. In general one cannot get useful bounds on the coupling time. But consider the dense case, where $r>n/2$. As observed in Chapter 5 Example 16, here $|\mbox{{\cal N}}(v)\cap\mbox{{\cal N}}(w)|\geq 2r-n$ and so the coupled process $(X_{t},Y_{t})$ has the property that for $w\neq v$

 $P(X_{t+1}=Y_{t+1}|X_{t}=v,Y_{t}=w)=\frac{|\mbox{{\cal N}}(v)\cap\mbox{{\cal N}}(w)|}{r}\geq\frac{2r-n}{r}$

implying that the coupling time $T$ (for any initial pair of states) satisfies

 $P(T>t)\leq\left(\frac{n-r}{r}\right)^{t}.$

So the coupling inequality implies $\bar{d}(t)\leq(\frac{n-r}{r})^{t}$. In particular the variation threshold satisfies

 $\tau_{1}^{{\rm disc}}=O(1)\mbox{ as }n\to\infty,\ r/n\to\alpha>1/2.$
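The construction of $\theta_{v,w}$ can be illustrated in code. The sketch below uses an assumed dense graph (the complete graph on $n$ vertices minus a perfect matching, so $r=n-2>n/2$) and checks that $\theta_{v,w}$ is a bijection fixing common neighbors, so the greedy coupling meets in one step with probability at least $(2r-n)/r$ from every pair of states.

```python
# Assumed dense graph: complete graph on n vertices minus a perfect matching,
# so every vertex has degree r = n - 2 > n/2.
n = 8
r = n - 2
mate = {v: v ^ 1 for v in range(n)}   # deleted matching edges: 0-1, 2-3, ...
nbrs = {v: {u for u in range(n) if u != v and u != mate[v]} for v in range(n)}

for v in range(n):
    for w in range(n):
        if w == v:
            continue
        common = nbrs[v] & nbrs[w]
        only_v = sorted(nbrs[v] - common)
        only_w = sorted(nbrs[w] - common)
        # theta fixes common neighbours, maps the rest by an arbitrary bijection
        theta = {x: x for x in common}
        theta.update(zip(only_v, only_w))
        assert sorted(theta) == sorted(nbrs[v])           # domain is N(v)
        assert sorted(theta.values()) == sorted(nbrs[w])  # range is N(w)
        # coupled walks meet in one step with chance |common|/r >= (2r-n)/r
        assert len(common) >= 2 * r - n
```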

## 12.1.4 Continuous-time random walk on the $d$-cube

(Chapter 5 Example 15).

For ${\bf i}=(i_{1},\ldots,i_{d})$ and ${\bf j}=(j_{1},\ldots,j_{d})$ in $I=\{0,1\}^{d}$, let ${\cal D}({\bf i},{\bf j})$ be the set of coordinates $u$ where ${\bf i}$ and ${\bf j}$ differ. Write ${\bf i}^{u}$ for the state obtained by changing the $u$’th coordinate of ${\bf i}$. Recall that in continuous time the components move independently as $2$-state chains with transition rates $1/d$. Define a coupling in words as “run unmatched coordinates independently until they match, and then run them together”. Formally, the non-zero transitions of the joint process are

 $\tilde{q}({\bf i},{\bf j};{\bf i}^{u},{\bf j}^{u})=1/d\mbox{ if }u\not\in{\cal D}({\bf i},{\bf j});\qquad\tilde{q}({\bf i},{\bf j};{\bf i}^{u},{\bf j})=1/d\mbox{ if }u\in{\cal D}({\bf i},{\bf j});\qquad\tilde{q}({\bf i},{\bf j};{\bf i},{\bf j}^{u})=1/d\mbox{ if }u\in{\cal D}({\bf i},{\bf j}).$

For each coordinate which is initially unmatched, it takes exponential (rate $2/d$) time until it is matched, and so the coupling time $T=T^{{\bf ij}}$ satisfies

 $T\ \stackrel{d}{=}\ \max(\xi_{1},\ldots,\xi_{d_{0}})$

where the $(\xi_{u})$ are independent exponential (rate $2/d$) and $d_{0}=|{\cal D}({\bf i},{\bf j})|$ is the initial number of unmatched coordinates. So

 $P(T\leq t)=(1-\exp(-2t/d))^{d_{0}}$

and the coupling inequality bounds variation distance as

 $\bar{d}(t)\leq 1-(1-\exp(-2t/d))^{d}.$

This leads to an upper bound on the variation threshold time

 $\tau_{1}\leq({\textstyle\frac{1}{2}}+o(1))d\log d\mbox{ as }d\rightarrow\infty.$

This example is discussed in more detail in Chapter 5 Example 15 (yyy 4/23/96 version) where it is shown that

 $\tau_{1}\sim{\textstyle\frac{1}{4}}d\log d\mbox{ as }d\rightarrow\infty$

so the coupling bound is off by a factor of $2$.
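The asymptotics of the coupling bound can be checked numerically. The sketch below (parameter values chosen arbitrarily) computes, by bisection, the smallest $t$ at which $1-(1-e^{-2t/d})^{d}$ drops to $e^{-1}$, and confirms that it behaves like $(\frac{1}{2}+o(1))d\log d$.

```python
import math

def dbar_bound(t, d):
    # coupling bound on variation distance: 1 - (1 - exp(-2t/d))^d
    return 1.0 - (1.0 - math.exp(-2.0 * t / d)) ** d

def tau1_bound(d):
    # smallest t with dbar_bound(t, d) <= 1/e, found by bisection
    lo, hi = 0.0, d * math.log(d)
    while dbar_bound(hi, d) > math.exp(-1):
        hi *= 2
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if dbar_bound(mid, d) > math.exp(-1):
            lo = mid
        else:
            hi = mid
    return hi

for d in (100, 1000, 10000):
    ratio = tau1_bound(d) / (0.5 * d * math.log(d))
    assert 1.0 < ratio < 1.5   # consistent with (1/2 + o(1)) d log d
```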

## 12.1.5 The graph-coloring chain

Fix an $n$-vertex graph $G$ with maximal degree $r$. Fix an integer $c\geq r+2$ and consider $[c]:=\{1,2,\ldots,c\}$ as a set of $c$ colors. Let ${\rm col}(G,c)$ be the set of $c$-colorings of $G$, where a $c$-coloring is an assignment of a color to each vertex, in such a way that no two adjacent vertices have the same color. One can put a natural “product graph” structure on ${\rm col}(G,c)$, in which two colorings are adjacent if they differ at only one vertex. It is not hard to check that the condition $c\geq r+2$ ensures that ${\rm col}(G,c)$ is non-empty and the associated graph is connected. There is a natural discrete-time Markov chain on ${\rm col}(G,c)$:

Pick a vertex $v$ of $G$ uniformly at random, pick a color $\gamma$ uniformly at random, assign color $\gamma$ to vertex $v$ if feasible (i.e. if no neighbor of $v$ has color $\gamma$), else retain the existing color of $v$.

Under certain conditions a simple coupling analysis succeeds in bounding the mixing time. (The bound is far from sharp – see Notes).

###### Proposition 12.2

If $c>4r$ then $\bar{d}(t)\leq n\exp(-{\textstyle\frac{(c-4r)t}{cn}})$ and so $\tau_{1}^{{\rm disc}}\leq 1+\frac{cn}{c-4r}(1+\log n)$.

Proof. We couple two versions of the chain by simply using the same $v$ and $\gamma$ in both chains at each step. Write $D_{t}$ for the number of vertices at which the colors in the two chains differ. Then $D_{t+1}-D_{t}\in\{-1,0,1\}$ and the key estimate is the following.

###### Lemma 12.3

Conditional on the state of the coupled process at time $t$,

 $P(D_{t+1}=D_{t}+1)\leq\frac{2rD_{t}}{cn}$ (12.5)
 $P(D_{t+1}=D_{t}-1)\geq\frac{(c-2r)D_{t}}{cn}$ (12.6)

Proof. In order that $D_{t+1}=D_{t}+1$ it is necessary that the chosen pair $(v,\gamma)$ is such that

(*) there exists a neighbor ($w$, say) of $v$ such that $w$ has color $\gamma$ in one chain but not in the other chain.

But the total number of pairs $(v,\gamma)$ equals $nc$ while the number of pairs satisfying (*) is at most $D_{t}\cdot 2r$. This establishes (12.5). Similarly, for $D_{t+1}=D_{t}-1$ it is sufficient that $v$ is currently unmatched and that no neighbor of $v$ in either chain has color $\gamma$; the number of such pairs $(v,\gamma)$ is at least $D_{t}\cdot(c-2r)$. $\Box$

Lemma 12.3 implies $E(D_{t+1}-D_{t}|D_{t})\leq-(c-4r)D_{t}/(cn)$ and so

 $ED_{t+1}\leq\kappa ED_{t};\quad\kappa:=1-{\textstyle\frac{c-4r}{cn}}.$

Since $D_{0}\leq n$ we have, for any initial pair of states,

 $P(D_{t}\geq 1)\leq ED_{t}\leq\kappa^{t}n\leq n\exp(-{\textstyle\frac{(c-4r)t}{cn}})$

and the coupling lemma establishes the Proposition.
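Lemma 12.3 can be verified exhaustively on a small instance. The sketch below is an assumed example (the 6-cycle, so $r=2$, with $c=9>4r$ colors and two fixed proper colorings): it enumerates all $nc$ choices of $(v,\gamma)$, applies the coupled step, and checks (12.5) and (12.6) in exact rational arithmetic.

```python
from fractions import Fraction

n, c, r = 6, 9, 2                    # 6-cycle, max degree 2, c > 4r colours
nbrs = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}

def step(col, v, gamma):
    # recolour v with gamma if feasible, else keep the old colour
    if all(col[w] != gamma for w in nbrs[v]):
        col = list(col)
        col[v] = gamma
    return tuple(col)

x = (1, 2, 1, 2, 1, 2)               # two proper colourings of the cycle
y = (1, 2, 3, 2, 1, 2)
d0 = sum(a != b for a, b in zip(x, y))

up = down = Fraction(0)
for v in range(n):
    for gamma in range(1, c + 1):    # each pair (v, gamma) has chance 1/(nc)
        d1 = sum(a != b for a, b in zip(step(x, v, gamma), step(y, v, gamma)))
        assert abs(d1 - d0) <= 1     # D changes by at most 1 per step
        if d1 == d0 + 1:
            up += Fraction(1, n * c)
        if d1 == d0 - 1:
            down += Fraction(1, n * c)

assert up <= Fraction(2 * r * d0, c * n)          # (12.5)
assert down >= Fraction((c - 2 * r) * d0, c * n)  # (12.6)
```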

## 12.1.6 Permutations and words

The examples in sections 12.1.4 and 12.1.5 were simple prototypes of interacting particle systems (more examples of which appear in Chapter 10), whose characteristic property is that a step of the chain involves only “local” change. Chains making “global” changes are often hard to analyze, but here is a simple example.

Fix a finite alphabet $A$ of size $|A|$. Fix $m$, and consider the set $A^{m}$ of “words” ${\bf x}=(x_{1},\ldots,x_{m})$ with each $x_{i}\in A$. Consider the discrete-time Markov chain on $A^{m}$ in which a step ${\bf x}\rightarrow{\bf y}$ is specified by the following two-stage procedure.

Stage 1. Pick a permutation $\sigma$ of $\{1,2,\ldots,m\}$ uniformly at random from the set of permutations $\sigma$ satisfying $x_{\sigma(i)}=x_{i}\forall i$.

Stage 2. Let $(c_{j}(\sigma);j\geq 1)$ be the cycles of $\sigma$. For each $j$, and independently as $j$ varies, pick uniformly an element $\alpha_{j}$ of $A$, and define $y_{i}=\alpha_{j}$ for every $i\in c_{j}(\sigma)$.

Here is an alternative description. Write $\Pi$ for the set of permutations of $\{1,\ldots,m\}$. Consider the bipartite graph on vertices $A^{m}\cup\Pi$ with edge-set $\{({\bf x},\sigma):x_{\sigma(i)}=x_{i}\forall i\}$. Then the chain is random walk on this bipartite graph, watched every second step (that is, when it is in $A^{m}$).

From the second description, it is clear that the stationary probabilities $\pi({\bf x})$ are proportional to the degree of ${\bf x}$ in the bipartite graph, giving

 $\pi({\bf x})\propto\prod_{a}n_{a}({\bf x})!$

where $n_{a}({\bf x})=|\{i:x_{i}=a\}|$. We shall use a coupling argument to establish the following bound on variation distance:

 $\bar{d}(t)\leq m\left(1-\frac{1}{|A|}\right)^{t}$ (12.7)

implying that the variation threshold satisfies

 $\tau_{1}^{{\rm disc}}\leq 1+\frac{1+\log m}{-\log(1-\frac{1}{|A|})}\leq 1+(1+\log m)|A|.$

The construction of the coupling depends on the following lemma.

###### Lemma 12.4

Given finite sets $F^{1},F^{2}$ we can construct (for $u=1,2$) a uniform random permutation $\sigma^{u}$ of $F^{u}$ with cycles $(C^{u}_{j};j\geq 1)$, where the cycles are labeled such that

 $C^{1}_{j}\cap F^{1}\cap F^{2}=C^{2}_{j}\cap F^{1}\cap F^{2}\mbox{ for all }j.$

In the equality we interpret the $C^{u}_{j}$ as sets.

Proof. Given a permutation $\sigma$ of a finite set $G$, there is an induced permutation on a subset $G^{\prime}$ obtained by deleting from the cycle representation of $\sigma$ those elements not in $G^{\prime}$. It is easy to check that, for a uniform random permutation of $G$, the induced random permutation of $G^{\prime}$ is also uniform. In the setting of the lemma, take a uniform random permutation $\sigma$ of $F^{1}\cup F^{2}$, and let $\sigma^{u}$ be the induced random permutations of $F^{u}$. Then the equality holds because each side is representing the cycles of the induced permutation on $F^{1}\cap F^{2}$. $\Box$
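The key fact used in the proof — that the induced permutation of a uniform random permutation is itself uniform — can be checked by exhaustive enumeration on a small example (the set sizes here are arbitrary choices):

```python
from itertools import permutations

def induced(sigma, sub):
    # delete the elements outside `sub` from the cycle representation of sigma
    pi = {}
    for x in sub:
        y = sigma[x]
        while y not in sub:      # follow the cycle until it re-enters sub
            y = sigma[y]
        pi[x] = y
    return tuple(sorted(pi.items()))

G = (0, 1, 2, 3)
sub = (0, 1, 2)
counts = {}
for image in permutations(G):
    sigma = dict(zip(G, image))
    key = induced(sigma, sub)
    counts[key] = counts.get(key, 0) + 1

# all 3! = 6 permutations of the subset occur, each equally often: uniformity
assert len(counts) == 6
assert len(set(counts.values())) == 1
```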

We construct a step $({\bf x}^{1},{\bf x}^{2})\rightarrow({\bf Y}^{1},{\bf Y}^{2})$ of the coupled processes as follows. For each $a\in A$, set $F^{1,a}=\{i:x^{1}_{i}=a\},\ F^{2,a}=\{i:x^{2}_{i}=a\}$. Take random permutations $\sigma^{1,a},\ \sigma^{2,a}$ as in the lemma, with cycles $C^{1,a}_{j},C^{2,a}_{j}$. Then $(\sigma^{1,a},a\in A)$ define a uniform random permutation $\sigma^{1}$ of $\{1,\ldots,m\}$, and similarly for $\sigma^{2}$. This completes stage 1. For stage 2, for each pair $(a,j)$ pick a uniform random element $\alpha^{a}_{j}$ of $A$ and set

 $Y^{1}_{i}=\alpha^{a}_{j}\mbox{ for every }i\in C^{1,a}_{j}$
 $Y^{2}_{i}=\alpha^{a}_{j}\mbox{ for every }i\in C^{2,a}_{j}.$

This specifies a Markov coupling. By construction

 $\mbox{if }x^{1}_{i}=x^{2}_{i}\mbox{ then }Y^{1}_{i}=Y^{2}_{i};\qquad\mbox{if }x^{1}_{i}\neq x^{2}_{i}\mbox{ then }P(Y^{1}_{i}=Y^{2}_{i})=1/|A|$

because $Y^{1}_{i}$ and $Y^{2}_{i}$ are independent uniform choices from $A$. So the coupled processes $({\bf X}^{1}(t),{\bf X}^{2}(t))$ satisfy

 $P(X^{1}_{i}(t)\neq X^{2}_{i}(t))=\left(1-\frac{1}{|A|}\right)^{t}P(X^{1}_{i}(0)\neq X^{2}_{i}(0)).$

In particular $P({\bf X}^{1}(t)\neq{\bf X}^{2}(t))\leq m(1-1/|A|)^{t}$ and the coupling inequality (12.3) gives (12.7).

## 12.1.7 Card-shuffling by random transpositions

We mentioned in Chapter 1 section 1.4 (yyy 7/20/99 version) that card-shuffling questions provided a natural extrinsic motivation for the study of mixing times. The examples here and in section 12.1.9 give a first study of mathematically (if not physically) simple random shuffles, and these discrete-time chains are prototypes for more complex chains arising in other contexts.

Consider a $d$-card deck. The random transpositions shuffle is:

Make two independent uniform choices of cards, and interchange them.

With chance $1/d$ the two choices are the same card, so no change results. To make a coupling analysis, we first give an equivalent reformulation.

Pick a label $a$ and a position $i$ uniformly at random; interchange the label-$a$ card with the card in position $i$.

This reformulation suggests the coupling in which the same choice of $(a,i)$ is used for each chain. In the coupled process (with two arbitrary starting states) let $D_{t}$ be the number of unmatched cards (that is, cards whose positions in the two decks are different) after $t$ steps. Then

(i) $D_{t+1}\leq D_{t}$.

(ii) $P(D_{t+1}\leq j-1|D_{t}=j)\geq j^{2}/d^{2}$.

Here (i) is clear, and (ii) holds because whenever the card labeled $a$ and the card in position $i$ are both unmatched, the step of the coupled chain creates at least one new match (of the card labeled $a$).

Noting that $D_{t}$ cannot take value $1$, we can use the decreasing functional lemma (Lemma 12.1) to show that the coupling time $T:=\min\{t:D_{t}=0\}$ satisfies

 $ET\leq\sum_{j=2}^{d}d^{2}/j^{2}\leq d^{2}({\textstyle\frac{\pi^{2}}{6}}-1).$

In particular, the coupling inequality implies $\tau_{1}^{{\rm disc}}=O(d^{2})$.

We revisit this example in Chapter 7 Example 18 (yyy 1/31/94 version) where it is observed that in fact

 $\tau_{1}^{{\rm disc}}\sim{\textstyle\frac{1}{2}}d\log d.$ (12.8)

An analogous continuous-space chain on the simplex is studied in Chapter 13-4 Example 3 (yyy 7/29/99 version).
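Properties (i) and (ii) of the coupled chain can be checked exhaustively for a fixed pair of decks. In the sketch below the deck size and the order of the second deck are arbitrary choices; every possible move $(a,i)$ is applied to both decks, verifying that matches are never destroyed and that at least $j^{2}$ of the $d^{2}$ moves create a new match.

```python
import random

d = 10
random.seed(0)                       # arbitrary fixed pair of decks
deck1 = list(range(d))
deck2 = list(range(d))
random.shuffle(deck2)

def move(deck, a, i):
    # interchange the label-a card with the card in position i
    deck = deck[:]
    j = deck.index(a)
    deck[i], deck[j] = deck[j], deck[i]
    return deck

j0 = sum(deck1[i] != deck2[i] for i in range(d))     # unmatched cards
good = 0
for a in range(d):                   # label choice
    for i in range(d):               # position choice
        n1, n2 = move(deck1, a, i), move(deck2, a, i)
        m = sum(n1[k] != n2[k] for k in range(d))
        assert m <= j0               # (i): matches are never destroyed
        if m < j0:
            good += 1
assert good >= j0 * j0               # (ii): chance of progress >= (j0/d)^2
```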

## 12.1.8 Reflection coupling on the $n$-cycle

Consider continuous-time random walk on the $n$-cycle $I=\{0,1,2,\ldots,n-1\}$. That is, the transition rates are

 $i\stackrel{1/2}{\longrightarrow}i+1;\quad i\stackrel{1/2}{\longrightarrow}i-1$

where here and below $\pm 1$ is interpreted modulo $n$. One can define a coupling by specifying the following transition rates for the bivariate process.

 $(i,i)\stackrel{1/2}{\longrightarrow}(i+1,i+1);\qquad(i,i)\stackrel{1/2}{\longrightarrow}(i-1,i-1)$
 $(i,j)\stackrel{1/2}{\longrightarrow}(i+1,j-1);\qquad(i,j)\stackrel{1/2}{\longrightarrow}(i-1,j+1)\qquad\mbox{(if }|j-i|>1\mbox{)}$
 $(i,i+1)\stackrel{1/2}{\longrightarrow}(i,i);\qquad(i,i+1)\stackrel{1/2}{\longrightarrow}(i+1,i+1);\qquad(i,i+1)\stackrel{1/2}{\longrightarrow}(i-1,i+2)$ (12.9)

and symmetrically for $(i+1,i)$. The joint process $((X^{(0)}_{t},X^{(k)}_{t}),t\geq 0)$ started at $(0,k)$ can be visualized as follows. Let $\phi(i):=k-i\bmod n$. Picture the operation of $\phi$ as reflection in a mirror which passes through the points $\{x_{1},x_{2}\}=\{k/2,k/2+n/2\bmod n\}$ each of which is either a vertex or the middle of an edge. In the simplest case, where $x_{1}$ and $x_{2}$ are vertices, let $(X^{(0)}_{t})$ be the chain started at vertex $0$, let $T^{0k}=\min\{t:X^{(0)}_{t}\in\{x_{1},x_{2}\}\}$ and define

 $X^{(k)}_{t}=\phi(X^{(0)}_{t}),\ t\leq T^{0k};\qquad X^{(k)}_{t}=X^{(0)}_{t},\ t>T^{0k}.$

This constructs a bivariate process with the transition rates specified above, with coupling time $T^{0k}$, and the pre-$T^{0k}$ path of $X^{(k)}$ is just the reflection of the pre-$T^{0k}$ path of $X^{(0)}$. In the case where a mirror point is the middle of an edge $(j,j+1)$ and the two moving particles are at $j$ and $j+1$, we don’t want simultaneous jumps across that edge; instead (12.9) specifies that attempted jumps occur at independent times, and the process is coupled at the time of the first such jump.

It’s noteworthy that in this example the coupling inequality

 $||P_{0}(X_{t}\in\cdot)-P_{k}(X_{t}\in\cdot)||\leq P(X^{(0)}_{t}\neq X^{(k)}_{t})$

is in fact an equality. Indeed this assertion, at a given time $t$, is equivalent to the assertion

 $P(X^{(0)}_{t}\in\cdot,T>t)\mbox{ and }P(X^{(k)}_{t}\in\cdot,T>t)\mbox{ have disjoint support}.$

But the support $A^{0}$ of the first measure is the set of vertices which can be reached from $0$ without meeting or crossing any mirror point (and similarly for $A^{k}$); and $A^{0}$ and $A^{k}$ are indeed disjoint.

It is intuitively clear that the maximum over $k$ of $T^{0k}$ is attained by $k=\lfloor n/2\rfloor$: we leave the reader to find the simple non-computational proof. It follows, taking e.g. the simplest case where $n$ is a multiple of $4$, that we can write

 $\bar{d}(t)=P(T_{\{-n/4,n/4\}}>t)$ (12.10)

where $T_{\{-n/4,n/4\}}$ is the hitting time of $\{-n/4,n/4\}$ for continuous-time random walk on the integers started at $0$.

Parallel results hold in discrete time, but only when the chains are suitably lazy. The point is that (12.9) isn’t allowable as transition probabilities. However, if we fix $0<a\leq 1/3$ then the chain with transition probabilities

 $i\stackrel{a}{\longrightarrow}i+1;\quad i\stackrel{a}{\longrightarrow}i-1$

(and which holds with the remaining probability) permits a coupling of the form (12.9) with all transition probabilities being $a$ instead of $1/2$. The analysis goes through as above, leading to (12.10) where $T$ refers to the discrete-time lazy walk on the integers.

Similar results hold for random walk on the $n$-path (Chapter 5 Example 8) (yyy 4/23/96 version), and we call couplings of this form reflection couplings. They are simpler in the context of continuous-path Brownian motion – see Chapter 13-4 section 1 (yyy 7/29/99 version).
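The equality assertion can be tested numerically for the lazy discrete-time version. The sketch below uses assumed parameters ($n=8$, $a=1/4$): it computes $\bar{d}(t)$ exactly from matrix powers and compares it with the survival probability of the walk on the integers absorbed at the mirror points, as in (12.10).

```python
import numpy as np

n, a = 8, 0.25                       # n a multiple of 4; lazy walk, a <= 1/3
P = np.zeros((n, n))
for i in range(n):
    P[i, (i + 1) % n] = a
    P[i, (i - 1) % n] = a
    P[i, i] = 1 - 2 * a

# walk on the integers with the mirror points -n/4, n/4 made absorbing
m = n // 4
states = list(range(-m, m + 1))
idx = {s: k for k, s in enumerate(states)}
A = np.zeros((len(states), len(states)))
for s in states:
    if abs(s) == m:
        A[idx[s], idx[s]] = 1.0      # absorbed at a mirror
    else:
        A[idx[s], idx[s + 1]] = a
        A[idx[s], idx[s - 1]] = a
        A[idx[s], idx[s]] = 1 - 2 * a

Pt, At = np.eye(n), np.eye(len(states))
for t in range(1, 31):
    Pt, At = Pt @ P, At @ A
    dbar = max(0.5 * np.abs(Pt[i] - Pt[j]).sum()
               for i in range(n) for j in range(n))
    survive = sum(At[idx[0], idx[s]] for s in states if abs(s) < m)
    assert abs(dbar - survive) < 1e-12    # (12.10) holds with equality
```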

## 12.1.9 Card-shuffling by random adjacent transpositions

As in section 12.1.7 we take a $d$-card deck; here we define a (lazy) shuffle by

With probability $1/2$ make no change; else pick a uniform random position $i\in\{1,2,\ldots,d\}$ and interchange the cards in positions $i$ and $i+1$ (interpret $d+1$ as $1$).

To study this by coupling, consider two decks. In some positions $i$ the decks match (the label on the card in position $i$ is the same in both decks). Write ${\cal D}$ for the set of $i$ such that either position $i$ or position $i+1$ or both match. Specify a step of the coupled chain by:

 $P(\mbox{interchange positions }i,i+1\mbox{ in both decks})={\textstyle\frac{1}{2d}},\quad i\in{\cal D}$
 $P(\mbox{interchange positions }i,i+1\mbox{ in deck 1 only})={\textstyle\frac{1}{2d}},\quad i\not\in{\cal D}$
 $P(\mbox{interchange positions }i,i+1\mbox{ in deck 2 only})={\textstyle\frac{1}{2d}},\quad i\not\in{\cal D}$
 $P(\mbox{no change in either deck})={\textstyle\frac{|{\cal D}|}{2d}}.$

Consider a particular card $a$. From the coupling description we see

(a) if the card gets matched then it stays matched;

(b) while unmatched, at each step the card can move in at most one of the decks.

It follows that the “clockwise” distance $D(t):=X_{a}^{1}(t)-X_{a}^{2}(t)\bmod d$ between the positions of card $a$ in the two decks behaves exactly as a lazy random walk on the $d$-cycle:

 $p_{j,j+1}=p_{j,j-1}=1/d,\quad 1\leq j\leq d$

until $D(t)$ hits $0$. By the elementary formula for mean hitting times on the cycle (Chapter 5 eq. (24)) (yyy 4/23/96 version), the mean time $T^{(a)}$ until card $a$ becomes matched satisfies

 $ET^{(a)}\leq{\textstyle\frac{d}{2}}\cdot{\textstyle\frac{d^{2}}{4}}={\textstyle\frac{d^{3}}{8}}$

uniformly over initial configurations. By submultiplicativity (Chapter 2 section 4.3) (yyy 9/10/99 version)

 $P(T^{(a)}>md^{3}/4)\leq 2^{-m},\ m=1,2,\ldots.$

The chains couple at time $T:=\max_{a}T^{(a)}$ and so

 $\bar{d}(md^{3}/4)\leq P(T>md^{3}/4)\leq d2^{-m}.$

In particular

 $\tau_{1}^{{\rm disc}}=O(d^{3}\log d).$
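The mean hitting time formula invoked above can be confirmed numerically: for the walk on the $d$-cycle with $p_{j,j+1}=p_{j,j-1}=1/d$, solving the linear system for $E_{j}T_{0}$ reproduces $\frac{d}{2}j(d-j)$, whose maximum is $d^{3}/8$ (the value of $d$ below is an arbitrary choice).

```python
import numpy as np

d = 12
# mean hitting time of 0 for the walk with p(j, j+1) = p(j, j-1) = 1/d solves
# (2/d) h_j - (1/d) h_{j+1} - (1/d) h_{j-1} = 1, with h_0 = h_d = 0
A = np.zeros((d - 1, d - 1))
b = np.ones(d - 1)
for row, j in enumerate(range(1, d)):
    A[row, row] = 2.0 / d
    if j + 1 < d:
        A[row, row + 1] = -1.0 / d
    if j - 1 >= 1:
        A[row, row - 1] = -1.0 / d
h = np.linalg.solve(A, b)

for row, j in enumerate(range(1, d)):
    assert abs(h[row] - (d / 2.0) * j * (d - j)) < 1e-7   # (d/2) j (d-j)
assert h.max() <= d ** 3 / 8.0 + 1e-7                     # maximum at j = d/2
```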

In this example it turns out that coupling does give the correct order of magnitude; the corresponding lower bound

 $\tau_{1}^{{\rm disc}}=\Omega(d^{3}\log d)$

was proved by Wilson [338]. Different generalizations of this example appear in section 12.1.13 and in Chapter 14 section 5 (yyy 3/10/94 version), where we discuss relaxation times.

A generalization of this example, the interchange process, is studied in Chapter 14 section 5 (yyy 3/10/94 version).

## 12.1.10 Independent sets

Fix a graph $G$ on $n$ vertices with maximal degree $r$. An independent set is a set of vertices which does not contain any adjacent vertices. Fix $m$ and let $I$ be the space of all independent sets of size $m$ in $G$. Picture an independent set ${\bf x}$ as a configuration of $m$ particles at distinct vertices, with no two particles at adjacent vertices. A natural discrete-time chain $({\bf X}_{t})$ on $I$ is

pick a uniform random particle $a$ and a uniform random vertex $v$; move particle $a$ to vertex $v$ if feasible, else make no move.

To study mixing times, we can define a coupling $({\bf X}_{t},{\bf Y}_{t})$ by simply making the same choice of $(a,v)$ in each of the two coupled chains, where at each time we invoke a matching of particles in the two realizations which is arbitrary except for matching particles at the same vertex. To analyze the coupling, let $\rho$ be the natural metric on $I$: $\rho({\bf x},{\bf y})=$ number of vertices occupied by particles of ${\bf x}$ but not by particles of ${\bf y}$. Clearly $D_{t}:=\rho({\bf X}_{t},{\bf Y}_{t})$ can change by at most $1$ on each step. Let us show that, for initial states with $\rho({\bf x},{\bf y})=d>0$,

 $P_{({\bf x},{\bf y})}(D_{1}=d+1)\leq\frac{m-d}{m}\ \frac{2d(r+1)}{n}$ (12.11)
 $P_{({\bf x},{\bf y})}(D_{1}=d-1)\geq\frac{d}{m}\ \frac{n-(m+d-2)(r+1)}{n}.$ (12.12)

For in order that $D_{1}=d+1$ we must first choose a matched particle $a$ (chance $(m-d)/m$) and then choose a vertex $v$ which is a neighbor of (or the same as) some vertex $v^{\prime}$ which is in exactly one of $\{{\bf x},{\bf y}\}$: there are $2d$ such vertices $v^{\prime}$ and hence at most $2d(r+1)$ possibilities for $v$. This establishes (12.11). Similarly, in order that $D_{1}=d-1$ it is sufficient that we pick an unmatched particle $a$ (chance $d/m$) and then choose a vertex $v$ which is not a neighbor of (or the same as) any vertex $v^{\prime}$ which is occupied in one or both realizations by some particle other than $a$: there are $m+d-2$ such forbidden vertices $v^{\prime}$ and hence at most $(m+d-2)(r+1)$ forbidden positions for $v$. This establishes (12.12).

From (12.11,12.12) a brief calculation gives

 $E_{({\bf x},{\bf y})}(D_{1}-d)\leq{\textstyle\frac{-d}{mn}}(n-(3m-d-2)(r+1))\leq{\textstyle\frac{-d}{mn}}(n-3(m-1)(r+1)).$

In other words

 $E_{({\bf x},{\bf y})}D_{1}\leq\kappa d;\quad\kappa:=1-\frac{n-3(m-1)(r+1)}{mn}.$

If $m<1+\frac{n}{3(r+1)}$ then $\kappa<1$. In this case, by copying the end of the analysis of the graph-coloring chain (section 12.1.5)

 $\bar{d}(t)\leq m\kappa^{t};\quad\tau_{1}=O\left({\textstyle\frac{\log m}{1-\kappa}}\right).$

To clarify the size-asymptotics, suppose $m,n\to\infty$ with $m/n\to\rho<\frac{1}{3(r+1)}$. Then for fixed $\rho$

 $\tau_{1}=O(n\log n).$

## 12.1.11 Two base chains for genetic algorithms

One way of motivating study of Markov chains on combinatorial sets with uniform stationary distributions is as “base chains” on which to base Markov chain Monte Carlo, that is to create other chains designed to have some specified distribution as their stationary distributions. Here is a typical base chain underlying genetic algorithms.

Fix integers $K,L\geq 1$ with $K$ even. A state of the chain is a family of words $({\bf x}^{k},1\leq k\leq K)$, where each word is a binary $L$-tuple ${\bf x}^{k}=(x^{k}_{l},1\leq l\leq L)$. A step of the chain is defined as follows.

Use a uniform random permutation $\pi$ of $\{1,2,\ldots,K\}$ to partition the words into $K/2$ pairs $\{{\bf x}^{\pi(1)},{\bf x}^{\pi(2)}\},\quad\{{\bf x}^{\pi(3)},{\bf x}^{\pi(4)}\}\ldots$. Create a new pair $\{{\bf y}^{1},{\bf y}^{2}\}$ from $\{{\bf x}^{\pi(1)},{\bf x}^{\pi(2)}\}$ by setting, independently for each $1\leq l\leq L$

 $P((y^{1}_{l},y^{2}_{l})=(x^{\pi(1)}_{l},x^{\pi(2)}_{l}))=P((y^{1}_{l},y^{2}_{l})=(x^{\pi(2)}_{l},x^{\pi(1)}_{l}))=1/2.$ (12.13)

Repeat independently for $1\leq i\leq K/2$ to create new pairs $\{{\bf y}^{2i-1},{\bf y}^{2i}\}$ from $\{{\bf x}^{\pi(2i-1)},{\bf x}^{\pi(2i)}\}$ . The new state is the family of words ${\bf y}^{k}$.

Associated with an initial state $({\bf x}^{k})$ is a vector of column sums ${\bf m}=(m_{l},1\leq l\leq L)$ where $m_{l}=\sum_{k}x^{k}_{l}$. These sums are preserved by the chain, so the proper state space is the space $I_{\bf m}$ of families with column-sums ${\bf m}$. The transition matrix is symmetric and so the chain is reversible with uniform stationary distribution on $I_{\bf m}$.

To describe the coupling, first rephrase (12.13) in words as “$(y^{1}_{l},y^{2}_{l})$ is $\{x^{\pi(1)}_{l},x^{\pi(2)}_{l}\}$ in random order, either forwards or backwards”. Now specify the coupling as follows.

(i) Use the same random permutation $\pi$ for both chains.

(ii) For each $i$ and each $l$, in creating the new letters $(y^{2i-1}_{l},y^{2i}_{l})$ from the old letters $(x^{\pi(2i-1)}_{l},x^{\pi(2i)}_{l})$, use the same choice (forwards or backwards) in both chains, except when $(x^{\pi(2i-1)}_{l},x^{\pi(2i)}_{l})=(1,0)$ for one chain and $=(0,1)$ for the other chain, in which case use opposite choices of (forwards, backwards) in the two chains.

To study the coupled processes $({\bf X}(t),\hat{{\bf X}}(t))$, fix $l$ and consider the number $W(t):=\sum_{k=1}^{K}|X^{k}_{l}(t)-\hat{X}^{k}_{l}(t)|$ of words in which the $l$’th letter is not matched in the two realizations. Suppose $W(0)=w$. Consider the creation of the first two new words in each chain. The only way that the number of matches changes is when we use opposite choices of (forwards, backwards) in the two chains, in which case two new matches are created. The chance that the $l$’th letter in the two chains is $1$ and $0$ in the $\pi(1)$’th word and is $0$ and $1$ in the $\pi(2)$’th word equals $\frac{w/2}{K}\times\frac{w/2}{K-1}$, and so (taking into account the symmetric case) the mean number of new matches at $l$ in these two words equals $\frac{w^{2}}{K(K-1)}$. Summing over the $K/2$ pairs,

 $E(W(1)|W(0)=w)=w-{\textstyle\frac{w^{2}}{2(K-1)}}.$

We can now apply a comparison lemma (Chapter 2 Lemma 32) (yyy 9/10/99 version) which concludes that the hitting time $T^{l}$ of $W(t)$ to $0$ satisfies

 $ET^{l}\leq\sum_{w=2}^{K}{\textstyle\frac{2(K-1)}{w^{2}}}\leq 2K.$

Since $T:=\max_{l}T^{l}$ is a coupling time, a now-familiar argument shows that for $u=1,2,\ldots$

 $\bar{d}(4uK)\leq P(T>4uK)\leq LP(T^{l}>u\cdot 4K)\leq L\ 2^{-u}$

and so

 $\tau_{1}^{{\rm disc}}=O(K\log L).$
###### Open Problem 12.5

Show $\tau_{1}^{{\rm disc}}=O(\log K\times\log L)$.

We expect this bound by analogy with the “random transpositions” shuffle (section 12.1.7). Loosely speaking, the action of the chain on a single position in words is like the random transpositions chain speeded up by a factor $K/2$, so from (12.8) we expect its mixing time to be $\Theta(\log K)$. It would be interesting to study this example via the group representation or strong stationary time techniques which have proved successful for the random transpositions chain.

To make a metaphor involving biological genetics, the letters represent chromosomes and the words represent the chromosomal structure of a gamete; the process is “sexual reproduction from the viewpoint of gametes”. If instead we want a word to represent a particular chromosome and the letters to represent genes within that chromosome, then instead of flipping bits independently it is more natural to model crossover. That is, consider a chain in which the rule for creating a new pair $\{{\bf y}^{2i-1},{\bf y}^{2i}\}$ from $\{{\bf x}^{\pi(2i-1)},{\bf x}^{\pi(2i)}\}$ becomes

Take $U_{i}$ uniform on $\{1,2,\ldots,L,L+1\}$. Define

 $\displaystyle(y^{2i-1}_{l},y^{2i}_{l})$ $\displaystyle=$ $\displaystyle(x^{\pi(2i-1)}_{l},x^{\pi(2i)}_{l}),\ l<U_{i}$ $\displaystyle(y^{2i-1}_{l},y^{2i}_{l})$ $\displaystyle=$ $\displaystyle(x^{\pi(2i)}_{l},x^{\pi(2i-1)}_{l}),\ l\geq U_{i}.$
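
A minimal Python sketch of this crossover rule (function and variable names are ours, not from the text):

```python
import random

def crossover(x, y):
    """Produce a new pair from parent words x, y of length L: pick U
    uniform on {1, ..., L+1}; letters in positions l < U are copied
    straight, letters in positions l >= U are exchanged (positions
    are 1-based in the text, 0-based in the lists here)."""
    L = len(x)
    u = random.randint(1, L + 1)    # U uniform on {1, ..., L+1}
    return x[:u - 1] + y[u - 1:], y[:u - 1] + x[u - 1:]
```

Note that $U=L+1$ copies the pair unchanged and $U=1$ exchanges the two words outright; any other value of $U$ gives a single crossover point.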

As an exercise (hint in Notes), find a coupling argument to show that for this chain

 $\tau_{1}^{{\rm disc}}=O(KL^{2}).$ (12.14)

## 12.1.12 Path coupling

In certain complicated settings it is useful to know that it is enough to couple versions of the chain which start in “nearby” states. To say this carefully, let $I$ be finite and consider a $\{0,1,2,\ldots\}$-valued function $\rho(i,j)$ defined on some symmetric subset $\mbox{{\cal E}}\subset I\times I$. Call $\rho$ a pre-metric if

(i) $\rho(i,j)=0\mbox{ iff }i=j$.

(ii) $\rho(i,j)=\rho(j,i)$.

(iii) $\rho(i_{0},i_{k})\leq\sum_{u=0}^{k-1}\rho(i_{u},i_{u+1})$, whenever $(i_{0},i_{k})$ and each $(i_{u},i_{u+1})$ are in ${\cal E}$.

Clearly a pre-metric extends to a metric $\bar{\rho}$ by defining

 $\bar{\rho}(i,j):=\min\left\{\sum_{u}\rho(i_{u},i_{u+1})\right\}$ (12.15)

the minimum over all paths $i=i_{0},i_{1},\ldots,i_{k}=j$ with each $(i_{u},i_{u+1})\in\mbox{{\cal E}}$. Note $\bar{\rho}(i,j)\leq\rho(i,j)$ for $(i,j)\in\mbox{{\cal E}}$.
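
Concretely, $\bar{\rho}$ is just the shortest-path metric on the weighted graph with vertex set $I$ and edge set ${\cal E}$; here is an illustrative sketch using Dijkstra's algorithm (the data layout and names are ours):

```python
import heapq

def extend_premetric(states, rho):
    """Extend a pre-metric rho -- a dict mapping an edge (i, j) in E to a
    positive integer -- to the shortest-path metric rho_bar of (12.15),
    by running Dijkstra's algorithm from every state."""
    adj = {i: [] for i in states}
    for (i, j), w in rho.items():       # edges are symmetric
        adj[i].append((j, w))
        adj[j].append((i, w))
    rho_bar = {}
    for src in states:
        dist = {src: 0}
        pq = [(0, src)]
        while pq:
            d, v = heapq.heappop(pq)
            if d > dist.get(v, float("inf")):
                continue                # stale queue entry
            for nb, w in adj[v]:
                if d + w < dist.get(nb, float("inf")):
                    dist[nb] = d + w
                    heapq.heappush(pq, (d + w, nb))
        for j, d in dist.items():
            rho_bar[(src, j)] = d
    return rho_bar
```

For instance, with $\rho(0,1)=1$ and $\rho(1,2)=2$ on a three-state space, the extension gives $\bar{\rho}(0,2)=3$ even though $(0,2)\notin{\cal E}$.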

###### Lemma 12.6

Let $S$ be a state space. Let $(\mu_{i,i+1},\ 0\leq i\leq d-1)$ be probability distributions on $S\times S$ such that the second marginal of $\mu_{i,i+1}$ coincides with the first marginal of $\mu_{i+1,i+2}$ for $0\leq i\leq d-2$. Then there exists an $S$-valued random sequence $(V_{i},0\leq i\leq d)$ such that $\mu_{i,i+1}={\rm dist}(V_{i},V_{i+1})$ for $0\leq i\leq d-1$.

Proof. Just take $(V_{i})$ to be the non-homogeneous Markov chain whose transition probabilities $P(V_{i+1}\in\cdot|V_{i}=v)$ are the conditional probabilities determined by the specified joint distribution $\mu_{i,i+1}$. $\Box$

###### Lemma 12.7 (Path-coupling lemma)

Take a discrete-time Markov chain $(X_{t})$ with finite state space $I$. Write $X^{(i)}_{1}$ for the time-$1$ value of the chain started at state $i$. Let $\rho$ be a pre-metric defined on some subset $\mbox{{\cal E}}\subset I\times I$. Suppose that for each pair $(i,j)$ in ${\cal E}$ we can construct a joint law $(X^{(i)}_{1},X^{(j)}_{1})$ such that

 $E\bar{\rho}(X^{(i)}_{1},X^{(j)}_{1})\leq\kappa\rho(i,j)$ (12.16)

for some constant $0<\kappa<1$. Then

 $\bar{d}(t)\leq\Delta_{\rho}\kappa^{t}$ (12.17)

where $\Delta_{\rho}:=\max_{i,j\in I}\bar{\rho}(i,j)$.

See the Notes for comments on the case $\kappa=1$.

Proof. Fix states $i,j$ and consider a path $(i_{u})$ attaining the minimum in (12.15). For each $u$ let $(X^{(i_{u})}_{1},X^{(i_{u+1})}_{1})$ have a joint distribution satisfying (12.16). By Lemma 12.6 there exists a random sequence $(X_{1}^{(i)}=X_{1}^{(i_{0})},X_{1}^{(i_{1})},\ldots,X_{1}^{(j)})$ consistent with these bivariate distributions. In particular, there is a joint distribution $(X^{(i)}_{1},X^{(j)}_{1})$ such that

 $E\bar{\rho}(X^{(i)}_{1},X^{(j)}_{1})\leq E\sum_{u}\bar{\rho}(X^{(i_{u})}_{1},X^{(i_{u+1})}_{1})\leq\kappa\sum_{u}\rho(i_{u},i_{u+1})=\kappa\bar{\rho}(i,j).$

This construction gives one step of a coupling of two copies of the chain started at arbitrary states, and so extends to a coupling $((X^{(i)}_{t},X^{(j)}_{t}),\ t=0,1,2,\ldots)$ of two copies of the entire process. The inequality above implies

 $E(\bar{\rho}(X^{(i)}_{t+1},X^{(j)}_{t+1})|X^{(i)}_{t},X^{(j)}_{t})\leq\kappa\bar{\rho}(X^{(i)}_{t},X^{(j)}_{t})$

and hence

 $P(X^{(i)}_{t}\neq X^{(j)}_{t})\leq E\bar{\rho}(X^{(i)}_{t},X^{(j)}_{t})\leq\kappa^{t}\bar{\rho}(i,j)\leq\kappa^{t}\Delta_{\rho}$

establishing (12.17). $\Box$
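
Bound (12.17) translates directly into a bound on the variation threshold: $\bar{d}(t)\leq e^{-1}$ once $\Delta_{\rho}\kappa^{t}\leq e^{-1}$, that is once $t\geq(1+\log\Delta_{\rho})/\log(1/\kappa)$. A small helper making this explicit (the function name is ours):

```python
import math

def tau1_from_path_coupling(delta_rho, kappa):
    """Smallest integer t with delta_rho * kappa**t <= exp(-1), i.e. the
    bound on the variation threshold implied by (12.17).
    Assumes delta_rho >= 1 and 0 < kappa < 1."""
    return math.ceil((1 + math.log(delta_rho)) / math.log(1 / kappa))
```

With $\kappa=1-1/w_{n}$ and $\Delta_{\rho}=O(n^{2})$ this is how the $n^{3}\log n$ bound of section 12.1.13 below arises.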

Bubley and Dyer [79] introduced Lemma 12.7 and the name path-coupling. It has proved useful in extending the range of applicability of coupling methods in settings such as graph-coloring (Bubley et al [77], Vigoda [333]) and independent sets (Luby and Vigoda [244]). Those arguments are too intricate for presentation here, but the following example will serve to illustrate the use of path-coupling.

## 12.1.13 Extensions of a partial order

Fix a partial order $\preceq$ on an $n$-element set, and let $I_{m}$ be the set of linear extensions of $\preceq$, that is to say total orders consistent with the given partial order. We can define a discrete-time Markov chain on $I_{m}$ by re-using the idea in the “random adjacent transpositions” example (section 12.1.9). Let $w(\cdot)$ be a probability distribution on $\{1,2,\ldots,n-1\}$. Define a step of the chain as follows.

Pick position $i$ with probability $w(i)$, and independently pick one of $\{$ stay, move $\}$ with probability $1/2$ each. If “move” is picked, interchange the elements in positions $i$ and $i+1$ if feasible (i.e. if consistent with the partial order); else make no change.

The transition matrix is symmetric, so the stationary distribution is uniform on $I_{m}$.
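
To make the chain concrete, here is a sketch of one step in Python (representing the partial order by a `precede` predicate is our choice, not from the text):

```python
import random

def step(sigma, precede, w):
    """One step of the chain on linear extensions.  sigma lists the n
    elements in their current total order; precede(a, b) is True when
    a strictly precedes b in the fixed partial order; w gives the n-1
    position weights (proportional to w(.) in the text)."""
    n = len(sigma)
    i = random.choices(range(n - 1), weights=w)[0]  # position i+1, 1-based
    if random.random() < 0.5:                       # "move" with probability 1/2
        a, b = sigma[i], sigma[i + 1]
        # Since sigma is a linear extension, b never precedes a; the
        # interchange is feasible iff a does not precede b.
        if not precede(a, b):
            sigma[i], sigma[i + 1] = b, a
    return sigma
```

When `precede` always returns `False` (the “null” partial order) this reduces to the random adjacent transpositions chain of section 12.1.9.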

To analyze by coupling, define one step of a bivariate coupled process as follows.

Make the same choice of $i$ in both chains. Also make the same choice of $\{$ move, stay $\}$, except in the case where the elements in positions $i$ and $i+1$ are the same elements in opposite order in the two realizations, in which case use the opposite choices of $\{$ stay, move $\}$.

The coupling is similar (but not identical) to that in section 12.1.9, where the underlying chain is that corresponding to the “null” partial order. For a general partial order, the coupling started from an arbitrary pair of states seems hard to analyze directly. For instance, an element in the same position in both realizations at time $t$ may not remain so at time $t+1$. Instead we use path-coupling, following an argument of Bubley and Dyer [80]. Call two states ${\bf x}$ and ${\bf y}$ adjacent if they differ by only one (not necessarily adjacent) transposition; if the transposed elements are in positions $i<j$ then let $\rho({\bf x},{\bf y})=j-i$. We want to study the increment $\Phi:=\rho(X_{1},Y_{1})-\rho({\bf x},{\bf y})$ where $(X_{1},Y_{1})$ is the coupled chain after one step from $({\bf x},{\bf y})$. The diagram shows a typical pair of adjacent states.

$\begin{array}[]{ccccccccccc}&a&b&c&\alpha&d&e&f&\beta&g&h\\ &a&b&c&\beta&d&e&f&\alpha&g&h\\ \mbox{position}&\cdot&\cdot&\cdot&i&\cdot&\cdot&\cdot&j&\cdot&\cdot\end{array}$

Observe first that any choice of position other than $i-1,i,j-1,j$ has no effect on $\Phi$. If position $i$ and “move” are chosen, then $\{\alpha,d\}$ are interchanged in the first chain and $\{\beta,d\}$ in the second; both interchanges are feasible, as can be checked from the relative orders in the other chain’s configuration. This has chance $w(i)/2$ and leads to $\Phi=-1$. If position $i-1$ and “move” are chosen (chance $w(i-1)/2$), then $\Phi=1$ if either or both moves are feasible, while $\Phi=0$ if neither is feasible. Arguing similarly for choices $j-1$ and $j$ leads to

 $E(\Phi)\leq{\textstyle\frac{1}{2}}(w(i-1)-w(i)-w(j-1)+w(j)).$

This estimate remains true if $j=i+1$ because in that case choosing position $i$ (chance $w(i)$) always creates a match. Now specify

 $w(i):={\textstyle\frac{i(n-i)}{w_{n}}},\quad w_{n}:=\sum_{j=1}^{n-1}j(n-j)$

and then $E\Phi\leq-\frac{j-i}{w_{n}}$. This leads to

 $E_{({\bf x},{\bf y})}\rho(X_{1},Y_{1})\leq\left(1-{\textstyle\frac{1}{w_{n}}}% \right)\rho({\bf x},{\bf y})$

for adjacent $({\bf x},{\bf y})$. We are thus in the setting of Lemma 12.7, which shows
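
The computation giving $E\Phi\leq-\frac{j-i}{w_{n}}$ rests on the identity $f(i-1)-f(i)-f(j-1)+f(j)=-2(j-i)$ for $f(i):=i(n-i)$, which a brute-force check confirms:

```python
# Verify f(i-1) - f(i) - f(j-1) + f(j) == -2*(j - i) for f(i) = i*(n - i),
# so that E(Phi) <= (1/2) * (-2*(j - i))/w_n = -(j - i)/w_n.
for n in range(3, 40):
    f = lambda i: i * (n - i)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            assert f(i - 1) - f(i) - f(j - 1) + f(j) == -2 * (j - i)
```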

 $\bar{d}(t)\leq\Delta_{n}\exp(-t/w_{n}).$

Since $\Delta_{n}=O(n^{2})$ and $w_{n}\sim n^{3}/6$ we obtain

 $\tau_{1}^{{\rm disc}}\leq({\textstyle\frac{1}{3}}+o(1))\ n^{3}\log n.$
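
The asymptotics used in the last step can be checked exactly: $w_{n}=\sum_{j=1}^{n-1}j(n-j)$ has the closed form $(n^{3}-n)/6$, hence $w_{n}\sim n^{3}/6$.

```python
# w_n = sum_{j=1}^{n-1} j*(n - j) equals (n**3 - n)/6 exactly.
for n in range(2, 200):
    w_n = sum(j * (n - j) for j in range(1, n))
    assert 6 * w_n == n**3 - n
```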