12 Coupling Theory and Examples (October 11, 1999)

12.2 Notes on Chapter 4-3

Coupling has become a standard tool in probability theory. The monograph of Lindvall [234] contains an extensive treatment, and history. In brief, Doeblin [126] used the idea of running two copies of the chain independently until they meet, in order to prove the convergence theorem (Chapter 2 Theorem 2) for finite-state chains, and this is now the textbook proof ([270] Theorem 1.8.3) of the convergence theorem for countable-state chains. The first wide-ranging applications were in the context of infinite-site interacting particles in the 1970s, where (e.g. Liggett [231]) couplings were used to study uniqueness of invariant distributions and convergence thereto. Theory connecting couplings and variation distance is implicit in Griffeath [172] and Pitman [282], though the first systematic use to bound variation distance in finite-state chains was perhaps Aldous [15], where examples including those in sections 12.1.4, 12.1.7 and 12.1.9 were given.

Section 12.1.2. There may exist Markov couplings which are not of the natural form (12.4), but examples typically rely on very special symmetry properties. For the theoretically-interesting notion of (non-Markov) maximal coupling see Chapter 9 section 1 (yyy 4/21/95 version).

The coupling inequality is often presented using a first chain started from an arbitrary point and a second chain started with the stationary distribution, leading to a bound on d(t)d(t) instead of d¯(t)\bar{d}(t). See Chapter 13-4 yyy for an example where this is used in order to exploit distributional properties of the stationary chain.

Section 12.1.5. This chain was first studied by Jerrum [201], who proved rapid mixing under the weaker assumption c2rc\geq 2r. His proof involved a somewhat more careful analysis of the coupling, exploiting the fact that “bad” configurations for the inequalities (12.5,12.6) are different. This problem attracted interest because the same constraint c2rc\geq 2r appears in proofs of the absence of phase transition in the zero-temperature anti-ferromagnetic Potts model in statistical physics. Proving rapid mixing under weaker hypotheses was first done by Bubley et al [77] in special settings and using computer assistance. Vigoda [333] then showed that rapid mixing still holds when c>116rc>\frac{11}{6}r: the proof first studies a different chain (still reversible with uniform stationary distribution) and then uses a comparison theorem.

Section 12.1.6. The chain here was suggested by Jerrum [198] in the context of a general question of counting the number of orbits of a permutation group acting on words. More general cases (using a subgroup of permutations instead of the whole permutation group) remain unanalyzed.

Section 12.1.10. See Luby and Vigoda [244] for more detailed study and references.

Section 12.1.11. Conceptually, the states in these examples are unordered families of words. In genetic algorithms for optimization one has an objective function f:{0,1}LRf:\{0,1\}^{L}\to R and accepts or rejects offspring words with probabilities depending on their ff-values.

Interesting discussion of some different approaches to genetics and computation is in Rabani et al [287].

Hint for (12.14). First match the LL’th letters in each word, using the occasions when Ui=LU_{i}=L or L+1L+1. This takes O(LK)O(LK) time.

Section 12.1.12. Another setting where path-coupling has been used is contingency tables: Dyer and Greenhill [137].

In the case where (12.16) holds for κ=1\kappa=1, one might expect a bound of the form

d¯(t)=O(Δρ2/α)\bar{d}(t)=O(\Delta_{\rho}^{2}/\alpha) (12.18)
α:=min(i,j)P(ρ(X1(i),X1(j))ρ(i,j)-1).\alpha:=\min_{(i,j)\in\mbox{${\cal E}$}}P(\rho(X_{1}^{(i)},X_{1}^{(j)})\leq% \rho(i,j)-1).

by arguing that, for arbitrary (i,k)(i,k), the process ρ(Xt(i),Xt(k))\rho(X^{(i)}_{t},X^{(k)}_{t}) can be compared to a mean-zero random walk with chance α\alpha of making a negative step. Formalizing this idea seems subtle. Consider three states i,j,ki,j,k with (i,j)(i,j)\in\mbox{${\cal E}$} and (j,k)(j,k)\in\mbox{${\cal E}$}. Suppose

P(ρ(X1(i),X1(j))=ρ(i,j)+1)=P(ρ(X1(i),X1(j))=ρ(i,j)-1)=αP(\rho(X^{(i)}_{1},X^{(j)}_{1})=\rho(i,j)+1)=P(\rho(X^{(i)}_{1},X^{(j)}_{1})=% \rho(i,j)-1)=\alpha

and otherwise ρ(,)\rho(\cdot,\cdot) is unchanged; similarly for (j,k)(j,k). The changes for the (i,j)(i,j) process and for the (j,k)(j,k) process will typically be dependent, and in the extreme case we might have

ρ(X1(i),X1(j))=ρ(i,j)+1 iff ρ(X1(j),X1(k))=ρ(j,k)-1\rho(X^{(i)}_{1},X^{(j)}_{1})=\rho(i,j)+1\mbox{ iff }\rho(X^{(j)}_{1},X^{(k)}_% {1})=\rho(j,k)-1

and symmetrically, in which case ρ(Xt(i),Xt(k))\rho(X^{(i)}_{t},X^{(k)}_{t}) might not change at all. Thus proving a result like (12.18) must require further assumptions.

Section 12.1.13. The Markov chain here (with uniform weights) was first studied by Karzanov and Khachiyan [211].