Recall from Chapter 2 section 4.1 (yyy 9/10/99 version) several ways in which variation distance is used to measure the deviation from stationarity of the time-$t$ distribution of a Markov chain: writing $\|\theta_1 - \theta_2\| := \frac{1}{2}\sum_k |\theta_1(k) - \theta_2(k)|$ for variation distance,
$$d(t) := \max_i \|P_i(X_t \in \cdot) - \pi\|, \qquad \bar d(t) := \max_{i,j} \|P_i(X_t \in \cdot) - P_j(X_t \in \cdot)\|.$$
Recall also from yyy the definition of variation threshold time
$$\tau_1 := \min\{t \ge 0 : d(t) \le e^{-1}\}.$$
Since $d(t)$ is affected by continuization, when using the above definition with a discrete-time chain we write $\tau_1^{\mathrm{disc}}$ for emphasis.
Coupling provides a methodology for seeking to upper bound $\bar d(t)$, and hence $\tau_1$. After giving the (very simple) theory in section 12.1.1 and some discussion in section 12.1.2, we proceed to give a variety of examples of its use. We shall say more about the theory and applications of coupling in Chapter 8 (where we discuss the related idea of coupling from the past) and in Chapter 10 (on interacting random walks).
Consider a finite-state chain in discrete or continuous time. Fix states $i, j$. Suppose we construct a coupling, that is a joint process $(X_t, Y_t)$ such that
$(X_t)$ is distributed as the chain started at $i$;
$(Y_t)$ is distributed as the chain started at $j$. $\qquad$ (12.1)
And suppose there is a random time $T \le \infty$ such that
$$X_t = Y_t \quad \text{for all } t \ge T. \qquad (12.2)$$
Call such a $T$ a coupling time. Then the coupling inequality is
$$\|P_i(X_t \in \cdot) - P_j(X_t \in \cdot)\| \le P(T > t), \quad t \ge 0. \qquad (12.3)$$
The inequality holds because, for any subset $A$ of the state space,
$$P(X_t \in A) - P(Y_t \in A) = P(X_t \in A, T > t) - P(Y_t \in A, T > t) \le P(T > t).$$
The coupling inequality provides a method of bounding the variation distance $\bar d(t)$, because if we can construct a coupling, with coupling time $T_{i,j}$ say, for an arbitrary pair of initial states $(i,j)$, then
$$\bar d(t) \le \max_{i,j} P(T_{i,j} > t).$$
The reader may wish to look at a few of the examples before reading this section in detail.
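For readers who like to experiment, the coupling inequality (12.3) can be checked numerically. The sketch below (Python with NumPy; the 3-state transition matrix is an arbitrary illustration, not taken from the text) builds the simplest Markov coupling — run the two copies independently until they meet, then run them together — and verifies that variation distance is dominated by the meeting-time tail, both computed exactly via matrix powers.

```python
import itertools
import numpy as np

# Transition matrix of a small illustrative 3-state chain (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
n = P.shape[0]

# Independent ("product") Markov coupling: off the diagonal the two copies
# move independently; on the diagonal they move together.  States of the
# coupled chain are pairs (x, y), indexed as x * n + y.
Pc = np.zeros((n * n, n * n))
for x, y in itertools.product(range(n), repeat=2):
    if x == y:
        for z in range(n):
            Pc[x * n + y, z * n + z] = P[x, z]
    else:
        for u, v in itertools.product(range(n), repeat=2):
            Pc[x * n + y, u * n + v] = P[x, u] * P[y, v]

# Since the copies stick together after meeting, {T <= t} = {X_t = Y_t},
# so P(T > t) is exactly the off-diagonal mass of the coupled chain.
Pt, Pct = np.eye(n), np.eye(n * n)
for t in range(1, 20):
    Pt, Pct = Pt @ P, Pct @ Pc
    for i, j in itertools.product(range(n), repeat=2):
        tv = 0.5 * np.abs(Pt[i] - Pt[j]).sum()
        meet = sum(Pct[i * n + j, z * n + z] for z in range(n))
        assert tv <= (1 - meet) + 1e-12   # coupling inequality (12.3)
print("coupling inequality verified for all pairs and t < 20")
```

Any other coupling with the correct marginals would serve equally well; the independent coupling is merely the easiest to write down.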
In applying coupling methodology there are two issues. First we need to specify the coupling, then we need to analyze the coupling time. The most common strategy for constructing couplings is via Markov couplings, as follows. Suppose the underlying chain has state space $I$ and (to take the continuous-time case) transition rate matrix $Q = (q(i,k))$. Consider a transition rate matrix $\widetilde Q$ on the product space $I \times I$. Write the entries of $\widetilde Q$ as $\tilde q(i,j;\,k,l)$ instead of the logical-but-fussy $\tilde q((i,j),(k,l))$. Suppose that, for each pair $(i,j)$ with $j \ne i$,
$$\sum_l \tilde q(i,j;\,k,l) = q(i,k) \ \ \forall k, \qquad \sum_k \tilde q(i,j;\,k,l) = q(j,l) \ \ \forall l; \qquad (12.4)$$
in other words, each coordinate of the joint process moves marginally at the rates of the $Q$-chain. And suppose that
$$\tilde q(i,i;\,k,l) = 0 \quad \text{for } l \ne k,$$
so that the two coordinates, once equal, move in unison.
Take $(X_t, Y_t)$ to be the chain on $I \times I$ with transition rate matrix $\widetilde Q$ and initial position $(i,j)$. Then (12.1) must hold, and $T := \min\{t : X_t = Y_t\}$ is a coupling time. This construction gives a natural Markov coupling, and all the examples where we use the coupling inequality will be of this form. In practice it is much more understandable to define the joint process in words, and we usually do so.
In constructing and analyzing couplings, we often exploit (explicitly or implicitly) some integer-valued metric $\rho$ on the state space $I$. Then with a Markovian coupling,
$$T = \min\{t : \rho(X_t, Y_t) = 0\}$$
and it is enough to study the integer-valued process $\rho(X_t, Y_t)$. Typically $\rho(X_t, Y_t)$ is not Markov, but one can try to compare it with some integer-valued Markov process $(Z_t)$. Indeed, in defining the coupling one has in mind trying to make such a comparison possible. Often one shows that, for any initial pair $(i,j)$, the random time $T$ is stochastically smaller than the hitting time $U$ for the comparison chain to reach $0$ starting from $\rho(i,j)$. This would imply
$$P(T_{i,j} > t) \le P_{\rho(i,j)}(U > t).$$
Finally, one does calculations with the integer-valued chain $(Z_t)$, either bounding the tail probability $P_z(U > t)$ directly or (what is often simpler) just bounding the expectation $E_z U$, so that by Markov's inequality and the submultiplicativity property (Chapter 2 Lemma 20) (yyy 9/10/99 version) we have in continuous time
$$\bar d(kt) \le \left(\bar d(t)\right)^k \le \left(\frac{\max_z E_z U}{t}\right)^k, \quad k = 1, 2, \ldots.$$
Here is perhaps the simplest comparison lemma, whose proof is left to the reader.
Let $(Z_t)$ be a Markov chain on a finite state space $S$ and
$f: S \to \{0, 1, 2, \ldots\}$ a function.
Suppose that, for some function $a(\cdot) > 0$, for each $t$ and each initial state $z$ with
$f(z) \ge 1$,
(i) $f(Z_{t+1}) \le f(Z_t)$;
(ii) $P(f(Z_{t+1}) < f(Z_t) \mid Z_t = z) \ge a(f(z))$.
Then
$$E_z T \le \sum_{d=1}^{f(z)} \frac{1}{a(d)}$$
where $T := \min\{t : f(Z_t) = 0\}$.
We now start a series of examples. Note that when presenting a coupling proof we don’t need to explicitly check irreducibility, because the conclusion of a bound on coupling time obviously implies irreducibility.
(Chapter 5 Example 16).
Consider an $r$-regular $n$-vertex graph. Write $N(v)$ for the set of neighbors of $v$. For any pair $(v,w)$ we can define a bijection $\theta_{v,w}: N(v) \to N(w)$ such that $\theta_{v,w}(x) = x$ for $x \in N(v) \cap N(w)$. Consider discrete-time random walk on the graph. We define a “greedy coupling” by specifying the joint transition matrix
$$\tilde p(v,w;\; x, \theta_{v,w}(x)) = 1/r, \quad x \in N(v).$$
That is, from vertices $v$ and $w$, if the first chain jumps to $x$ then the second chain jumps to $\theta_{v,w}(x)$, and we maximize the chance of the two chains meeting after a single step. In general one cannot get useful bounds on the coupling time. But consider the dense case, where $r > n/2$. As observed in Chapter 5 Example 16, here $|N(v) \cap N(w)| \ge 2r - n$, and so the coupled process has the property that, for $v \ne w$,
$$P(X_{t+1} = Y_{t+1} \mid X_t = v, Y_t = w) \ge \frac{2r - n}{r}$$
implying that the coupling time $T$ (for any initial pair of states) satisfies
$$P(T > t) \le \left(\frac{n - r}{r}\right)^t.$$
So the coupling inequality implies $\bar d(t) \le \left(\frac{n-r}{r}\right)^t$. In particular the variation threshold satisfies $\tau_1^{\mathrm{disc}} = O(1)$ as $n \to \infty$ with $r/n$ bounded away from $1/2$.
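The key estimate $|N(v) \cap N(w)| \ge 2r - n$, which drives the one-step meeting probability of the greedy coupling, is easy to check by computer. The sketch below (Python; the circulant graph is a hypothetical dense example, not from the text) verifies it for an $r$-regular graph with $r > n/2$.

```python
# For a dense r-regular graph the greedy coupling meets in one step with
# probability |N(v) ∩ N(w)| / r, since theta fixes the common neighbors.
# Check the inclusion-exclusion bound |N(v) ∩ N(w)| >= 2r - n on a
# circulant graph: vertex i adjacent to the 6 nearest vertices each side.
n, half = 20, 6          # so r = 12 > n/2 = 10
r = 2 * half
nbrs = {v: {(v + s) % n for s in range(1, half + 1)} |
           {(v - s) % n for s in range(1, half + 1)} for v in range(n)}
assert all(len(nb) == r for nb in nbrs.values())   # r-regular

worst = min(len(nbrs[v] & nbrs[w])
            for v in range(n) for w in range(n) if v != w)
# |N(v)| + |N(w)| - |N(v) ∪ N(w)| >= 2r - n, since the union has <= n vertices
assert worst >= 2 * r - n
print("one-step meeting probability >=", (2 * r - n) / r)
```

The same check can be run on any dense regular graph; only the neighbor dictionary needs to change.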
(Chapter 5 Example 15).
For $x$ and $y$ in $\{0,1\}^d$, let $D(x,y)$ be the set of coordinates where $x$ and $y$ differ. Write $x^i$ for the state obtained by changing the $i$'th coordinate of $x$. Recall that in continuous time the components move independently as $2$-state chains with transition rates $1/d$. Define a coupling in words as “run unmatched coordinates independently until they match, and then run them together”. Formally, the non-zero transitions of the joint process are
$$(x,y) \to (x^i, y) \ \text{ and } \ (x,y) \to (x, y^i), \ \text{ each at rate } 1/d, \quad i \in D(x,y);$$
$$(x,y) \to (x^i, y^i) \ \text{ at rate } 1/d, \quad i \notin D(x,y).$$
For each coordinate which is initially unmatched, it takes exponential (rate $2/d$) time until it is matched, and so the coupling time $T$ satisfies
$$T = \max_{1 \le i \le D_0} \xi_i$$
where the $\xi_i$ are independent exponential (rate $2/d$) and $D_0 = |D(x,y)|$ is the initial number of unmatched coordinates. So
$$P(T > t) = 1 - (1 - e^{-2t/d})^{D_0} \le D_0\, e^{-2t/d} \le d\, e^{-2t/d}$$
and the coupling inequality bounds variation distance as
$$\bar d(t) \le d\, e^{-2t/d}.$$
This leads to an upper bound on the variation threshold time:
$$\tau_1 \le \frac{d}{2}(1 + \log d).$$
This example is discussed in more detail in Chapter 5 Example 15 (yyy 4/23/96 version), where it is shown that
$$\tau_1 = \left(\tfrac14 + o(1)\right) d \log d,$$
so the coupling bound is off by a factor of $2$.
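The tail formula for the maximum of independent exponentials makes this bound easy to check numerically. The sketch below (Python; assuming, as above, that each unmatched coordinate matches at rate $2/d$) verifies the union bound and the resulting threshold estimate for a specific $d$.

```python
import math

# Under the coupling, each unmatched coordinate matches at exponential
# rate 2/d, independently, so P(T > t) = 1 - (1 - e^{-2t/d})^{D_0}.
# Check the union bound P(T > t) <= d e^{-2t/d}, and that
# t* = (d/2)(1 + log d) drives the bound down to e^{-1}.
d = 64
for k in range(1, 50):
    t = k * d / 10.0
    exact = 1.0 - (1.0 - math.exp(-2 * t / d)) ** d   # worst case D_0 = d
    assert exact <= d * math.exp(-2 * t / d) + 1e-12
t_star = (d / 2.0) * (1.0 + math.log(d))
assert d * math.exp(-2 * t_star / d) <= math.exp(-1) + 1e-12
print("coupling bound gives tau_1 <=", t_star)
```

At $t^* = \frac{d}{2}(1+\log d)$ the bound $d\,e^{-2t^*/d}$ equals $e^{-1}$ exactly, which is where the threshold estimate comes from.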
Fix an $n$-vertex graph $G$ with maximal degree $r$. Fix an integer $c$ and consider $\{1, 2, \ldots, c\}$ as a set of colors. Let $S$ be the set of $c$-colorings of $G$, where a $c$-coloring is an assignment of a color to each vertex in such a way that no two adjacent vertices have the same color. One can put a natural “product graph” structure on $S$, in which two colorings are adjacent if they differ at only one vertex. It is not hard to check that the condition $c \ge r + 2$ ensures that $S$ is non-empty and the associated graph is connected. There is a natural discrete-time Markov chain on $S$:
Pick a vertex $v$ of $G$ uniformly at random, pick a color $\gamma$ uniformly at random, assign color $\gamma$ to vertex $v$ if feasible (i.e. if no neighbor of $v$ has color $\gamma$), else retain the existing color of $v$.
Under certain conditions a simple coupling analysis succeeds in bounding the mixing time. (The bound is far from sharp – see Notes).
If $c > 4r$ then
$$\bar d(t) \le n \exp\left(-\frac{(c - 4r)\,t}{nc}\right) \quad \text{and so} \quad \tau_1^{\mathrm{disc}} \le \frac{nc\,(1 + \log n)}{c - 4r}.$$
Proof. We couple two versions of the chain by simply using the same $v$ and $\gamma$ in both chains at each step. Write $D_t$ for the number of vertices at which the colors in the two chains differ. Then $|D_{t+1} - D_t| \le 1$, and the key estimate is the following.
Conditional on the state of the coupled process at time $t$,
$$P(D_{t+1} - D_t = 1) \le \frac{2 r D_t}{n c} \qquad (12.5)$$
$$P(D_{t+1} - D_t = -1) \ge \frac{(c - 2r)\, D_t}{n c}. \qquad (12.6)$$
Proof. In order that $D_{t+1} = D_t + 1$ it is necessary that the chosen pair $(v, \gamma)$ is such that
(*) there exists a neighbor ($w$, say) of $v$ such that $w$ has color $\gamma$ in one chain but not in the other chain.
But the total number of pairs $(v, \gamma)$ equals $nc$, while the number of pairs satisfying (*) is at most $2 r D_t$: each of the $D_t$ unmatched vertices $w$ has at most $r$ neighbors $v$, and for each such $v$ at most $2$ colors $\gamma$ qualify. This establishes (12.5). Similarly, for $D_{t+1} = D_t - 1$ it is sufficient that $v$ is currently unmatched and that no neighbor of $v$ in either chain has color $\gamma$; the number of such pairs is at least $D_t\,(c - 2r)$.
Lemma 12.3 implies
$$E(D_{t+1} \mid D_t) \le D_t \left(1 - \frac{c - 4r}{nc}\right)$$
and so
$$E D_t \le n \left(1 - \frac{c - 4r}{nc}\right)^t \le n \exp\left(-\frac{(c - 4r)\,t}{nc}\right).$$
Since $P(T > t) = P(D_t \ge 1) \le E D_t$, we have, for any initial pair of states,
$$P(T > t) \le n \exp\left(-\frac{(c - 4r)\,t}{nc}\right)$$
and the coupling lemma establishes the Proposition.
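A quick simulation illustrates the proof. The sketch below (Python; the $8$-cycle with $c = 9$ colors is a hypothetical instance chosen so that $c > 4r$ holds with $r = 2$) runs the coupled chains with common $(v, \gamma)$ proposals, checks that $D_t$ moves by at most one per step, and waits for the chains to couple.

```python
import random

# Coupled Glauber dynamics on proper colorings of the 8-cycle (degree r = 2),
# with c = 9 > 4r colors; both chains use the same (vertex, color) proposal.
random.seed(1)
n, c = 8, 9
edges = [(i, (i + 1) % n) for i in range(n)]
nbr = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def step(col, v, g):
    # recolor v with g if no neighbor of v already has color g
    if all(col[w] != g for w in nbr[v]):
        col[v] = g

x = [(i % 3) * 3 for i in range(n)]        # two distinct proper colorings
y = [(i % 3) * 3 + 1 for i in range(n)]
assert all(x[a] != x[b] and y[a] != y[b] for a, b in edges)

prev_d = sum(1 for i in range(n) if x[i] != y[i])
coupled_at = None
for t in range(1, 200000):
    v, g = random.randrange(n), random.randrange(c)
    step(x, v, g)
    step(y, v, g)
    d = sum(1 for i in range(n) if x[i] != y[i])
    assert abs(d - prev_d) <= 1            # only vertex v can change
    prev_d = d
    if d == 0:
        coupled_at = t
        break
assert coupled_at is not None and x == y
print("chains coupled after", coupled_at, "steps")
```

With the RNG seeded the run is reproducible; on instances this small, coupling happens long before the generous step cap.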
The examples in sections 12.1.4 and 12.1.5 were simple prototypes of interacting particle systems (more examples of which appear in Chapter 10), whose characteristic property is that a step of the chain involves only “local” change. Chains making “global” changes are often hard to analyze, but here is a simple example.
Fix a finite alphabet of size . Fix , and consider the set of “words” with each . Consider the discrete-time Markov chain on in which a step is specified by the following two-stage procedure.
Stage 1. Pick a permutation of uniformly at random from the set of permutations satisfying .
Stage 2. Let be the cycles of . For each , and independently as varies, pick uniformly an element of , and define for every .
Here is an alternative description. Write for the set of permutations of . Consider the bipartite graph on vertices with edge-set . Then the chain is random walk on this bipartite graph, watched every second step (that is, when it is in ).
From the second description, it is clear that the stationary probabilities are proportional to the degree of in the bipartite graph, giving
where . We shall use a coupling argument to establish the following bound on variation distance:
(12.7) |
implying that the variation threshold satisfies
The construction of the coupling depends on the following lemma.
Given finite sets we can construct (for ) a uniform random permutation of with cycles , where the cycles are labeled such that
In the equality we interpret the as sets.
Proof. Given a permutation of a finite set , there is an induced permutation on a subset obtained by deleting from the cycle representation of those elements not in . It is easy to check that, for a uniform random permutation of , the induced random permutation of is also uniform. In the setting of the lemma, take a uniform random permutation of , and let be the induced random permutations of . Then the equality holds because each side is representing the cycles of the induced permutation on .
We construct a step of the coupled processes as follows. For each , set . Take random permutations as in the lemma, with cycles . Then define a uniform random permutation of , and similarly for . This completes stage 1. For stage 2, for each pair pick a uniform random element of and set
This specifies a Markov coupling. By construction
because and are independent uniform choices from . So the coupled processes satisfy
In particular and the coupling inequality (12.3) gives (12.7).
We mentioned in Chapter 1 section 1.4 (yyy 7/20/99 version) that card-shuffling questions provided a natural extrinsic motivation for the study of mixing times. The examples here and in section 12.1.9 give a first study of mathematically (if not physically) simple random shuffles, and these discrete-time chains are prototypes for more complex chains arising in other contexts.
Consider an $n$-card deck. The random transpositions shuffle is:
Make two independent uniform choices of cards, and interchange them.
With chance $1/n$ the two choices are the same card, so no change results. To make a coupling analysis, we first give an equivalent reformulation.
Pick a label $\ell$ and a position $u$ uniformly at random; interchange the label-$\ell$ card with the card in position $u$.
This reformulation suggests the coupling in which the same choice of $(\ell, u)$ is used for each chain.
In the coupled process (with two arbitrary starting states), let $D_t$ be the number of unmatched cards (that is, cards whose positions in the two decks are different) after $t$ steps. Then
(i) $D_{t+1} \le D_t$;
(ii) $P(D_{t+1} < D_t \mid D_t = d) \ge (d/n)^2$.
Here (i) is clear, and (ii) holds because, whenever the card labeled $\ell$ and the card in position $u$ are both unmatched, the step of the coupled chain creates at least one new match (of the card labeled $\ell$).
Noting that $D_t$ cannot take the value $1$, we can use the decreasing functional lemma (Lemma 12.1) to show that the coupling time $T$ satisfies
$$E T \le \sum_{d=2}^{n} \frac{n^2}{d^2} \le \left(\frac{\pi^2}{6} - 1\right) n^2.$$
In particular, the coupling inequality implies $\tau_1^{\mathrm{disc}} = O(n^2)$.
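The label-and-position coupling is short to implement. The sketch below (Python; deck size and starting states are arbitrary illustrations) checks property (i) empirically — the number of unmatched cards never increases — and runs the coupled shuffles until the two decks agree.

```python
import random

# Couple two decks via the (label, position) reformulation: in each deck,
# swap the card labelled l with the card in position u.
random.seed(2)
n = 12
deck1 = list(range(n))
deck2 = list(range(1, n)) + [0]            # arbitrary second starting state

def swap_label_position(deck, l, u):
    i = deck.index(l)                      # current position of card l
    deck[i], deck[u] = deck[u], deck[i]

T = None
prev = sum(1 for i in range(n) if deck1[i] != deck2[i])
for t in range(1, 100000):
    l, u = random.randrange(n), random.randrange(n)
    swap_label_position(deck1, l, u)
    swap_label_position(deck2, l, u)
    d = sum(1 for i in range(n) if deck1[i] != deck2[i])
    assert d <= prev                       # property (i): matches never break
    prev = d
    if d == 0:
        T = t
        break
assert T is not None and deck1 == deck2
print("decks coupled after", T, "steps")
```

Note that after each step the label-$\ell$ card sits in position $u$ in both decks, which is why matches, once made, persist.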
We revisit this example in Chapter 7 Example 18 (yyy 1/31/94 version), where it is observed that in fact
$$\tau_1^{\mathrm{disc}} = \Theta(n \log n). \qquad (12.8)$$
An analogous continuous-space chain on the simplex is studied in Chapter 13-4 Example 3 (yyy 7/29/99 version).
Consider continuous-time random walk on the $n$-cycle $\mathbb{Z}_n$. That is, the transition rates are
$$q(i, i+1) = q(i, i-1) = \tfrac12,$$
where here and below arithmetic is interpreted modulo $n$. One can define a coupling by specifying the following transition rates for the bivariate process:
$$\tilde q(x, y;\; x+1, y-1) = \tilde q(x, y;\; x-1, y+1) = \tfrac12 \qquad (\text{if } y \ne x \pm 1)$$
$$\tilde q(x, y;\; x+1, y) = \tilde q(x, y;\; x, y-1) = \tilde q(x, y;\; x-1, y+1) = \tfrac12 \qquad (\text{if } y = x+1) \qquad (12.9)$$
and symmetrically for $y = x - 1$. The joint process started at $(i,j)$ can be visualized as follows. Let $m = (i + j)/2$. Picture the operation of the coupling as reflection in a mirror which passes through the points $m$ and $m + n/2$, each of which is either a vertex or the middle of an edge. In the simplest case, where $m$ and $m + n/2$ are vertices, let $(X_t)$ be the chain started at vertex $i$, let $T := \min\{t : X_t \in \{m, m + n/2\}\}$ and define $Y_t$ to be the mirror reflection of $X_t$ for $t < T$, and $Y_t = X_t$ for $t \ge T$.
This constructs a bivariate process with the transition rates specified above, with coupling time $T$, and the pre-$T$ path of $(Y_t)$ is just the reflection of the pre-$T$ path of $(X_t)$. In the case where a mirror point is the middle of an edge and the two moving particles are at its two end-vertices, we don't want simultaneous jumps across that edge; instead (12.9) specifies that attempted jumps occur at independent times, and the process is coupled at the time of the first such jump.
It's noteworthy that in this example the coupling inequality (12.3) is in fact an equality. Indeed this assertion, at a given time $t$, is equivalent to the assertion that the measures $P_i(X_t \in \cdot,\, T > t)$ and $P_j(Y_t \in \cdot,\, T > t)$ are mutually singular. But the support of the first measure is the set $A_i$ of vertices which can be reached from $i$ without meeting or crossing any mirror point (and similarly for $A_j$); and $A_i$ and $A_j$ are indeed disjoint.
It is intuitively clear that the maximum over initial pairs $(i,j)$ of $P(T > t)$ is attained by antipodal pairs: we leave the reader to find the simple non-computational proof. It follows, taking e.g. the simplest case where $n$ is a multiple of $4$, that we can write
$$\bar d(t) = P_0(U > t) \qquad (12.10)$$
where $U$ is the hitting time of $\{-n/4, n/4\}$ for continuous-time random walk on the integers started at $0$.
Parallel results hold in discrete time, but only when the chains are suitably lazy. The point is that (12.9) isn't allowable as transition probabilities. However, if we fix $0 < \varepsilon \le 1/3$ then the chain with transition probabilities
$$p(i, i+1) = p(i, i-1) = \varepsilon$$
(and which holds with the remaining probability $1 - 2\varepsilon$) permits a coupling of the form (12.9) with all transition probabilities being $\varepsilon$ instead of $\tfrac12$. The analysis goes through as above, leading to (12.10) where $U$ now refers to the discrete-time lazy walk on the integers.
Similar results hold for random walk on the $n$-path (Chapter 5 Example 8) (yyy 4/23/96 version), and we call couplings of this form reflection couplings. They are simpler in the context of continuous-path Brownian motion – see Chapter 13-4 section 1 (yyy 7/29/99 version).
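A discrete-time sketch of the reflection coupling (Python; the choice $\varepsilon = 1/3$, making clockwise, counterclockwise, and hold equally likely, is an assumption for simplicity) illustrates both the mirrored moves, which preserve $x + y$ modulo $n$, and the special rule at a shared edge, where attempted jumps are made singly.

```python
import random

random.seed(3)
n = 16

def coupled_step(x, y):
    # Reflection coupling for the lazy walk (eps = 1/3) on the n-cycle.
    u = random.randrange(3)
    if x == y:                            # already coupled: move together
        return ((x + [1, -1, 0][u]) % n,) * 2
    if (x + 1) % n == y or (y + 1) % n == x:
        # adjacent pair: jumps across the shared edge occur singly,
        # so the first such jump couples the chains
        s = 1 if (x + 1) % n == y else -1
        a, b = [(x + s, y), (x, y - s), (x - s, y + s)][u]
        return a % n, b % n
    # mirrored moves: X and Y jump in opposite directions (or both hold)
    a, b = [(x + 1, y - 1), (x - 1, y + 1), (x, y)][u]
    return a % n, b % n

x, y = 0, n // 2
T = None
for t in range(1, 100000):
    nx, ny = coupled_step(x, y)
    if x != y and (x + 1) % n != y and (y + 1) % n != x:
        # away from the mirror edge, reflection preserves x + y mod n
        assert (nx + ny) % n == (x + y) % n
    x, y = nx, ny
    if x == y:
        T = t
        break
assert T is not None
print("reflection coupling met after", T, "steps")
```

Each marginal is the lazy walk (up, down, hold with probability $1/3$ each), so this is a legitimate Markov coupling in the sense of (12.4).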
As in section 12.1.7 we take an $n$-card deck; here we define a (lazy) shuffle by:
With probability $1/2$ make no change; else pick a uniform random position $i$ and interchange the cards in positions $i$ and $i+1$ (interpret $n + 1$ as $1$).
To study this by coupling, consider two decks. In some positions the decks match (the label on the card in position $i$ is the same in both decks). Write $M$ for the set of $i$ such that either position $i$ or position $i+1$ or both match. A step of the coupled chain uses the same position $i$ in both decks, with the hold/interchange choices in the two decks coordinated via $M$ so that properties (a) and (b) below hold.
Consider a particular card $\ell$. From the coupling description we see:
(a) if the card gets matched then it stays matched;
(b) while unmatched, at each step the card can move in at most one of the decks.
It follows that the “clockwise” distance between the positions of card $\ell$ in the two decks behaves exactly as a lazy random walk on the $n$-cycle, until that distance hits $0$. By the elementary formula for mean hitting times on the cycle (Chapter 5 eq. (24)) (yyy 4/23/96 version), the mean time $E T_\ell$ until card $\ell$ becomes matched satisfies
$$\max_\ell E T_\ell = O(n^3)$$
uniformly over initial configurations.
uniformly over initial configurations. By submultiplicativity (Chapter 2 section 4.3) (yyy 9/10/99 version)
The chains couple at time and so
In particular
In this example it turns out that coupling does give the correct order of magnitude; the corresponding lower bound
$$\tau_1^{\mathrm{disc}} = \Omega(n^3 \log n)$$
was proved by Wilson [338]. Different generalizations of this example appear in section 12.1.13 and in Chapter 14 section 5 (yyy 3/10/94 version), where we discuss relaxation times.
A generalization of this example, the interchange process, is studied in Chapter 14 section 5 (yyy 3/10/94 version).
Fix a graph on $n$ vertices with maximal degree $r$. An independent set is a set of vertices which does not contain any adjacent vertices. Fix $k$ and consider the space $S$ of all independent sets of size $k$ in the graph. Picture an independent set as a configuration of $k$ particles at distinct vertices, with no two particles at adjacent vertices. A natural discrete-time chain on $S$ is:
pick a uniform random particle and a uniform random vertex $v$; move the particle to vertex $v$ if feasible, else make no move.
To study mixing times, we can define a coupling by simply making the same choice of (particle, vertex) in each of the two coupled chains, where at each time we invoke a matching of particles in the two realizations which is arbitrary except for matching particles at the same vertex. To analyze the coupling, let $\rho(x, y)$ be the natural metric on $S$: the number of vertices occupied by particles of $x$ but not by particles of $y$. Clearly $\rho$ can change by at most $1$ on each step. Let us show that, for initial states $(x, y)$ with $\rho(x, y) = \rho$,
$$P(\rho(X_1, Y_1) = \rho + 1) \le \frac{k - \rho}{k} \cdot \frac{2\rho\,(r+1)}{n} \qquad (12.11)$$
$$P(\rho(X_1, Y_1) = \rho - 1) \ge \frac{\rho}{k} \cdot \frac{n - 2(k-1)(r+1)}{n}. \qquad (12.12)$$
For in order that $\rho$ increase we must first choose a matched particle (chance $\frac{k-\rho}{k}$) and then choose a vertex $v$ which is a neighbor of (or the same as) some vertex which is in exactly one of $x, y$: there are $2\rho$ such vertices and hence at most $2\rho\,(r+1)$ possibilities for $v$. This establishes (12.11). Similarly, in order that $\rho$ decrease it is sufficient that we pick an unmatched particle (chance $\frac{\rho}{k}$) and then choose a vertex $v$ which is not a neighbor of (or the same as) any vertex which is occupied in one or both realizations by some particle other than the chosen one: there are at most $2(k-1)$ such vertices and hence at most $2(k-1)(r+1)$ forbidden positions for $v$. This establishes (12.12).
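The sketch below (Python; the $12$-cycle with $k = 3$ particles is a hypothetical instance) implements this coupling with a common (particle, vertex) proposal and the same-vertex matching convention, and checks along the way that the metric changes by at most one per step.

```python
import random

random.seed(4)
# Hard-core chain on independent sets of size k in a 12-cycle, coupled by
# moving matched particles toward the same proposed vertex.
n, k = 12, 3
nbr = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}

def feasible(occupied, p, v):
    # particle p may move to v if v and its neighbors are otherwise free
    others = occupied - {p}
    return v not in others and not (nbr[v] & others)

def match(x, y):
    # pair particles at the same vertex; pair the leftovers arbitrarily
    common = sorted(x & y)
    return list(zip(common, common)) + list(zip(sorted(x - y), sorted(y - x)))

x, y = {0, 4, 8}, {2, 6, 10}               # two independent sets of size 3
T = None
prev = len(x - y)
for t in range(1, 200000):
    (px, py), v = random.choice(match(x, y)), random.randrange(n)
    if feasible(x, px, v):
        x = (x - {px}) | {v}
    if feasible(y, py, v):
        y = (y - {py}) | {v}
    rho = len(x - y)
    assert abs(rho - prev) <= 1            # the metric moves by at most 1
    prev = rho
    if rho == 0:
        T = t
        break
assert T is not None and x == y
print("independent-set chains coupled after", T, "steps")
```

Since $|x| = |y| = k$, the quantity $|x \setminus y|$ is symmetric in the two realizations, as a metric should be.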
One way of motivating the study of Markov chains on combinatorial sets with uniform stationary distributions is as “base chains” for Markov chain Monte Carlo, that is, as chains to be modified into other chains designed to have some specified distribution as their stationary distribution. Here is a typical base chain underlying genetic algorithms.
Fix integers $L$ and $m$, with $m$ even. A state of the chain is a family of $m$ words $(x^1, \ldots, x^m)$, where each word is a binary $L$-tuple $x^i = (x^i_1, \ldots, x^i_L)$. A step of the chain is defined as follows.
Use a uniform random permutation of $\{1, \ldots, m\}$ to partition the words into $m/2$ pairs. Create a new pair $(y, y')$ from a pair $(x, x')$ by setting, independently for each coordinate $\ell$,
$$(y_\ell, y'_\ell) = \begin{cases} (x_\ell, x'_\ell) & \text{with probability } 1/2 \\ (x'_\ell, x_\ell) & \text{with probability } 1/2. \end{cases} \qquad (12.13)$$
Repeat independently for each of the $m/2$ pairs. The new state is the resulting family of $m$ words.
Associated with an initial state is a vector of column sums $(s_1, \ldots, s_L)$, where $s_\ell = \sum_i x^i_\ell$. These sums are preserved by the chain, so the proper state space is the space $S$ of families with the given column sums. The transition matrix is symmetric and so the chain is reversible with uniform stationary distribution on $S$.
To describe the coupling, first rephrase (12.13) in words as: “$(y_\ell, y'_\ell)$ is $(x_\ell, x'_\ell)$ in random order, either forwards or backwards”. Now specify the coupling as follows.
(i) Use the same random permutation for both chains.
(ii) For each pair and each coordinate $\ell$, in creating the new words from the old words, use the same choice of (forwards or backwards) in both chains, except when the two old letters appear in one order $(0,1)$ for one chain and in the opposite order $(1,0)$ for the other chain, in which case use opposite choices of (forwards, backwards) in the two chains.
To study the coupled processes , fix and consider the number of words in which the ’th letter is not matched in the two realizations. Suppose . Consider the creation of the first two new words in each chain. The only way that the number of matches changes is when we use opposite choices of (forwards, backwards) in the two chains, in which case two new matches are created. The chance that the ’th letter in the two chains is and in the ’th word and is and in the ’th word equals , and so (taking into account the symmetric case) the mean number of new matches at in these two words equals . Summing over the pairs,
We can now apply a comparison lemma (Chapter 2 Lemma 32) (yyy 9/10/99 version) which concludes that the hitting time of to satisfies
Since is a coupling time, a now-familiar argument shows that for
and so
Show .
We expect this bound by analogy with the “random transpositions” shuffle (section 12.1.7). Loosely speaking, the action of the chain on a single position in words is like the random transpositions chain speeded up by a factor , so from (12.8) we expect its mixing time to be . It would be interesting to study this example via the group representation or strong stationary time techniques which have proved successful for the random transpositions chain.
To make a metaphor involving biological genetics, the letters represent chromosomes and the words represent the chromosomal structure of a gamete; the process is “sexual reproduction from the viewpoint of gametes”. If instead we want a word to represent a particular chromosome and the letters to represent genes within that chromosome, then instead of flipping bits independently it is more natural to model crossover. That is, consider a chain in which the rule for creating a new pair $(y, y')$ from $(x, x')$ becomes:
Take $U$ uniform on $\{0, 1, \ldots, L\}$. Define
$$y = (x_1, \ldots, x_U, x'_{U+1}, \ldots, x'_L), \qquad y' = (x'_1, \ldots, x'_U, x_{U+1}, \ldots, x_L).$$
As an exercise (hint in Notes), find a coupling argument to show that for this chain
(12.14) |
In certain complicated settings it is useful to know that it is enough to couple versions of the chain which start in “nearby” states.
To say this carefully, let $S$ be finite and consider a $(0,\infty)$-valued function $\rho$ defined on some symmetric subset $E \subseteq S \times S$. Call $\rho$ a pre-metric if
(i) $\rho(x, y) = \rho(y, x)$;
(ii) $\rho(x, y) \ge 1$;
(iii) $\rho(x, z) \le \rho(x, y) + \rho(y, z)$,
whenever $(x,z)$ and each of $(x,y), (y,z)$ are in $E$.
Clearly a pre-metric extends to a metric $\bar\rho$ by defining
$$\bar\rho(x, y) := \min \sum_i \rho(x_i, x_{i+1}), \qquad (12.15)$$
the minimum over all paths $x = x_0, x_1, \ldots, x_k = y$ with each $(x_i, x_{i+1}) \in E$. Note $\bar\rho = \rho$ for pairs in $E$.
Let $S$ be a state space. Let $\mu_1, \ldots, \mu_k$ be probability distributions on $S \times S$ such that the second marginal of $\mu_i$ coincides with the first marginal of $\mu_{i+1}$, for $1 \le i \le k - 1$. Then there exists an $S$-valued random sequence $(W_0, W_1, \ldots, W_k)$ such that $(W_{i-1}, W_i)$ has distribution $\mu_i$ for each $1 \le i \le k$.
Proof. Just take $(W_i)$ to be the non-homogeneous Markov chain whose transition probabilities are the conditional probabilities determined by the specified joint distributions $\mu_i$.
Take a discrete-time Markov chain with finite state space $S$. Write $X_1(x)$ for the time-$1$ value of the chain started at state $x$. Let $\rho$ be a pre-metric defined on some subset $E$. Suppose that for each pair $(x, y)$ in $E$ we can construct a joint law $(X_1(x), X_1(y))$ such that
$$E\,\bar\rho(X_1(x), X_1(y)) \le (1 - \kappa)\,\rho(x, y) \qquad (12.16)$$
for some constant $0 < \kappa < 1$. Then
$$\bar d(t) \le \Delta\,(1 - \kappa)^t \qquad (12.17)$$
where $\Delta := \max_{x,y} \bar\rho(x, y)$.
See the Notes for comments on the case $\kappa = 0$.
Proof. Fix states $x, y$ and consider a path $x = x_0, x_1, \ldots, x_k = y$ attaining the minimum in (12.15). For each $i$ let $(X_1(x_{i-1}), X_1(x_i))$ have a joint distribution satisfying (12.16). By Lemma 12.6 there exists a random sequence $(X_1(x_0), \ldots, X_1(x_k))$ consistent with these bivariate distributions. In particular, there is a joint distribution $(X_1(x), X_1(y))$ such that
$$E\,\bar\rho(X_1(x), X_1(y)) \le \sum_{i=1}^{k} E\,\bar\rho(X_1(x_{i-1}), X_1(x_i)) \le (1 - \kappa) \sum_{i=1}^{k} \rho(x_{i-1}, x_i) = (1 - \kappa)\,\bar\rho(x, y).$$
This construction gives one step of a coupling of two copies of the chain started at arbitrary states, and so extends to a coupling of two copies of the entire processes. The inequality above implies
$$E\,\bar\rho(X_t(x), X_t(y)) \le (1 - \kappa)^t\,\bar\rho(x, y)$$
and hence, since $\bar\rho \ge 1$ off the diagonal,
$$\|P_x(X_t \in \cdot) - P_y(X_t \in \cdot)\| \le P(X_t(x) \ne X_t(y)) \le E\,\bar\rho(X_t(x), X_t(y)) \le \Delta\,(1 - \kappa)^t,$$
establishing (12.17).
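As a sanity check on Lemma 12.7, the sketch below (Python with NumPy; the coordinate-refresh chain on $\{0,1\}^d$ is an illustration, not taken from the text) compares exact variation distance with the path-coupling bound: adjacent states (one differing coordinate, pre-metric $\rho = 1$) couple unless a different coordinate is refreshed, giving $\kappa = 1/d$ and $\Delta = d$.

```python
import itertools
import numpy as np

# "Random coordinate refresh" chain on {0,1}^d: a step picks a uniform
# coordinate and re-randomizes it.  For adjacent states, using the same
# coordinate and the same new bit gives E rho-bar <= (1 - 1/d) rho, so
# Lemma 12.7 predicts d-bar(t) <= d (1 - 1/d)^t.
d = 4
states = list(itertools.product([0, 1], repeat=d))
idx = {s: i for i, s in enumerate(states)}
P = np.zeros((len(states), len(states)))
for s in states:
    for i in range(d):
        for b in (0, 1):
            u = s[:i] + (b,) + s[i + 1:]
            P[idx[s], idx[u]] += 1.0 / (2 * d)   # includes the self-loop b = s_i

Pt = np.eye(len(states))
for t in range(1, 30):
    Pt = Pt @ P
    dbar = max(0.5 * np.abs(Pt[a] - Pt[b]).sum()
               for a in range(len(states)) for b in range(len(states)))
    assert dbar <= d * (1 - 1.0 / d) ** t + 1e-12   # bound (12.17)
print("path-coupling bound (12.17) verified for d =", d)
```

The point of the lemma is visible here: the one-step coupling is only ever specified for pairs differing in a single coordinate, yet the conclusion bounds $\bar d(t)$ over arbitrary pairs.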
Bubley and Dyer [79] introduced Lemma 12.7 and the name path-coupling. It has proved useful in extending the range of applicability of coupling methods in settings such as graph-coloring (Bubley et al [77], Vigoda [333]) and independent sets (Luby and Vigoda [244]). These are too intricate for presentation here, but the following example will serve to illustrate the use of path-coupling.
Fix a partial order $\preceq$ on an $n$-element set, and let $S$ be the set of linear extensions of $\preceq$, that is to say total orders consistent with the given partial order. We can define a discrete-time Markov chain on $S$ by re-using the idea in the “random adjacent transpositions” example (section 12.1.9). Let $(p_1, \ldots, p_{n-1})$ be a probability distribution on $\{1, \ldots, n-1\}$. Define a step of the chain as follows:
Pick position $i$ with probability $p_i$, and independently pick one of {stay, move} with probability $1/2$ each. If “move” is picked then interchange the elements in positions $i$ and $i+1$ if feasible (i.e. if consistent with the partial order); else make no change.
The transition matrix is symmetric, so the stationary distribution is uniform on .
To analyze by coupling, define one step of a bivariate coupled process as follows.
Make the same choice of $i$ in both chains. Also make the same choice of {move, stay}, except in the case where the elements in positions $i$ and $i+1$ are the same elements in opposite order in the two realizations, in which case use the opposite choices of {stay, move}.
The coupling is similar (but not identical) to that in section 12.1.9, where the underlying chain is that corresponding to the “null” partial order. For a general partial order, the coupling started from an arbitrary pair of states seems hard to analyze directly. For instance, an element in the same position in both realizations at time $t$ may not remain so at time $t+1$. Instead we use path-coupling, following an argument of Bubley and Dyer [80]. Call two states $x$ and $y$ adjacent if they differ by only one (not necessarily adjacent) transposition; if the transposed elements are in positions $i < j$ then let $\rho(x, y) = j - i$. We want to study the increment of $\bar\rho$ where the coupled chain makes one step from such an adjacent pair. The diagram shows a typical pair of adjacent states.
Observe first that any choice of position other than will have no effect on . If position and “move” are chosen, then are interchanged in the first chain and in the second; both lead to feasible configurations by examining the relative orders in the other chain’s previous configuration. This has chance and leads to . If position and “move” are chosen (chance ), then if either or both moves are feasible , while if neither are feasible then . Arguing similarly for choices leads to
This estimate remains true if because in that case choosing position (chance ) always creates a match. Now specify
and then . This leads to
for adjacent . We are thus in the setting of Lemma 12.7, which shows
Since and we obtain