A major theme of this book is quantifying the convergence theorem (Theorem 2.2) to give estimates of how close the distribution of a chain is to the stationary distribution at finite times. Such quantifications require some explicit choice of “distance” between distributions, and two of the simplest choices are explained in this section. We illustrate with a trivial
Example 2.18 (Rain or shine?)
Suppose the true probability of rain tomorrow is 80% whereas we think the probability is 70%. How far off are we? In other words, what is the “distance” between $\theta$ and $\tilde{\theta}$, where

$$\theta(\text{rain}) = 0.8,\quad \theta(\text{shine}) = 0.2; \qquad \tilde{\theta}(\text{rain}) = 0.7,\quad \tilde{\theta}(\text{shine}) = 0.3\,?$$
Different notions of distance will give different numerical answers. Our first notion abstracts the idea that the “additive error” in this example is $0.1$.
Perhaps the simplest notion of distance between probability distributions $\theta_1, \theta_2$ on a finite set $I$ is variation distance, defined as

$$\|\theta_1 - \theta_2\| \equiv \max_{A \subseteq I} |\theta_1(A) - \theta_2(A)|.$$
So variation distance is just the maximum additive error one can make, in using the “wrong” distribution to evaluate the probability of an event. In Example 2.18, variation distance is $0.1$. Several equivalent definitions are provided by
Lemma 2.19. For probability distributions $\theta_1, \theta_2$ on a finite state space $I$,

$$\tfrac{1}{2}\sum_i |\theta_1(i) - \theta_2(i)| \;=\; \sum_i (\theta_1(i) - \theta_2(i))^+ \;=\; \sum_i (\theta_2(i) - \theta_1(i))^+ \;=\; 1 - \sum_i \min(\theta_1(i), \theta_2(i)) \;=\; \|\theta_1 - \theta_2\| \;=\; \min P(V_1 \neq V_2),$$

the minimum taken over random pairs $(V_1, V_2)$ such that $V_m$ has distribution $\theta_m$ ($m = 1, 2$). So each of these quantities equals the variation distance $\|\theta_1 - \theta_2\|$.
Proof. The first three equalities are clear. For the fourth, set $A = \{i : \theta_1(i) > \theta_2(i)\}$. Then for any $B \subseteq I$,

$$\theta_1(B) - \theta_2(B) \;\le\; \sum_{i \in B} (\theta_1(i) - \theta_2(i))^+ \;\le\; \sum_i (\theta_1(i) - \theta_2(i))^+ \;=\; 1 - \sum_i \min(\theta_1(i), \theta_2(i)),$$

with equality when $B = A$. This, and the symmetric form with $\theta_1$ and $\theta_2$ interchanged, establish the fourth equality. In the final equality, the “$\le$” follows from

$$\theta_1(B) - \theta_2(B) \;=\; P(V_1 \in B) - P(V_2 \in B) \;\le\; P(V_1 \in B,\, V_2 \notin B) \;\le\; P(V_1 \neq V_2), \quad B \subseteq I.$$

And equality is attained by the following joint distribution. Let $a_i = \min(\theta_1(i), \theta_2(i))$ and let

$$P(V_1 = i,\, V_2 = j) \;=\; a_i\, 1_{(i = j)} \;+\; \frac{(\theta_1(i) - a_i)(\theta_2(j) - a_j)}{1 - \sum_k a_k}.$$

(If the denominator is zero, then $\theta_1 = \theta_2$ and the result is trivial.)
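For readers who like to experiment, here is a small Python sketch (using numpy, with two hypothetical distributions on a four-point space) that evaluates each quantity appearing in Lemma 2.19 and builds the maximal coupling from the proof; all six numbers agree up to rounding.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical distributions on a 4-point state space.
theta1 = rng.dirichlet(np.ones(4))
theta2 = rng.dirichlet(np.ones(4))
n = len(theta1)

# Definition: max over subsets A of |theta1(A) - theta2(A)|.
max_over_sets = max(
    abs(theta1[list(A)].sum() - theta2[list(A)].sum())
    for r in range(n + 1)
    for A in itertools.combinations(range(n), r)
)

half_l1       = 0.5 * np.abs(theta1 - theta2).sum()
pos_part      = np.maximum(theta1 - theta2, 0).sum()
neg_part      = np.maximum(theta2 - theta1, 0).sum()
one_minus_min = 1.0 - np.minimum(theta1, theta2).sum()

# Maximal coupling from the proof:
# P(V1=i, V2=j) = a_i 1{i=j} + (theta1_i - a_i)(theta2_j - a_j) / (1 - sum_k a_k)
# (the denominator is nonzero here because theta1 != theta2).
a = np.minimum(theta1, theta2)
joint = np.diag(a) + np.outer(theta1 - a, theta2 - a) / (1.0 - a.sum())
prob_unequal = 1.0 - np.trace(joint)   # P(V1 != V2) under this coupling

print(max_over_sets, half_l1, pos_part, neg_part, one_minus_min, prob_unequal)
```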
In the context of Markov chains we may use

$$d_i(t) \equiv \| P_i(X_t \in \cdot) - \pi(\cdot) \| \qquad (2.13)$$

as a measure of deviation from stationarity at time $t$, for the chain started at state $i$. Also define

$$d(t) \equiv \max_i d_i(t) \qquad (2.14)$$

as the worst-case deviation from stationarity. Finally, it is technically convenient to introduce also

$$\bar{d}(t) \equiv \max_{i,j} \| P_i(X_t \in \cdot) - P_j(X_t \in \cdot) \|. \qquad (2.15)$$
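The quantities (2.13)–(2.15) are straightforward to compute for small chains. Here is a minimal Python sketch, for a hypothetical irreducible 3-state transition matrix, computing them directly from matrix powers.

```python
import numpy as np

def variation_distance(p, q):
    """Variation distance ||p - q|| = (1/2) sum_i |p_i - q_i|  (Lemma 2.19)."""
    return 0.5 * np.abs(p - q).sum()

# A hypothetical irreducible 3-state transition matrix.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

def d_i(t, i):
    """d_i(t) = || P_i(X_t in .) - pi ||, definition (2.13)."""
    return variation_distance(np.linalg.matrix_power(P, t)[i], pi)

def d(t):
    """d(t) = max_i d_i(t), definition (2.14)."""
    return max(d_i(t, i) for i in range(len(P)))

def dbar(t):
    """dbar(t) = max_{i,j} || P_i(X_t in .) - P_j(X_t in .) ||, definition (2.15)."""
    Pt = np.linalg.matrix_power(P, t)
    n = len(P)
    return max(variation_distance(Pt[i], Pt[j]) for i in range(n) for j in range(n))

for t in range(6):
    print(t, d(t), dbar(t))   # both decrease with t (cf. Lemma 2.20(d))
```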
In Chapter 4 we discuss, for reversible chains, relations between these “variation distance” notions and other measures of closeness-to-stationarity, and discuss parameters measuring “time until $d(t)$ becomes small” and their relation to other parameters of the chain. For now, let’s just introduce a fundamental technical fact, the submultiplicativity property.
Lemma 2.20.

(a) $\bar{d}(s+t) \le \bar{d}(s)\,\bar{d}(t)$ for all $s, t \ge 0$ [the submultiplicativity property].

(b) $d(s+t) \le 2\,d(s)\,d(t)$ for all $s, t \ge 0$.

(c) $d(t) \le \bar{d}(t) \le 2\,d(t)$ for all $t \ge 0$.

(d) $\bar{d}(t)$ and $d(t)$ decrease as $t$ increases.
Proof. We use the characterization of variation distance as

$$\|\theta_1 - \theta_2\| = \min P(V_1 \neq V_2), \qquad (2.16)$$

the minimum taken over random pairs $(V_1, V_2)$ such that $V_m$ has distribution $\theta_m$ ($m = 1, 2$).
Fix states $i, j$ and times $s, t$, and let $X$, $Y$ denote the chains started at $i$, $j$ respectively. By (2.16) we can construct a joint distribution for $(X_s, Y_s)$ such that

$$P(X_s \neq Y_s) = \| P_i(X_s \in \cdot) - P_j(X_s \in \cdot) \| \le \bar{d}(s).$$
Now for each pair $(k, l)$, we can use (2.16) to construct a joint distribution for $(X_{s+t}, Y_{s+t})$ given $(X_s, Y_s) = (k, l)$, with the property that

$$P(X_{s+t} \neq Y_{s+t} \mid X_s = k,\, Y_s = l) = \| P_k(X_t \in \cdot) - P_l(X_t \in \cdot) \|.$$
The right side is $0$ if $k = l$, and otherwise is at most $\bar{d}(t)$. So unconditionally

$$P(X_{s+t} \neq Y_{s+t}) \le \bar{d}(s)\,\bar{d}(t),$$
and (2.16) establishes part (a) of the lemma. For part (b), the same argument (with $Y$ now being the stationary chain) shows
$$d(s+t) \le d(s)\,\bar{d}(t), \qquad (2.17)$$
so that (b) will follow from the upper bound in (c). But this upper bound is clear from the triangle inequality for variation distance. And the lower bound in (c) follows from the fact that $\|\cdot\|$ is a convex function, so that averaging over $j$ with respect to $\pi$ in (2.15) can only decrease distance. Finally, the “decreasing” property for $\bar{d}$ follows from (a), and for $d$ follows from (2.17), since variation distance is always at most $1$.
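As a numerical sanity check, the following Python sketch spot-checks parts (a)–(c) of Lemma 2.20 on a hypothetical random 5-state chain (with a small tolerance for floating-point error).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical random 5-state transition matrix (rows normalized).
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)

# Stationary distribution via the left eigenvector for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def d(t):
    return max(tv(row, pi) for row in np.linalg.matrix_power(P, t))

def dbar(t):
    Pt = np.linalg.matrix_power(P, t)
    return max(tv(Pt[i], Pt[j]) for i in range(len(P)) for j in range(len(P)))

for s in range(1, 5):
    for t in range(1, 5):
        assert dbar(s + t) <= dbar(s) * dbar(t) + 1e-12   # Lemma 2.20(a)
        assert d(s + t) <= 2 * d(s) * d(t) + 1e-12        # Lemma 2.20(b)
for t in range(1, 8):
    assert d(t) <= dbar(t) <= 2 * d(t) + 1e-12            # Lemma 2.20(c)
print("Lemma 2.20(a)-(c) hold on this example")
```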
The assertions of this section hold in either discrete or continuous time. But note that the numerical value of $d(t)$ changes when we switch from a discrete-time chain to the continuized chain. In particular, for a discrete-time chain with period $2$ we have $d(t) \to 1/2$ as $t \to \infty$ (which incidentally implies, taking $s = t \to \infty$, that the factor $2$ in Lemma 2.20(b) cannot be reduced) whereas for the continuized chain $d(t) \to 0$.
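The contrast between a periodic chain and its continuization can be seen numerically. The sketch below uses the deterministic two-state flip chain (period $2$, the simplest case) and scipy's matrix exponential for the continuized chain; the discrete-time value of $d(t)$ stays at $1/2$ while the continuized value tends to $0$.

```python
import numpy as np
from scipy.linalg import expm

# Deterministic two-state flip chain: period 2, uniform stationary distribution.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
pi = np.array([0.5, 0.5])

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

for t in range(1, 6):
    # Discrete time: the chain started at state 0 sits on a single point, so d(t) = 1/2.
    d_discrete = tv(np.linalg.matrix_power(P, t)[0], pi)
    # Continuized chain: transition matrix exp(t(P - I)); here d(t) = (1/2) e^{-2t} -> 0.
    d_cont = tv(expm(t * (P - np.eye(2)))[0], pi)
    print(t, d_discrete, d_cont)
```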
One often sees slightly disguised corollaries of the submultiplicativity property in the literature. The following is a typical one.
Corollary 2.21. Suppose there exist a probability measure $\nu$ on $I$, a real $\delta > 0$ and a time $t_0$ such that

$$P_i(X_{t_0} = j) \ge \delta\, \nu(j) \quad \text{for all } i, j \in I.$$

Then

$$\bar{d}(t) \le (1 - \delta)^{\lfloor t/t_0 \rfloor}, \quad t \ge 0.$$
Proof. The hypothesis implies $\bar{d}(t_0) \le 1 - \delta$, by the equality $\|\theta_1 - \theta_2\| = 1 - \sum_i \min(\theta_1(i), \theta_2(i))$ in Lemma 2.19, and then the conclusion follows by submultiplicativity and the monotonicity of $\bar{d}$ (Lemma 2.20(a,d)).
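A minimal numerical illustration of the corollary, for a hypothetical 4-state chain: one convenient way to extract a minorization $P_i(X_1 = j) \ge \delta\,\nu(j)$ is to take $\nu$ proportional to the column minima of the transition matrix and $\delta$ equal to their total mass, and then check the resulting geometric bound on $\bar{d}(t)$.

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)    # hypothetical 4-state chain

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def dbar(t):
    Pt = np.linalg.matrix_power(P, t)
    return max(tv(Pt[i], Pt[j]) for i in range(len(P)) for j in range(len(P)))

# Minorization at t0 = 1: P_i(X_1 = j) >= delta * nu(j) for all i, j,
# with nu proportional to the column-wise minima and delta their total mass.
t0 = 1
col_min = P.min(axis=0)
delta = col_min.sum()
nu = col_min / delta

for t in range(1, 10):
    assert dbar(t) <= (1 - delta) ** (t // t0) + 1e-12   # Corollary 2.21
print("d_bar(t) <= (1 - delta)^floor(t/t0) with delta =", delta)
```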
Another notion of distance, which is less intuitively natural but often more mathematically tractable, is $L^2$ distance. This is defined with respect to some fixed reference probability distribution $\pi$ on $I$, which for our purposes will be the stationary distribution of some irreducible chain under consideration (and so $\pi_i > 0$ for all $i$). The $L^2$ norm of a function $f: I \to \mathbb{R}$ is

$$\|f\|_2 \equiv \Big( \sum_i \pi_i f^2(i) \Big)^{1/2}. \qquad (2.18)$$
We define the $L^2$ norm of a signed measure $\nu$ on $I$ by

$$\|\nu\|_2 \equiv \Big( \sum_i \nu_i^2 / \pi_i \Big)^{1/2}. \qquad (2.19)$$
This may look confusing, because a signed measure $\nu$ and a function $f$ are in a sense “the same thing”, being determined by values $\nu_i$ or $f(i)$ which can be chosen arbitrarily. But the measure $\nu$ can also be determined by its density function $f(i) \equiv \nu_i / \pi_i$, and so (2.18) and (2.19) say that the $L^2$ norm of a signed measure is defined to be the $L^2$ norm of its density function.
So $\|\theta_1 - \theta_2\|_2$ is the “$L^2$” measure of distance between probability distributions $\theta_1, \theta_2$. In particular, the $L^2$ distance between $\theta$ and the reference distribution $\pi$ is

$$\|\theta - \pi\|_2 = \Big( \sum_i (\theta_i - \pi_i)^2 / \pi_i \Big)^{1/2} = \Big( \sum_i \theta_i^2 / \pi_i \; - \; 1 \Big)^{1/2}.$$
In Example 2.18, taking the true distribution $\theta$ as the reference distribution, we find $\|\tilde{\theta} - \theta\|_2 = \big( (0.1)^2/0.8 + (0.1)^2/0.2 \big)^{1/2} = 0.25$.
Writing $\theta_t$ for the distribution at time $t$ of a chain with stationary distribution $\pi$, it is true (cf. Lemma 2.20(d) for variation distance) that $\|\theta_t - \pi\|_2$ is decreasing in $t$. Since there is a more instructive proof in the reversible case (Chapter 3 Lemma 23) (yyy 9/2/94 version) we won’t prove the general case (see Notes).
Analogous to the $L^2$ norms are the $L^1$ norms

$$\|f\|_1 \equiv \sum_i \pi_i |f(i)|, \qquad \|\nu\|_1 \equiv \sum_i |\nu_i|.$$

The Cauchy-Schwarz inequality gives $\|f\|_1 \le \|f\|_2$, and hence $\|\nu\|_1 \le \|\nu\|_2$. Note that in the definition of $\|\nu\|_1$ the reference measure $\pi$ has “cancelled out”. Lemma 2.19 shows that for probability measures the $L^1$ distance is the same as variation distance, up to a factor of $2$:

$$\|\theta_1 - \theta_2\|_1 = 2\, \|\theta_1 - \theta_2\|.$$
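The arithmetic for Example 2.18, together with the relations $\|\theta_1 - \theta_2\|_1 = 2\|\theta_1 - \theta_2\| \le \|\theta_1 - \theta_2\|_2$, can be checked in a few lines of Python (taking the true distribution $\theta$ as the reference distribution, as in the computation above).

```python
import numpy as np

# Example 2.18: true distribution theta (used as reference) and our guess theta_tilde.
theta       = np.array([0.8, 0.2])   # rain, shine
theta_tilde = np.array([0.7, 0.3])

variation = 0.5 * np.abs(theta_tilde - theta).sum()                 # 0.1
l1_norm   = np.abs(theta_tilde - theta).sum()                       # 0.2 = 2 * variation
l2_norm   = np.sqrt(((theta_tilde - theta) ** 2 / theta).sum())     # 0.25

print(variation, l1_norm, l2_norm)
assert abs(l1_norm - 2 * variation) < 1e-12   # L^1 distance = twice variation distance
assert l1_norm <= l2_norm + 1e-12             # Cauchy-Schwarz
```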
As a trivial example in the Markov chain setting, consider

Example 2.22 (A two-state chain.)

Take $I = \{0, 1\}$, fix a parameter $0 < a < 1$ and define a transition matrix

$$\mathbf{P} = \begin{pmatrix} 1 - a & a \\ a & 1 - a \end{pmatrix}.$$

In this example the $t$-step transition probabilities are

$$p^{(t)}_{ii} = \tfrac{1}{2}\big(1 + (1 - 2a)^t\big), \qquad p^{(t)}_{ij} = \tfrac{1}{2}\big(1 - (1 - 2a)^t\big), \quad j \neq i,$$

and the stationary distribution is uniform. We calculate (for arbitrary initial state $i$)

$$d_i(t) = \tfrac{1}{2}\, |1 - 2a|^t, \qquad \| P_i(X_t \in \cdot) - \pi \|_2 = |1 - 2a|^t,$$

so here the $L^2$ distance is exactly twice the variation distance.
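A short sketch verifying the closed forms claimed in Example 2.22 for one (arbitrary) value of the parameter $a$:

```python
import numpy as np

a = 0.3    # the parameter 0 < a < 1 of Example 2.22; any value in (0, 1) works
P = np.array([[1 - a, a],
              [a, 1 - a]])
pi = np.array([0.5, 0.5])

for t in range(6):
    Pt = np.linalg.matrix_power(P, t)
    # t-step transition probabilities claimed in the example
    p_same = 0.5 * (1 + (1 - 2 * a) ** t)
    assert np.allclose(Pt[0, 0], p_same) and np.allclose(Pt[1, 1], p_same)
    # variation and L^2 distances from stationarity
    d_t  = 0.5 * np.abs(Pt[0] - pi).sum()
    l2_t = np.sqrt(((Pt[0] - pi) ** 2 / pi).sum())
    assert np.allclose(d_t, 0.5 * abs(1 - 2 * a) ** t)
    assert np.allclose(l2_t, abs(1 - 2 * a) ** t)
print("closed forms check out")
```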
The submultiplicative property of $\bar{d}(t)$ is one instance of a general principle:
because our state space is finite, many quantities which converge to zero as $t \to \infty$ must in fact converge exponentially fast, because a worst-case bound over initial states can be iterated.
Here’s another instance: tails of hitting time distributions.
Consider the first hitting time $T_A$ on a subset $A$ of states. Define $t^* \equiv \max_i E_i T_A$. For any initial distribution $\mu$, any time $s > 0$ and any integer $m \ge 1$, the Markov property gives

$$P_\mu(T_A > ms) \le P_\mu(T_A > (m-1)s)\; \max_i P_i(T_A > s).$$

So by induction on $m$

$$P_\mu(T_A > ms) \le \Big( \max_i P_i(T_A > s) \Big)^m,$$

implying, via Markov’s inequality $\max_i P_i(T_A > s) \le t^*/s$,

$$P_\mu(T_A > ms) \le (t^*/s)^m.$$

In continuous time, a good (asymptotically optimal) choice of $s$ is $e\, t^*$, giving the exponential tail bound

$$P_\mu(T_A > t) \le \exp\Big(1 - \frac{t}{e\, t^*}\Big), \quad t \ge 0. \qquad (2.20)$$
A messier bound holds in discrete time, where we have to choose $s$ to be an integer.
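The discrete-time version of the iteration is easy to check numerically. The sketch below, for a hypothetical 5-state chain, computes $P_i(T_A > t)$ and $t^* = \max_i E_i T_A$ exactly (by restricting the transition matrix to the complement of $A$) and verifies $P_i(T_A > ms) \le \big(\max_i P_i(T_A > s)\big)^m \le (t^*/s)^m$ for an integer choice of $s$.

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)      # hypothetical 5-state chain

A = [0]                                # target subset
B = [i for i in range(len(P)) if i not in A]
Q = P[np.ix_(B, B)]                    # transitions restricted to the complement of A

def tail(i_in_B, t):
    """P_i(T_A > t) for a start i outside A: row sum of Q^t."""
    return np.linalg.matrix_power(Q, t)[i_in_B].sum()

# t* = max_i E_i T_A, with E_i T_A = sum_{t>=0} P_i(T_A > t) = ((I - Q)^{-1} 1)_i.
mean_hitting_times = np.linalg.solve(np.eye(len(B)) - Q, np.ones(len(B)))
t_star = mean_hitting_times.max()

s = 2                                  # an integer "block length"
worst = max(tail(i, s) for i in range(len(B)))
assert worst <= t_star / s + 1e-12     # Markov's inequality
for m in range(1, 6):
    for i in range(len(B)):
        assert tail(i, m * s) <= worst ** m + 1e-12   # the iterated worst-case bound
print("P_i(T_A > m s) <= (t*/s)^m verified, with t* =", t_star)
```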