2 General Markov Chains (September 10, 1999)

2.4 Two metrics on distributions

A major theme of this book is quantifying the convergence theorem (Theorem 2.2) to give estimates of how close the distribution of a chain is to the stationary distribution at finite times. Such quantifications require some explicit choice of “distance” between distributions, and two of the simplest choices are explained in this section. We illustrate with a trivial

Example 2.18

Rain or shine?

Suppose the true probability of rain tomorrow is 80% whereas we think the probability is 70%. How far off are we? In other words, what is the “distance” between $\pi$ and $\theta$, where

\pi(\mbox{rain})=0.8,\ \pi(\mbox{shine})=0.2
\theta(\mbox{rain})=0.7,\ \theta(\mbox{shine})=0.3.

Different notions of distance will give different numerical answers. Our first notion abstracts the idea that the “additive error” in this example is $0.8-0.7=0.1$.

2.4.1 Variation distance

Perhaps the simplest notion of distance between probability distributions is variation distance, defined as

||\theta_{1}-\theta_{2}||=\max_{A\subseteq I}|\theta_{1}(A)-\theta_{2}(A)|.

So variation distance is just the maximum additive error one can make, in using the “wrong” distribution to evaluate the probability of an event. In Example 2.18 the variation distance is $0.1$. Several equivalent definitions are provided by

Lemma 2.19

For probability distributions $\theta_{1},\theta_{2}$ on a finite state space $I$,

\frac{1}{2}\sum_{i}|\theta_{1}(i)-\theta_{2}(i)| = \sum_{i}(\theta_{1}(i)-\theta_{2}(i))^{+}
= \sum_{i}(\theta_{1}(i)-\theta_{2}(i))^{-}
= 1-\sum_{i}\min(\theta_{1}(i),\theta_{2}(i))
= \max_{A\subseteq I}|\theta_{1}(A)-\theta_{2}(A)|
= \min P(V_{1}\neq V_{2})

the minimum taken over random pairs $(V_{1},V_{2})$ such that $V_{m}$ has distribution $\theta_{m}$ ($m=1,2$). So each of these quantities equals the variation distance $||\theta_{1}-\theta_{2}||$.

Proof. The first three equalities are clear. For the fourth, set $B=\{i:\theta_{1}(i)>\theta_{2}(i)\}$. Then

\theta_{1}(A)-\theta_{2}(A) = \sum_{i\in A}(\theta_{1}(i)-\theta_{2}(i))
\leq \sum_{i\in A\cap B}(\theta_{1}(i)-\theta_{2}(i))
\leq \sum_{i\in B}(\theta_{1}(i)-\theta_{2}(i))
= \sum_{i}(\theta_{1}(i)-\theta_{2}(i))^{+}

with equality when $A=B$. This, and the symmetric form, establish the fourth equality. In the final equality, the “$\leq$” follows from

|\theta_{1}(A)-\theta_{2}(A)|=|P(V_{1}\in A)-P(V_{2}\in A)|\leq P(V_{2}\neq V_{1}).

And equality is attained by the following joint distribution. Let $\theta(i)=\min(\theta_{1}(i),\theta_{2}(i))$ and let

P(V_{1}=i,V_{2}=i)=\theta(i)
P(V_{1}=i,V_{2}=j)=\frac{(\theta_{1}(i)-\theta(i))(\theta_{2}(j)-\theta(j))}{1-\sum_{k}\theta(k)},\ i\neq j.

(If the denominator is zero, then $\theta_{1}=\theta_{2}$ and the result is trivial.) $\Box$
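The equivalences in Lemma 2.19, and the coupling used in the proof, are easy to check numerically. The following Python sketch (ours, not part of the text) does so for Example 2.18; all five printed numbers should equal the variation distance $0.1$.

```python
import itertools
import numpy as np

# Example 2.18: pi is the "true" distribution, theta our guess (rain, shine).
pi = np.array([0.8, 0.2])
theta = np.array([0.7, 0.3])

half_l1 = 0.5 * np.abs(pi - theta).sum()             # (1/2) sum_i |pi_i - theta_i|
pos_part = np.maximum(pi - theta, 0.0).sum()         # sum_i (pi_i - theta_i)^+
one_minus_min = 1.0 - np.minimum(pi, theta).sum()    # 1 - sum_i min(pi_i, theta_i)

# max over all subsets A of |pi(A) - theta(A)|
n = len(pi)
max_over_sets = max(abs(pi[list(A)].sum() - theta[list(A)].sum())
                    for r in range(n + 1)
                    for A in itertools.combinations(range(n), r))

# The coupling from the proof: diagonal mass min(pi_i, theta_i),
# off-diagonal mass proportional to the product of the "excesses".
m = np.minimum(pi, theta)
joint = np.diag(m)
if m.sum() < 1.0:
    joint = joint + np.outer(pi - m, theta - m) / (1.0 - m.sum())
prob_unequal = joint.sum() - np.trace(joint)         # P(V1 != V2)

print(half_l1, pos_part, one_minus_min, max_over_sets, prob_unequal)
```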

In the context of Markov chains we may use

d_{i}(t):=||P_{i}(X_{t}=\cdot)-\pi(\cdot)|| (2.13)

as a measure of deviation from stationarity at time $t$, for the chain started at state $i$. Also define

d(t):=\max_{i}d_{i}(t) (2.14)

as the worst-case deviation from stationarity. Finally, it is technically convenient to introduce also

\bar{d}(t)=\max_{ij}||P_{i}(X_{t}=\cdot)-P_{j}(X_{t}=\cdot)||. (2.15)
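For experimentation, here is a short Python sketch (our own; the chain and function names are not from the text) that computes $d_{i}(t)$, $d(t)$ and $\bar{d}(t)$ of (2.13)–(2.15) for a discrete-time chain with transition matrix `P`, using the first expression in Lemma 2.19 for the variation distance.

```python
import numpy as np

def variation_distance(mu, nu):
    """||mu - nu|| = (1/2) sum_i |mu_i - nu_i|."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

def stationary_distribution(P):
    """Stationary pi of an irreducible P: left eigenvector for eigenvalue 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def deviations(P, t):
    """Return (all d_i(t), d(t), dbar(t)) as in (2.13)-(2.15)."""
    Pt = np.linalg.matrix_power(P, t)
    pi = stationary_distribution(P)
    d_i = np.array([variation_distance(row, pi) for row in Pt])
    dbar = max(variation_distance(Pt[i], Pt[j])
               for i in range(len(P)) for j in range(len(P)))
    return d_i, d_i.max(), dbar

# Small illustrative chain (numbers chosen arbitrarily).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
for t in range(1, 6):
    _, d_t, dbar_t = deviations(P, t)
    print(t, round(d_t, 4), round(dbar_t, 4))   # check: d(t) <= dbar(t) <= 2 d(t)
```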

In Chapter 4 we discuss, for reversible chains, relations between these “variation distance” notions and other measures of closeness-to-stationarity, and discuss parameters $\tau$ measuring “time until $d(t)$ becomes small” and their relation to other parameters of the chain. For now, let’s just introduce a fundamental technical fact, the submultiplicativity property.

Lemma 2.20


(a) $\bar{d}(s+t)\leq\bar{d}(s)\bar{d}(t),\ s,t\geq 0$ [the submultiplicativity property].

(b) $d(s+t)\leq 2d(s)d(t),\ s,t\geq 0$.

(c) $d(t)\leq\bar{d}(t)\leq 2d(t),\ t\geq 0$.

(d) $d(t)$ and $\bar{d}(t)$ decrease as $t$ increases.

Proof. We use the characterization of variation distance as

||\theta_{1}-\theta_{2}||=\min P(V_{1}\neq V_{2}), (2.16)

the minimum taken over random pairs $(V_{1},V_{2})$ such that $V_{m}$ has distribution $\theta_{m}$ ($m=1,2$).

Fix states $i_{1},i_{2}$ and times $s,t$, and let ${\bf Y}^{1},{\bf Y}^{2}$ denote the chains started at $i_{1}$, $i_{2}$ respectively. By (2.16) we can construct a joint distribution for $(Y^{1}_{s},Y^{2}_{s})$ such that

P(Y^{1}_{s}\neq Y^{2}_{s}) = ||P_{i_{1}}(X_{s}=\cdot)-P_{i_{2}}(X_{s}=\cdot)||
\leq \bar{d}(s).

Now for each pair $(j_{1},j_{2})$, we can use (2.16) to construct a joint distribution for $(Y^{1}_{s+t},Y^{2}_{s+t})$ given $(Y^{1}_{s}=j_{1},Y^{2}_{s}=j_{2})$ with the property that

P(Y^{1}_{s+t}\neq Y^{2}_{s+t}|Y^{1}_{s}=j_{1},Y^{2}_{s}=j_{2})=||P_{j_{1}}(X_{t}=\cdot)-P_{j_{2}}(X_{t}=\cdot)||.

The right side is $0$ if $j_{1}=j_{2}$, and otherwise is at most $\bar{d}(t)$. So unconditionally

P(Y^{1}_{s+t}\neq Y^{2}_{s+t})\leq\bar{d}(s)\bar{d}(t)

and (2.16) establishes part (a) of the lemma. For part (b), the same argument (with ${\bf Y}^{2}$ now being the stationary chain) shows

d(s+t)\leq d(s)\bar{d}(t) (2.17)

so that (b) will follow from the upper bound $\bar{d}(t)\leq 2d(t)$ in (c). But this upper bound is clear from the triangle inequality for variation distance. And the lower bound in (c) follows from the fact that $\mu\rightarrow||\theta-\mu||$ is a convex function, so that averaging over $j$ with respect to $\pi$ in (2.15) can only decrease distance. Finally, the “decreasing” property for $\bar{d}(t)$ follows from (a), and for $d(t)$ follows from (2.17). $\Box$

The assertions of this section hold in either discrete or continuous time. But note that the numerical value of $d(t)$ changes when we switch from a discrete-time chain to the continuized chain. In particular, for a discrete-time chain with period $q$ we have $d(t)\rightarrow(q-1)/q$ as $t\rightarrow\infty$ (which incidentally implies, taking $q=2$, that the factor $2$ in Lemma 2.20(b) cannot be reduced) whereas for the continuized chain $d(t)\rightarrow 0$.
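This contrast is easy to see numerically for the deterministic two-state cycle, which has period $q=2$; the sketch below (ours) compares the discrete-time chain with its continuization $P_t=e^{t(P-I)}$.

```python
import numpy as np
from scipy.linalg import expm

# Deterministic 2-cycle: period q = 2, uniform stationary distribution.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
pi = np.array([0.5, 0.5])

def variation_distance(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

for t in [1, 2, 5, 10]:
    disc = np.linalg.matrix_power(P, t)[0]     # discrete chain started at state 0
    cont = expm(t * (P - np.eye(2)))[0]        # continuized chain at time t
    print(t, variation_distance(disc, pi), round(variation_distance(cont, pi), 6))
# Discrete-time distance stays at 1/2 = (q-1)/q; continuized distance tends to 0.
```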

One often sees slightly disguised corollaries of the submultiplicativity property in the literature. The following is a typical one.

Corollary 2.21

Suppose there exists a probability measure $\mu$, a real $\delta>0$ and a time $t$ such that

p^{(t)}_{ij}\geq\delta\mu_{j}\ \ \forall i,j.

Then

d(s)\leq(1-\delta)^{\lfloor s/t\rfloor},\ s\geq 0.

Proof. The hypothesis implies $\bar{d}(t)\leq 1-\delta$, by the third equality in Lemma 2.19, and then the conclusion follows from submultiplicativity together with parts (c) and (d) of Lemma 2.20.
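One concrete way to apply the corollary: take $\mu_{j}$ proportional to the column minima of $p^{(t)}$, which makes the hypothesis hold with $\delta=\sum_{j}\min_{i}p^{(t)}_{ij}$ (provided these minima are not all zero). The Python sketch below (ours; the chain and the value $t=2$ are arbitrary choices for illustration) compares the resulting bound with the exact $d(s)$.

```python
import numpy as np

def variation_distance(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

# Illustrative 3-state chain.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
t = 2
Pt = np.linalg.matrix_power(P, t)

# mu proportional to the column minima gives p^(t)_ij >= delta * mu_j.
col_min = Pt.min(axis=0)
delta = col_min.sum()

# Stationary distribution: left eigenvector for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

for s in [2, 4, 8, 16]:
    Ps = np.linalg.matrix_power(P, s)
    d_s = max(variation_distance(row, pi) for row in Ps)
    print(s, round(d_s, 6), "<=", round((1.0 - delta) ** (s // t), 6))
```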

2.4.2 $L^{2}$ distance

Another notion of distance, which is less intuitively natural but often more mathematically tractable, is $L^{2}$ distance. This is defined with respect to some fixed reference probability distribution $\pi$ on $I$, which for our purposes will be the stationary distribution of some irreducible chain under consideration (and so $\pi_{i}>0\ \forall i$). The $L^{2}$ norm of a function $f:I\rightarrow R$ is

||f||_{2}=\sqrt{\sum_{i}\pi_{i}f^{2}(i)}. (2.18)

We define the $L^{2}$ norm of a signed measure $\nu$ on $I$ by

||\nu||_{2}=\sqrt{\sum_{i}\nu^{2}_{i}/\pi_{i}}. (2.19)

This may look confusing, because a signed measure $\nu$ and a function $f$ are in a sense “the same thing”, being determined by values $(f(i);i\in I)$ or $(\nu_{i};i\in I)$ which can be chosen arbitrarily. But the measure $\nu$ can also be determined by its density function $f(i)=\nu_{i}/\pi_{i}$, and so (2.18) and (2.19) say that the $L^{2}$ norm of a signed measure is defined to be the $L^{2}$ norm of its density function.

So $||\theta-\mu||_{2}$ is the “$L^{2}$” measure of distance between probability distributions $\theta,\mu$. In particular, the distance between $\theta$ and the reference distribution $\pi$ is

||\theta-\pi||_{2}=\sqrt{\sum_{i}\frac{(\theta_{i}-\pi_{i})^{2}}{\pi_{i}}}=\sqrt{\sum_{i}\frac{\theta^{2}_{i}}{\pi_{i}}-1}.

In Example 2.18 we find $||\theta-\pi||_{2}=1/4$.
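As a quick check (ours), both formulas above give $1/4$ on Example 2.18, compared with variation distance $0.1$:

```python
import numpy as np

pi = np.array([0.8, 0.2])      # reference (stationary) distribution
theta = np.array([0.7, 0.3])

variation = 0.5 * np.abs(theta - pi).sum()        # 0.1
l2 = np.sqrt(((theta - pi) ** 2 / pi).sum())      # 0.25, first formula
l2_alt = np.sqrt((theta ** 2 / pi).sum() - 1.0)   # 0.25, second formula
print(variation, l2, l2_alt)
```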

Writing $\theta(t)$ for the distribution at time $t$ of a chain with stationary distribution $\pi$, it is true (cf. Lemma 2.20(d) for variation distance) that $||\theta(t)-\pi||_{2}$ is decreasing with $t$. Since there is a more instructive proof in the reversible case (Chapter 3 Lemma 23) (yyy 9/2/94 version) we won’t prove the general case (see Notes).

Analogous to the $L^{2}$ norms are the $L^{1}$ norms

||f||_{1}=\sum_{i}\pi_{i}|f(i)|
||\nu||_{1}=\sum_{i}|\nu_{i}|.

The Cauchy-Schwarz inequality gives $||\cdot||_{1}\leq||\cdot||_{2}$. Note that in the definition of $||\nu||_{1}$ the reference measure $\pi$ has “cancelled out”. Lemma 2.19 shows that for probability measures $\theta_{1},\theta_{2}$ the $L^{1}$ distance is the same as variation distance, up to a factor of $2$:

||\theta_{1}-\theta_{2}||={\textstyle\frac{1}{2}}||\theta_{1}-\theta_{2}||_{1}.

As a trivial example in the Markov chain setting, consider

Example 2.22

Take $I=\{0,1,\ldots,n-1\}$, fix a parameter $0<a<1$ and define a transition matrix

p_{ij}=a1_{(j=i+1\ \bmod\ n)}\ +\ \frac{1-a}{n}.

In this example the $t$-step transition probabilities are

p^{(t)}_{ij}=a^{t}1_{(j=i+t\ \bmod\ n)}\ +\ \frac{1-a^{t}}{n}

and the stationary distribution $\pi$ is uniform. We calculate (for arbitrary $j\neq i$)

d(t)=||P_{i}(X_{t}\in\cdot)-\pi||=(1-n^{-1})a^{t}
\bar{d}(t)=||P_{i}(X_{t}\in\cdot)-P_{j}(X_{t}\in\cdot)||=a^{t}
||P_{i}(X_{t}\in\cdot)-\pi||_{2}=(n-1)^{1/2}a^{t}.
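These closed forms are straightforward to verify numerically; the Python sketch below (ours, with $n=5$ and $a=0.6$ chosen arbitrarily) checks all three against direct computation.

```python
import numpy as np

n, a = 5, 0.6

# Example 2.22: step +1 (mod n) with probability a, otherwise a uniform jump.
P = np.array([[a * (j == (i + 1) % n) + (1 - a) / n for j in range(n)]
              for i in range(n)])
pi = np.full(n, 1.0 / n)                       # uniform stationary distribution

def variation_distance(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

for t in [1, 2, 3, 5]:
    Pt = np.linalg.matrix_power(P, t)
    d_t = max(variation_distance(row, pi) for row in Pt)
    dbar_t = max(variation_distance(Pt[i], Pt[j])
                 for i in range(n) for j in range(n) if i != j)
    l2_t = np.sqrt(((Pt[0] - pi) ** 2 / pi).sum())
    print(t,
          np.isclose(d_t, (1 - 1 / n) * a ** t),      # d(t) = (1 - 1/n) a^t
          np.isclose(dbar_t, a ** t),                 # dbar(t) = a^t
          np.isclose(l2_t, np.sqrt(n - 1) * a ** t))  # L2 distance = sqrt(n-1) a^t
```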

2.4.3 Exponential tails of hitting times

The submultiplicative property of $\bar{d}(t)$ is one instance of a general principle:

because our state space is finite, many quantities which converge to zero as $t\to\infty$ must converge exponentially fast, by iterating over worst-case initial states.

Here’s another instance, tails of hitting time distributions.

Consider the first hitting time $T_{A}$ on a subset $A$. Define $t^{*}_{A}:=\max_{i}E_{i}T_{A}$. For any initial distribution $\mu$, any time $s>0$ and any integer $m\geq 1$,

P_{\mu}(T_{A}>ms|T_{A}>(m-1)s) = P_{\theta}(T_{A}>s)\mbox{ for some dist. }\theta
\leq \max_{i}P_{i}(T_{A}>s)
\leq t^{*}_{A}/s.

So by induction on $m$

P_{\mu}(T_{A}>js)\leq(t^{*}_{A}/s)^{j}

implying

P_{\mu}(T_{A}>t)\leq(t^{*}_{A}/s)^{\lfloor t/s\rfloor},\ \ \ t>0.

In continuous time, a good (asymptotically optimal) choice of $s$ is $s=et^{*}_{A}$, giving the exponential tail bound

\sup_{\mu}P_{\mu}(T_{A}>t)\leq\exp\left(-\left\lfloor\frac{t}{et^{*}_{A}}\right\rfloor\right),\ \ 0<t<\infty. (2.20)

A messier bound holds in discrete time, where we have to choose $s$ to be an integer.
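The discrete-time version of the argument can be illustrated directly. The Python sketch below (ours; the chain and target set are arbitrary) computes $t^{*}_{A}$ by solving a linear system, evaluates $P_{i}(T_{A}>t)$ exactly from the transition matrix restricted to $I\setminus A$, and compares the worst-case tail with the bound $(t^{*}_{A}/s)^{\lfloor t/s\rfloor}$ for an integer $s>t^{*}_{A}$.

```python
import numpy as np

# Illustrative 4-state chain; hit the set A = {3}.
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.3, 0.5, 0.1],
              [0.3, 0.3, 0.2, 0.2]])
A = [3]
rest = [i for i in range(len(P)) if i not in A]
Q = P[np.ix_(rest, rest)]                # substochastic matrix on I \ A

# E_i T_A solves (I - Q) x = 1 on I \ A (and is 0 on A itself).
x = np.linalg.solve(np.eye(len(rest)) - Q, np.ones(len(rest)))
t_star = x.max()                         # t*_A = max_i E_i T_A

s = int(np.ceil(t_star)) + 1             # any integer s with t*_A / s < 1
for t in [10, 25, 50, 100]:
    # P_i(T_A > t) = sum_j (Q^t)_{ij}; take the worst starting state.
    tail = np.linalg.matrix_power(Q, t).sum(axis=1).max()
    bound = (t_star / s) ** (t // s)
    print(t, round(tail, 8), "<=", round(bound, 8))
```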