# 2.4 Two metrics on distributions

A major theme of this book is quantifying the convergence theorem (Theorem 2.2) to give estimates of how close the distribution of a chain is to the stationary distribution at finite times. Such quantifications require some explicit choice of “distance” between distributions, and two of the simplest choices are explained in this section. We illustrate with a trivial

###### Example 2.18

Rain or shine?

Suppose the true probability of rain tomorrow is 80% whereas we think the probability is 70%. How far off are we? In other words, what is the “distance” between $\pi$ and $\theta$, where

 $\pi(\mbox{rain})=0.8,\ \pi(\mbox{shine})=0.2$
 $\theta(\mbox{rain})=0.7,\ \theta(\mbox{shine})=0.3.$

Different notions of distance will give different numerical answers. Our first notion abstracts the idea that the “additive error” in this example is $0.8-0.7=0.1$.

## 2.4.1 Variation distance

Perhaps the simplest notion of distance between probability distributions is variation distance, defined as

 $||\theta_{1}-\theta_{2}||=\max_{A\subseteq I}|\theta_{1}(A)-\theta_{2}(A)|.$

So variation distance is just the maximum additive error one can make, in using the “wrong” distribution to evaluate the probability of an event. In example 2.18, variation distance is $0.1$. Several equivalent definitions are provided by

###### Lemma 2.19

For probability distributions $\theta_{1},\theta_{2}$ on a finite state space $I$,

 $\displaystyle\frac{1}{2}\sum_{i}|\theta_{1}(i)-\theta_{2}(i)|$ $\displaystyle=$ $\displaystyle\sum_{i}(\theta_{1}(i)-\theta_{2}(i))^{+}$ $\displaystyle=$ $\displaystyle\sum_{i}(\theta_{1}(i)-\theta_{2}(i))^{-}$ $\displaystyle=$ $\displaystyle 1-\sum_{i}\min(\theta_{1}(i),\theta_{2}(i))$ $\displaystyle=$ $\displaystyle\max_{A\subseteq I}|\theta_{1}(A)-\theta_{2}(A)|$ $\displaystyle=$ $\displaystyle\min P(V_{1}\neq V_{2})$

the minimum taken over random pairs $(V_{1},V_{2})$ such that $V_{m}$ has distribution $\theta_{m}$ ($m=1,2$). So each of these quantities equals the variation distance $||\theta_{1}-\theta_{2}||$.

Proof. The first three equalities are clear. For the fourth, set $B=\{i:\theta_{1}(i)>\theta_{2}(i)\}$. Then

 $\displaystyle\theta_{1}(A)-\theta_{2}(A)$ $\displaystyle=$ $\displaystyle\sum_{i\in A}(\theta_{1}(i)-\theta_{2}(i))$ $\displaystyle\leq$ $\displaystyle\sum_{i\in A\cap B}(\theta_{1}(i)-\theta_{2}(i))$ $\displaystyle\leq$ $\displaystyle\sum_{i\in B}(\theta_{1}(i)-\theta_{2}(i))$ $\displaystyle=$ $\displaystyle\sum_{i}(\theta_{1}(i)-\theta_{2}(i))^{+}$

with equality when $A=B$. This, and the symmetric form, establish the fourth equality. In the final equality, the “$\leq$” follows from

 $|\theta_{1}(A)-\theta_{2}(A)|=|P(V_{1}\in A)-P(V_{2}\in A)|\leq P(V_{2}\neq V_% {1}).$

And equality is attained by the following joint distribution. Let $\theta(i)=\min(\theta_{1}(i),\theta_{2}(i))$ and let

 $P(V_{1}=i,V_{2}=i)=\theta(i)$
 $P(V_{1}=i,V_{2}=j)=\frac{(\theta_{1}(i)-\theta(i))(\theta_{2}(j)-\theta(j))}{1% -\sum_{k}\theta(k)},\ i\neq j.$

(If the denominator is zero, then $\theta_{1}=\theta_{2}$ and the result is trivial.) $\Box$

In the context of Markov chains we may use

 $d_{i}(t):=||P_{i}(X_{t}=\cdot)-\pi(\cdot)||$ (2.13)

as a measure of deviation from stationarity at time $t$, for the chain started at state $i$. Also define

 $d(t):=\max_{i}d_{i}(t)$ (2.14)

as the worst-case deviation from stationarity. Finally, it is technically convenient to introduce also

 $\bar{d}(t)=\max_{ij}||P_{i}(X_{t}=\cdot)-P_{j}(X_{t}=\cdot)||.$ (2.15)

In Chapter 4 we discuss, for reversible chains, relations between these “variation distance” notions and other measures of closeness-to-stationarity, and discuss parameters $\tau$ measuring “time until $d(t)$ becomes small” and their relation to other parameters of the chain. For now, let’s just introduce a fundamental technical fact, the submultiplicativity property.

###### Lemma 2.20

(a) $\bar{d}(s+t)\leq\bar{d}(s)\bar{d}(t),\ s,t\geq 0$ [the submultiplicativity property].

(b) $d(s+t)\leq 2d(s)d(t),\ s,t\geq 0$ .

(c) $d(t)\leq\bar{d}(t)\leq 2d(t),\ t\geq 0.$

(d) $d(t)$ and $\bar{d}(t)$ decrease as $t$ increases.

Proof. We use the characterization of variation distance as

 $||\theta_{1}-\theta_{2}||=\min P(V_{1}\neq V_{2}),$ (2.16)

the minimum taken over random pairs $(V_{1},V_{2})$ such that $V_{m}$ has distribution $\theta_{m}$ ($m=1,2$).

Fix states $i_{1},i_{2}$ and times $s,t$, and let ${\bf Y}^{1},{\bf Y}^{2}$ denote the chains started at $i_{1}$, $i_{2}$ respectively. By (2.16) we can construct a joint distribution for $(Y^{1}_{s},Y^{2}_{s})$ such that

 $\displaystyle P(Y^{1}_{s}\neq Y^{2}_{s})$ $\displaystyle=$ $\displaystyle||P_{i_{1}}(X_{s}=\cdot)-P_{i_{2}}(X_{s}=\cdot)||$ $\displaystyle\leq$ $\displaystyle\bar{d}(s).$

Now for each pair $(j_{1},j_{2})$, we can use (2.16) to construct a joint distribution for $(Y^{1}_{s+t},Y^{2}_{s+t})$ given $(Y^{1}_{s}=j_{1},Y^{2}_{s}=j_{2})$ with the property that

 $P(Y^{1}_{s+t}\neq Y^{2}_{s+t}|Y^{1}_{s}=j_{1},Y^{2}_{s}=j_{2})=||P_{j_{1}}(X_{% t}=\cdot)-P_{j_{2}}(X_{t}=\cdot)||.$

The right side is $0$ if $j_{1}=j_{2}$, and otherwise is at most $\bar{d}(t)$. So unconditionally

 $P(Y^{1}_{s+t}\neq Y^{2}_{s+t})\leq\bar{d}(s)\bar{d}(t)$

and (2.16) establishes part (a) of the lemma. For part (b), the same argument (with ${\bf Y}^{2}$ now being the stationary chain) shows

 $d(s+t)\leq d(s)\bar{d}(t)$ (2.17)

so that (b) will follow from the upper bound $\bar{d}(t)\leq 2d(t)$ in (c). But this upper bound is clear from the triangle inequality for variation distance. And the lower bound in (c) follows from the fact that $\mu\rightarrow||\theta-\mu||$ is a convex function, so that averaging over $j$ with respect to $\pi$ in (2.15) can only decrease distance. Finally, the “decreasing” property for $\bar{d}(t)$ follows from (a), and for $d(t)$ follows from (2.17). $\Box$

The assertions of this section hold in either discrete or continuous time. But note that the numerical value of $d(t)$ changes when we switch from a discrete-time chain to the continuized chain. In particular, for a discrete-time chain with period $q$ we have $d(t)\rightarrow(q-1)/q$ as $t\rightarrow\infty$ (which incidently implies, taking $q=2$, that the factor $2$ in Lemma 2.20(b) cannot be reduced) whereas for the continuized chain $d(t)\rightarrow 0$.

One often sees slightly disguised corollaries of the submultiplicativity property in the literature. The following is a typical one.

###### Corollary 2.21

Suppose there exists a probability measure $\mu$, a real $\delta>0$ and a time $t$ such that

 $p^{(t)}_{ij}\geq\delta\mu_{j}\ \forall i,j.$

Then

 $d(s)\leq(1-\delta)^{\lfloor s/t\rfloor},\ s\geq 0.$

Proof. The hypothesis implies $\bar{d}(t)\leq 1-\delta$, by the third equality in Lemma 2.19, and then the conclusion follows by submultiplicativity.

## 2.4.2 $L^{2}$ distance

Another notion of distance, which is less intuitively natural but often more mathematically tractable, is $L^{2}$ distance. This is defined with respect to some fixed reference probability distribution $\pi$ on $I$, which for our purposes will be the stationary distribution of some irreducible chain under consideration (and so $\pi_{i}>0\ \forall i$). The $L^{2}$ norm of a function $f:I\rightarrow R$ is

 $||f||_{2}=\sqrt{\sum_{i}\pi_{i}f^{2}(i)}.$ (2.18)

We define the $L^{2}$ norm of a signed measure $\nu$ on $I$ by

 $||\nu||_{2}=\sqrt{\sum_{i}\nu^{2}_{i}/\pi_{i}}.$ (2.19)

This may look confusing, because a signed measure $\nu$ and a function $f$ are in a sense “the same thing”, being determined by values $(f(i);i\in I)$ or $(\nu_{i};i\in I)$ which can be chosen arbitrarily. But the measure $\nu$ can also be determined by its density function $f(i)=\nu_{i}/\pi_{i}$, and so (2.18) and (2.19) say that the $L^{2}$ norm of a signed measure is defined to be the $L^{2}$ norm of its density function.

So $||\theta-\mu||_{2}$ is the “$L^{2}$” measure of distance between probability distributions $\theta,\mu$. In particular, the distance between $\theta$ and the reference distribution $\pi$ is

 $||\theta-\pi||_{2}=\sqrt{\sum_{i}\frac{(\theta_{i}-\pi_{i})^{2}}{\pi_{i}}}=% \sqrt{\sum_{i}\frac{\theta^{2}_{i}}{\pi_{i}}\ \ -1}.$

In Example 2.18 we find $||\theta-\pi||_{2}=1/4$.

Writing $\theta(t)$ for the distribution at time $t$ of a chain with stationary distribution $\pi$, it is true (cf. Lemma 2.20(d) for variation distance) that $||\theta(t)-\pi||_{2}$ is decreasing with $t$. Since there is a more instructive proof in the reversible case (Chapter 3 Lemma 23) (yyy 9/2/94 version) we won’t prove the general case (see Notes).

Analogous to the $L^{2}$ norms are the $L^{1}$ norms

 $||f||_{1}=\sum_{i}\pi_{i}|f(i)|$
 $||\nu||_{1}=\sum_{i}|\nu_{i}|.$

The Cauchy-Schwarz inequality gives $||\cdot||_{1}\leq||\cdot||_{2}$. Note that in the definition of $||\nu||_{1}$ the reference measure $\pi$ has “cancelled out”. Lemma 2.19 shows that for probability measures $\theta_{1},\theta_{2}$ the $L^{1}$ distance is the same as variation distance, up to a factor of $2$:

 $||\theta_{1}-\theta_{2}||={\textstyle\frac{1}{2}}||\theta_{1}-\theta_{2}||_{1}.$

As a trivial example in the Markov chain setting, consider

###### Example 2.22

Take $I=\{0,1,\ldots,n-1\}$, fix a parameter $0 and define a transition matrix

 $p_{ij}=a1_{(j=i+1\ \bmod\ n)}\ +\ \frac{1-a}{n}.$

In this example the $t$-step transition probabilities are

 $p^{(t)}_{ij}=a^{t}1_{(j=i+t\ \bmod\ n)}\ +\ \frac{1-a^{t}}{n}$

and the stationary distribution $\pi$ is uniform. We calculate (for arbitrary $j\neq i$)

 $d(t)=||P_{i}(X_{t}\in\cdot)-\pi||=(1-n^{-1})a^{t}$
 $\bar{d}(t)=||P_{i}(X_{t}\in\cdot)-P_{j}(X_{t}\in\cdot)||=a^{t}$
 $||P_{i}(X_{t}\in\cdot)-\pi||_{2}=(n-1)^{1/2}a^{t}.$

## 2.4.3 Exponential tails of hitting times

The submultiplicative property of $\bar{d}(t)$ is one instance of a general principle:

because our state space is finite, many quantities which converge to zero as $t\to\infty$ must converge exponentially fast, by iterating over worst-case initial states.

Here’s another instance, tails of hitting time distributions.

Consider the first hitting time $T_{A}$ on a subset $A$. Define $t^{*}_{A}:=\max_{i}E_{i}T_{A}$. For any initial distribution $\mu$, any time $s>0$ and any integer $m\geq 1$,

 $\displaystyle P_{\mu}(T_{A}>ms|T_{A}>(m-1)s)$ $\displaystyle=$ $\displaystyle P_{\theta}(T_{A}>s)\mbox{ for some dist. }\theta$ $\displaystyle\leq$ $\displaystyle\max_{i}P_{i}(T_{A}>s)$ $\displaystyle\leq$ $\displaystyle t^{*}_{A}/s.$

So by induction on $m$

 $P_{\mu}(T_{A}>js)\leq(t^{*}_{A}/s)^{j}$

implying

 $P_{\mu}(T_{A}>t)\leq(t^{*}_{A}/s)^{\lfloor t/s\rfloor},\ \ \ t>0.$

In continuous time, a good (asymptotically optimal) choice of $s$ is $s=et^{*}_{A}$, giving the exponential tail bound

 $\sup_{\mu}P_{\mu}(T_{A}>t)\leq\exp\left(-\left\lfloor\frac{t}{et^{*}_{A}}% \right\rfloor\right),\ \ 0 (2.20)

A messier bound holds in discrete time, where we have to choose $s$ to be an integer.