2 General Markov Chains (September 10, 1999)

2.1 Notation and reminders of fundamental results

We recommend the textbook of Norris [270] for a clear treatment of the basic theory and a wide selection of applications.

Write $I=\{i,j,k,\ldots\}$ for a finite state space. Write ${\bf P}=(p_{ij})$ for the transition matrix of a discrete-time Markov chain $(X_t : t=0,1,2,\ldots)$. To avoid trivialities let's exclude the one-state chain (two-state chains are useful, because surprisingly often general inequalities are sharp for two-state chains). The $t$-step transition probabilities are $P(X_t=j\,|\,X_0=i)=p^{(t)}_{ij}$, where ${\bf P}^{(t)}={\bf P}{\bf P}\cdots{\bf P}$ is the $t$-fold matrix product. Write $P_i(\cdot)$ and $E_i(\cdot)$ for probabilities and expectations for the chain started at state $i$ and time $0$. More generally, write $P_\rho(\cdot)$ and $E_\rho(\cdot)$ for probabilities and expectations for the chain started at time $0$ with distribution $\rho$. Write

$$T_i=\min\{t\geq 0: X_t=i\}$$

for the first hitting time on state ii, and write

$$T^+_i=\min\{t\geq 1: X_t=i\}.$$

Of course $T^+_i=T_i$ unless $X_0=i$, in which case we call $T^+_i$ the first return time to state $i$. More generally, a subset $A$ of states has first hitting time

$$T_A=\min\{t\geq 0: X_t\in A\}.$$
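As a quick numerical illustration, the $t$-step probabilities are literally matrix powers. Here is a minimal Python/numpy sketch; the two-state matrix is made up purely for illustration:

\begin{verbatim}
import numpy as np

# A hypothetical two-state transition matrix P (rows sum to 1),
# chosen only for illustration.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# P^(t) is the t-fold matrix product, so p_ij^(t) = (P^t)_ij.
t = 5
Pt = np.linalg.matrix_power(P, t)
print(Pt[0, 1])   # P(X_5 = 1 | X_0 = 0)
\end{verbatim}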

We shall frequently use without comment “obvious” facts like the following.

Start a chain at state $i$, wait until it first hits $j$, then wait until the time ($S$, say) at which it next hits $k$. Then $E_iS=E_iT_j+E_jT_k$.

The elementary proof sums over the possible values $t$ of $T_j$. The sophisticated proof appeals to the strong Markov property ([270] section 1.4) of the stopping time $T_j$, which implies

$$E_i(S\,|\,X_t, t\leq T_j)=T_j+E_jT_k.$$

Recall that the symbol $|$ is the probabilist’s shorthand for “conditional on”.
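The identity is also easy to check by simulation. A minimal sketch, assuming a made-up three-state transition matrix, estimating both sides of $E_iS=E_iT_j+E_jT_k$:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
# A hypothetical three-state transition matrix (rows sum to 1),
# made up for illustration.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def hit_time(state, target):
    # Run the chain from `state` until it first hits `target`.
    t = 0
    while state != target:
        state = rng.choice(3, p=P[state])
        t += 1
    return t

i, j, k, n = 0, 1, 2, 20000
# Direct estimate of E_i S: hit j first, then continue from j until k.
ES = np.mean([hit_time(i, j) + hit_time(j, k) for _ in range(n)])
# Independent estimates of the two mean hitting times.
EiTj = np.mean([hit_time(i, j) for _ in range(n)])
EjTk = np.mean([hit_time(j, k) for _ in range(n)])
print(ES, EiTj + EjTk)   # equal up to Monte Carlo error
\end{verbatim}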

2.1.1 Stationary distribution and asymptotics

Now assume the chain is irreducible. A fundamental result ([270] Theorems 1.7.7 and 1.5.6) is that there exists a unique stationary distribution $\pi=(\pi_i : i\in I)$, i.e. a unique probability distribution satisfying the balance equations

$$\pi_j=\sum_i\pi_i p_{ij}\ \mbox{ for all }j.\tag{2.1}$$
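For example, for a two-state chain with $p_{01}=a$ and $p_{10}=b$ (where $a,b>0$), the balance equations (2.1) reduce to $\pi_0 a=\pi_1 b$, so that

$$\pi_0=\frac{b}{a+b},\qquad \pi_1=\frac{a}{a+b}.$$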

One way to prove this existence in general (liked by probabilists because it extends easily to the countable-state setting) is to turn Lemma 2.6 below into a definition. That is, fix an arbitrary state $i_0$, define $\tilde{\pi}(i_0)=1$, and define

$$\tilde{\pi}(j)=E_{i_0}(\mbox{number of visits to $j$ before time $T^+_{i_0}$}),\quad j\neq i_0.$$

It can then be checked that $\pi_i := \tilde{\pi}(i)/\sum_j\tilde{\pi}(j)$ is a stationary distribution. The point of stationarity is that, if the initial position $X_0$ of the chain is random with the stationary distribution $\pi$, then the position $X_t$ at any subsequent non-random time $t$ has the same distribution $\pi$, and the process $(X_t, t=0,1,2,\ldots)$ is then called the stationary chain.
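Both routes to $\pi$ are straightforward to carry out numerically. The following sketch (with the same made-up three-state matrix as above) solves the balance equations (2.1) directly, and also estimates $\tilde{\pi}$ by simulating visits before the return time $T^+_{i_0}$:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
n = P.shape[0]

# Route 1: solve pi P = pi together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi = np.linalg.lstsq(A, b, rcond=None)[0]

# Route 2: expected visit counts before returning to i0, normalized.
i0, visits = 0, np.zeros(n)
for _ in range(20000):
    counts = np.zeros(n)
    counts[i0] = 1                      # the visit to i0 at time 0
    state = rng.choice(n, p=P[i0])
    while state != i0:                  # visits strictly before T+_{i0}
        counts[state] += 1
        state = rng.choice(n, p=P[state])
    visits += counts
pi_tilde = visits / visits.sum()
print(pi, pi_tilde)   # the two estimates agree
\end{verbatim}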

A highlight of elementary theory is that the stationary distribution plays the main role in asymptotic results, as follows.

Theorem 2.1 (The ergodic theorem: [270] Theorem 1.10.2)

Let $N_i(t)$ be the number of visits to state $i$ during times $0,1,\ldots,t-1$. Then for any initial distribution,

$$t^{-1}N_i(t)\rightarrow\pi_i\ \mbox{ a.s., as }t\rightarrow\infty.$$

Theorem 2.2 (The convergence theorem: [270] Theorem 1.8.3)

For any initial distribution,

$$P(X_t=j)\rightarrow\pi_j\mbox{ as }t\rightarrow\infty,\mbox{ for all }j$$

provided the chain is aperiodic.
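Both theorems are easy to observe empirically. A minimal sketch (same made-up matrix as above, which is aperiodic since its diagonal entries are positive):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Ergodic theorem: occupation fractions along one long trajectory.
state, visits, T = 0, np.zeros(3), 100000
for _ in range(T):
    visits[state] += 1
    state = rng.choice(3, p=P[state])
print(visits / T)                      # approximately pi

# Convergence theorem: every row of P^t approaches pi.
print(np.linalg.matrix_power(P, 50))
\end{verbatim}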

Theorem 2.1 is the simplest illustration of the ergodic principle “time averages equal space averages”. Many general identities for Markov chains can be regarded as aspects of the ergodic principle – in particular, in section 2.2.1 we use it to derive expressions for mean hitting times. Such identities are important and useful.

The most classical topic in mathematical probability is time-asymptotics for i.i.d. (independent, identically distributed) random sequences. A vast number of results are known, and (broadly speaking) have simple analogs for Markov chains. Thus the analog of the strong law of large numbers is Theorem 2.1, and the analog of the central limit theorem is Theorem 2.17 below. As mentioned in Chapter 1 section 2.1 (yyy 7/20/99 version) this book has a different focus, on results which say something about the behavior of the chain over some specific finite time, rather than what happens in the indefinite future.

2.1.2 Continuous-time chains

The theory of continuous-time Markov chains closely parallels that of the discrete-time chains discussed above. To the reader with background in algorithms or discrete mathematics, the introduction of continuous time may at first seem artificial and unnecessary, but it turns out that certain results are simpler in continuous time. See Norris [270] Chapters 2 and 3 for details on what follows.

A continuous-time chain is specified by transition rates $(q(i,j)=q_{ij},\ j\neq i)$ which are required to be non-negative but have no constraint on the sums. Given the transition rates, define

$$q_i := \sum_{j: j\neq i} q_{ij}\tag{2.2}$$

and extend $(q_{ij})$ to a matrix ${\bf Q}$ by putting $q_{ii}=-q_i$. The chain $(X_t : 0\leq t<\infty)$ has two equivalent descriptions.

1. Infinitesimal description. Given that $X_t=i$, the chance that $X_{t+dt}=j$ is $q_{ij}\,dt$ for each $j\neq i$.

2. Jump-and-hold description. Define a transition matrix ${\bf J}$ by $J_{ii}=0$ and

$$J_{ij} := q_{ij}/q_i,\quad j\neq i.\tag{2.3}$$

Then the continuous-time chain may be constructed by the two-step procedure

(i) Run a discrete-time chain $X^J$ with transition matrix ${\bf J}$.

(ii) Given the sequence of states $i_0,i_1,i_2,\ldots$ visited by $X^J$, the durations spent in states $i_m$ are independent exponential random variables with rates $q_{i_m}$.

The discrete-time chain $X^J$ is called the jump chain associated with $(X_t)$.
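A minimal simulation sketch of the jump-and-hold construction, with made-up rates $q_{ij}$ for a three-state chain:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical off-diagonal rates q_ij (zero diagonal here; the
# matrix Q itself would carry q_ii = -q_i).
rates = np.array([[0.0, 2.0, 1.0],
                  [0.5, 0.0, 0.5],
                  [1.0, 1.0, 0.0]])
q = rates.sum(axis=1)       # q_i, as in (2.2)
J = rates / q[:, None]      # jump-chain matrix (2.3), with J_ii = 0

def run(state, horizon):
    # Jump-and-hold: exponential(q_i) holds, jumps according to J.
    t, path = 0.0, [(0.0, state)]
    while True:
        t += rng.exponential(1.0 / q[state])  # numpy takes mean 1/q_i
        if t >= horizon:
            return path
        state = rng.choice(3, p=J[state])
        path.append((t, state))

print(run(0, 5.0))   # list of (jump time, new state) pairs
\end{verbatim}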

The results in the previous section go over to continuous-time chains with the following modifications.

(a) $P_i(X_t=j)=Q^{(t)}_{ij}$, where ${\bf Q}^{(t)}:=\exp({\bf Q}t)$; see the numerical sketch following this list.

(b) The definition of $T^+_i$ becomes

$$T^+_i=\min\{t\geq T_{I\setminus i} : X_t=i\}.$$

(c) If the chain is irreducible then there exists a unique stationary distribution $\pi$ characterized by

$$\sum_i\pi_i q_{ij}=0\ \mbox{ for all }j.$$

(d) In the ergodic theorem we interpret $N_i(t)$ as the total duration of time spent in state $i$ during $[0,t]$:

$$N_i(t) := \int_0^t 1_{(X_s=i)}\,ds.$$

(e) In the convergence theorem the assumption of aperiodicity is unnecessary. [This fact is one of the technical advantages of continuous time.]

(f) The evolution of $P(X_t=j)$ as a function of time is given by the forwards equations

$$\frac{d}{dt}P(X_t=j)=\sum_i P(X_t=i)\,q_{ij}.\tag{2.4}$$
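For (a), (c) and (e), the matrix exponential is readily computed numerically. A minimal sketch using scipy, with the made-up rates above assembled into a full ${\bf Q}$ (so $q_{ii}=-q_i$):

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

# The hypothetical rate matrix from above, with q_ii = -q_i.
Q = np.array([[-3.0,  2.0,  1.0],
              [ 0.5, -1.0,  0.5],
              [ 1.0,  1.0, -2.0]])

Qt = expm(Q * 10.0)   # Q^(t) at t = 10, as in (a)
print(Qt)             # rows nearly identical: P_i(X_t = .) -> pi,
                      # with no aperiodicity assumption, as in (e)
pi = Qt[0]
print(pi @ Q)         # approximately zero: the balance equations in (c)
\end{verbatim}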

Given a discrete-time chain $X$ with some transition matrix ${\bf P}$, one can define the continuized chain $\widetilde{X}$ to have transition rates $q_{ij}=p_{ij}$, $j\neq i$. In other words, we replace the deterministic time-$1$ holds between jumps by holds with exponential($1$) distribution. Many quantities are unchanged by the passage from the discrete-time chain to the continuized chain; in particular, the stationary distribution $\pi$ and the mean hitting times $E_iT_A$ are unchanged. Therefore results stated in continuous time can often be immediately applied in discrete time, and vice versa.
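Concretely, continuization sets ${\bf Q}={\bf P}-{\bf I}$. A short check (same made-up ${\bf P}$ as earlier) that $\pi$ satisfies both the discrete and continuous balance equations:

\begin{verbatim}
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
Q = P - np.eye(3)   # continuization: q_ij = p_ij for j != i,
                    # and q_ii = -(1 - p_ii)

# pi P = pi is the same linear condition as pi Q = 0, so the
# stationary distribution is unchanged by continuization.
A = np.vstack([Q.T, np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]),
                     rcond=None)[0]
print(pi, pi @ P)   # pi P returns pi
\end{verbatim}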

In different parts of the book we shall be working with discrete or continuous time as a current convention, mentioning where appropriate how results change in the alternate setting. Chapter 4 (yyy section to be written) will give a survey of the differences between these two settings.