In practice, Monte Carlo simulations are done using deterministic pseudo-random number generators. Ideally one would prefer some physical device which generated “truly random” bits. Presumably any such physical random number generator would be rather slow compared to the speed of arithmetical calculations. This thinking has led to an area of theory in which the cost of a randomized algorithm is taken to be the number of truly random bits used.
Recall the Solovay-Strassen test of primality in Example 10.8. Philosophically, there is something unsettling about using a deterministic pseudo-random number generator in this context, so we regard this as a prototype example where one might want to use a hypothetical source of truly random bits. To pick a uniform random integer from $\{1,\ldots,n-1\}$ requires about $\log_2 n$ random bits, so the cost of the algorithm as presented above is about $\log_2 n \, \log_2(1/\delta)$ bits, where $\delta$ is the prescribed allowable error probability. But one can use the existence of explicit expanders and results like Lemma 10.12 to devise an algorithm which requires fewer truly random bits. Suppose we have an $(n-1)$-vertex $r$-regular expander, and label the vertices $1,\ldots,n-1$. To simulate a uniform random starting vertex and $k$ steps of the random walk requires about $\log_2 n + k\log_2 r$ bits. The chance that such a walk never hits the set $W$ of witnesses is, by Lemma 10.12, exponentially small in $k$. To make this chance at most $\delta$ we take $k = O(\log(1/\delta))$, and the cost becomes $\log_2 n + O(\log(1/\delta))$ bits. Thus, granted the existence of expanders on which we can efficiently list the neighbors of any specified vertex in order to simulate the random walk, the method of simulating (dependent) integers via the random walk (instead of independently) reduces the number of truly random bits required from $O(\log n \, \log(1/\delta))$ to $O(\log n + \log(1/\delta))$.
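To make the bit-accounting concrete, here is a small simulation sketch of the walk-based scheme. Everything specific in it is an assumption for illustration: the 4-regular graph built from two random permutations is only a stand-in for an explicit expander, and the "witness" set W is just a random half of the vertices.

```python
import math
import random

random.seed(1)
n, k, trials = 512, 10, 20000

# Stand-in "expander": a random 4-regular multigraph built from two random
# permutations (each contributes an out-edge and its inverse edge).  A real
# application would use an explicit expander family.
perms = [random.sample(range(n), n) for _ in range(2)]
inv = [[0] * n for _ in range(2)]
for t in range(2):
    for i, j in enumerate(perms[t]):
        inv[t][j] = i

def neighbors(v):
    return [perms[0][v], inv[0][v], perms[1][v], inv[1][v]]

# "Witnesses": an arbitrary half of the vertices (in the primality test, at
# least half the candidates are witnesses; here we simply pick a random half).
W = set(random.sample(range(n), n // 2))

misses = 0
for _ in range(trials):
    v = random.randrange(n)               # uniform random starting vertex
    hit = v in W
    for _ in range(k):
        v = random.choice(neighbors(v))   # one step of the walk: 2 random bits
        hit = hit or (v in W)
    misses += not hit
print("P(walk avoids W) ~", misses / trials)
print("random bits per trial: walk", round(math.log2(n)) + 2 * k,
      " independent", (k + 1) * round(math.log2(n)))
```

With $k+1$ independent uniform samples the avoidance probability would be $2^{-(k+1)}$; the walk achieves a comparably small failure probability while consuming far fewer bits (29 versus 99 with these parameters).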
The idea of using random walks on expanders for such algorithmic purposes is due to Ajtai et al [4]. Following Impagliazzo and Zuckerman [188], one can abstract the idea to rather general randomized algorithms. Suppose we are given a randomized algorithm, intended to decide whether an object $x$ has a property by outputting "Yes" or "No", and that for each $x$ the algorithm is correct with probability at least $1-\theta$, for some fixed $\theta < 1/2$, and uses at most $b$ random bits. Formally, the algorithm is a function $A: \{0,1\}^b \times \mathcal{X} \to \{\mbox{Yes},\mbox{No}\}$ such that
$$P\{A(\xi,x) \mbox{ is correct}\} \ge 1 - \theta, \qquad x \in \mathcal{X},$$
where $\xi$ is uniform on $\{0,1\}^b$ and the correct answer is "Yes" iff $x$ lies in the subset $\mathcal{X}_0$ of all objects possessing the property. To make the probability of incorrect classification be at most $\delta$ we may simply repeat the algorithm $m$ times with independent bits, where $m$ is chosen to make
$$P\{\mbox{Binomial}(m,\theta) \ge m/2\} \le \delta,$$
and output Yes or No according to the majority of the individual outputs; this forces $m = O(\log(1/\delta))$. This requires $mb$ random bits. But instead we may take $\{0,1\}^b$ as the vertices of a degree-$r$ expander, and simulate a uniform random starting vertex and $m$ steps of random walk on the expander, using about $b + m\log_2 r$ random bits. For each of the $m+1$ vertices $\xi_0,\ldots,\xi_m$ visited by the walk, compute $A(\xi_i,x)$, and output Yes or No according to the majority of the individual outputs. The error probability is at most
$$P\{N_m \ge (m+1)/2\}$$
where $N_m$ is the number of visits to the "bad" set $B_x = \{\xi : A(\xi,x) \mbox{ is incorrect}\}$ by the walk $(\xi_i)$; note that the stationary (uniform) measure of $B_x$ is at most $\theta < 1/2$. By the large deviation bound for occupation measures (Theorem 10.11, yyy to be moved to other chapter) this error probability is at most
$$K e^{-cm}$$
for constants $K$ and $c > 0$ depending on $\theta$ and the expander. To reduce this below $\delta$ requires $m = O(\log(1/\delta))$. Thus the existence of (bounded-degree) expanders implies that the number of random bits required is only
$$b + O(\log(1/\delta)),$$
compared to $O(b\log(1/\delta))$ using independent sampling.
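A small simulation sketch of this majority-vote scheme follows. All specifics here are assumptions for illustration: the random-permutation graph stands in for an explicit expander, and the base algorithm's error probability is taken to be theta = 0.1, encoded as a random bad set B of that density.

```python
import random
from math import comb

random.seed(4)
b, m, trials = 9, 6, 50000
n = 1 << b            # the vertex set stands for {0,1}^b
theta = 0.1           # base algorithm's error probability (assumed)

# Stand-in degree-4 "expander" from two random permutations; a real
# application would use an explicit expander construction.
perms = [random.sample(range(n), n) for _ in range(2)]
inv = [[0] * n for _ in range(2)]
for t in range(2):
    for i, j in enumerate(perms[t]):
        inv[t][j] = i

# Bad set B_x: the seeds on which the base algorithm answers incorrectly.
B = set(random.sample(range(n), int(theta * n)))

errors = 0
for _ in range(trials):
    v = random.randrange(n)
    bad = int(v in B)
    for _ in range(m):
        v = random.choice([perms[0][v], inv[0][v], perms[1][v], inv[1][v]])
        bad += v in B
    errors += 2 * bad > m + 1        # majority of the m+1 outputs is wrong
print("walk majority error ~", errors / trials)

# Independent repetition for comparison: P(Binomial(m+1, theta) > (m+1)/2).
indep = sum(comb(m + 1, j) * theta ** j * (1 - theta) ** (m + 1 - j)
            for j in range((m + 1) // 2 + 1, m + 2))
print("independent majority error:", round(indep, 6))
print("random bits: walk", b + 2 * m, " independent", (m + 1) * b)
```

Both error probabilities are comparably small, but the walk uses 21 truly random bits against 63 for independent repetition; for fixed theta the gap grows like $b\log(1/\delta)$ versus $b + \log(1/\delta)$.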
In Chapter 6 section 8.2 (yyy currently at end of this Chapter; to be moved) we gave a standard use of the probabilistic method. Here is a less standard use, from Aldous [13], in which we use the sample path of a random walk to make a construction.
Consider a function $f$ defined on the vertices of an $n$-vertex graph $G$. Constrain $f$ to have no local minima except the global minimum (for simplicity, suppose the values of $f$ are distinct). We seek algorithms to find the vertex at which $f$ is minimized. Any deterministic "descent" algorithm will work, but it might work slowly. Could there be some more sophisticated algorithm which always works quickly? One idea is multi-start descent: pick $m$ vertices uniformly at random; from these, choose the vertex with minimum $f$-value, and follow the greedy descent algorithm. On a degree-$r$ graph the mean time is $O(m + rn/m)$, since the best of the $m$ starting values has mean rank about $n/m$ and each descent step, costing $r$ evaluations, strictly lowers the rank. Now specialize to the case where $G$ is the $d$-cube, so that $n = 2^d$ and $r = d$. One can give examples where single-start descent (from a uniform random start) has mean time of strictly larger exponential order, so from a worst-case mean-time viewpoint, multi-start is better. The next theorem shows that (again from a worst-case mean-time viewpoint) one cannot essentially improve on multi-start descent. Consider random walk $(X_t)$ on the $d$-cube started at a uniform random vertex $X_0$, and let $H(v)$ be the first hitting time on $v$. Then $f = H$ is a random function satisfying the constraint, minimized at $X_0$, but
Every algorithm for locating the minimizing vertex $X_0$ by examining values of $f$ requires examining a mean number $2^{(1/2 - o(1))d}$ of vertices.
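For small $d$ the random function $f = H$ in the theorem can be generated and checked directly. The sketch below (with $d = 8$, an arbitrary small choice) simulates the walk until the cube is covered, verifies that $H$ has no local minima other than the starting vertex, and runs greedy descent; it illustrates only the construction, not the lower-bound argument.

```python
import random

random.seed(0)
d = 8                 # small cube so the walk covers quickly (assumed choice)
n = 1 << d

# Run the walk from a uniform start until every vertex is hit;
# f(v) = H(v), the first hitting time on v.
v0 = random.randrange(n)
f = {v0: 0}
v, t = v0, 0
while len(f) < n:
    t += 1
    v ^= 1 << random.randrange(d)     # flip a uniform random coordinate
    if v not in f:
        f[v] = t

# Constraint check: every vertex except the start has a neighbor with a
# strictly smaller f-value (the walk arrived from such a neighbor), so
# greedy descent never gets stuck at a local minimum.
def has_better_neighbor(u):
    return any(f[u ^ (1 << i)] < f[u] for i in range(d))

assert all(has_better_neighbor(u) for u in range(n) if u != v0)

# Greedy descent from a uniform random start always reaches the minimum v0.
u = random.randrange(n)
steps = 0
while has_better_neighbor(u):
    u = min((u ^ (1 << i) for i in range(d)), key=f.get)
    steps += 1
u_final = u
print("descent reached the global minimum in", steps, "steps")
```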
The argument is simple in outline. As a preliminary calculation, consider random walk on the -cube of length , started at , and let be the time of the last visit to , with if is not visited. Then
(10.13)
where the bound holds because the worst-case for the sum is and, switching to continuous time,
Now consider an algorithm which has evaluated and write say. It does no harm to suppose . Conditional on the information revealed by , the distribution of the walk is specified by:
(a) take a random walk from a uniform random start , and condition on ;
(b) condition further on the walk not hitting before time .
The key point, which of course is technically hard to deal with, is that the conditioning in (b) has little effect. If we ignore the conditioning in (b), then by reversing time we see that the random variables have the same distribution as the random variables (up to vertex relabeling). So whatever vertex the algorithm chooses to evaluate next, inequality (10.13) shows that the mean improvement in objective value is , and so it takes steps to reduce the objective value from to .
Consider again the $d$-cube $I = \{0,1\}^d$ with Hamming distance $d(\cdot,\cdot)$. Let $T$ be the vertex-set of an $M$-vertex binary tree. For an embedding, i.e. an arbitrary function $\phi: T \to I$, define
$$\mbox{load}(\phi) = \max_{i \in I} |\phi^{-1}(i)|, \qquad \mbox{dilation}(\phi) = \max\{d(\phi(t_1),\phi(t_2)) : (t_1,t_2) \mbox{ an edge of } T\}.$$
How can we choose an embedding which makes both load and dilation small? This was studied by Bhatt and Cai [47] as a toy model for parallel computation. In the model, the cube represents the set of processors and the tree represents the set of tasks being done at a particular time, the tree structure indicating tasks being split into sub-tasks. To assign tasks to processors, we desire no one processor to have many tasks (small load), and we desire processors working on tasks and their sub-tasks to be close (small dilation) to facilitate communication. As the computation proceeds the tree will undergo local changes, as tasks are completed and new tasks started and split into sub-tasks, and we desire to be able to update the embedding "locally" in response to local changes in the tree. Bhatt and Cai [47] investigated the natural random walk embedding, where the root of $T$ is embedded at a fixed vertex, and recursively each child $t'$ of a tree-vertex $t$ is embedded at the vertex found at step $k_0$ (for some fixed even $k_0$) of a random walk started at $\phi(t)$. So by construction dilation$(\phi) \le k_0$, and the mathematical issue is to estimate the load. As before, the details are technically complicated, but let us outline one calculation. Clearly the load is at least $M2^{-d}$, so we would like the mean number of vertices of $T$ embedded at any particular vertex of the cube to be $O(M2^{-d})$. In bounding this mean, because $p^{(k)}_{vw} \le p^{(k)}_{vv}$ for even $k$ (Chapter 7 Corollary 3) (yyy 1/31/94 version) we see that the worst case is $w = v$, and then because $p^{(k)}_{vv}$ is decreasing in even $k$ we see that the worst-case $M$-vertex binary tree is a maximally balanced tree. Thus we want
(10.14)
From the analysis of random walk on the -cube (Chapter 5 Example 15) (yyy 4/23/96 version) one can show
It follows that (10.14) holds if we take .
Of course to bound the load we need to consider the maximally-loaded vertex, rather than a typical vertex. Considering $M = 2^d$ for definiteness, if the $M$ tree-vertices were assigned independently and uniformly, the mean load at a typical vertex would be 1, and classical "balls in bins" arguments show the maximal load would be $\Theta(d/\log d)$. We have shown that with the tree-embedding the mean load at a typical vertex is $O(1)$, so analogously one can show the maximal load is $O(d/\log d)$. However, [47] shows that by locally redistributing tasks assigned to the same processor, one can reduce the maximal load to $O(1)$ while maintaining the dilation at $O(1)$.
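A quick simulation of the random walk embedding makes the definitions concrete. The parameters here are assumptions for illustration: a complete binary tree on 255 nodes embedded in the 8-cube, with walk length $k_0 = 2$ per tree edge. The dilation bound holds by construction, and the empirical load can be inspected.

```python
import random
from collections import Counter

random.seed(2)
d = 8
n = 1 << d
k0 = 2                    # even walk length per tree edge (assumed parameter)
tree_size = n - 1         # complete binary tree on 255 nodes, heap-indexed

pos = {0: random.randrange(n)}          # embed the root at a random vertex
for node in range(1, tree_size):
    parent = (node - 1) // 2
    v = pos[parent]
    for _ in range(k0):                 # k0-step walk from the parent's image
        v ^= 1 << random.randrange(d)   # flip a uniform random coordinate
    pos[node] = v

load = Counter(pos.values())
dilation = max(bin(pos[(node - 1) // 2] ^ pos[node]).count("1")
               for node in range(1, tree_size))
print("mean load:", tree_size / n)
print("max load:", max(load.values()))
print("dilation:", dilation)
```

Dilation is at most $k_0$ because a child's image differs from its parent's in at most $k_0$ coordinates; the interesting quantity is how far the maximal load exceeds the mean load of about 1.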
Here we describe work of Coppersmith et al [99]. As in Chapter 3 section 2 (yyy 9/2/94 version) consider a weighted graph on $n$ vertices, but now write the edge-weights as $(c_{ij})$ and regard them as a matrix of costs. Let $P = (p_{ij})$ be the transition matrix of an irreducible Markov chain whose only transitions are along edges of the graph. For each $i$ and $j$ let $\bar{c}_{ij}$ be the mean cost of the random walk from $i$ to $j$, when traversing an edge $(v,w)$ incurs cost $c_{vw}$. Define the stretch $\sigma(P,c)$ to be the smallest $\sigma$ such that there exists a constant $K$ such that, for arbitrary vertex sequences $j_0, j_1, \ldots, j_m$,
(10.15)
Note that the stretch is invariant under scaling of the cost matrix $c$.
(a) .
(b) If is reversible and is the matrix of mean commute times then .
(c) For any cost matrix there exists a reversible transition matrix with matrix of mean commute times such that, for some constant ,
So from (b) and invariance under scaling, .
We shall prove (a) and (b), which are just variations of the standard theory of mean hitting times developed in Chapters 2 and 3. The proof of part (c) involves “convex programming” and is rather outside our scope. The algorithmic interpretations are also rather too lengthy to give in detail, but are easy to say in outline. Imagine a problem where it is required to pick a minimum-cost path, where the cost of a path consists of costs of traversing edges, together with extra costs and constraints. There is some optimal off-line solution, which may be hard to calculate. In such a problem, one may be able to use Proposition 10.10(c) to show that the algorithm which simply picks a random sample path (with transition matrix from (c)) has mean cost not more than times the cost of the optimal path.
Proof. Write $\pi$ for the stationary distribution of $P$. Write for the mean cost of an excursion from to , and write . Then by the ergodic argument (Chapter 2 Lemma 30) (yyy 8/18/94 version), and so
In other words,
(10.16)
Now apply the definition (10.15) of the stretch to the sequence of states visited by the stationary time-reversed chain; by considering the mean cost of each step,
(10.17)
But the left sides of (10.16) and (10.17) are equal by definition of , and the sum in the right of (10.17) equals by symmetry of , establishing (a). For (b), first note that the definition (10.15) of stretch is equivalent to
(10.18)
where denotes a cycle . Write . Fix a cycle and write for the mean time to complete the cyclic tour. By the ergodic argument (Chapter 2 Lemma 30) (yyy 8/18/94 version), the mean number of traversals of an edge during the tour is , and hence the ratio in (10.18) can be written as
(10.19)
Now the hypothesis of (b) is that $P$ is reversible and . So the second term of (10.19) equals by Chapter 3 Lemma 6 (yyy 9/2/94 version), and the first term equals by the cyclic tour property (Chapter 3 Lemma 1) (yyy 9/2/94 version). So for each cycle the ratio in (10.18) equals , establishing (b).
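The cyclic tour property invoked in the last step is easy to verify numerically. The sketch below builds a small reversible chain (random walk on a complete graph with symmetric random edge-weights, an arbitrary choice), computes mean hitting times by solving the standard linear systems, and checks that the mean tour time around a cycle is the same in both directions.

```python
import random

def solve(A, rhs):
    """Tiny Gauss-Jordan elimination with partial pivoting (small systems)."""
    m = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][m] / M[i][i] for i in range(m)]

random.seed(3)
n = 5
# Symmetric edge-weights w give a reversible chain P[i][j] = w[i][j] / w_i.
w = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = random.uniform(0.5, 2.0)
P = [[w[i][j] / sum(w[i]) for j in range(n)] for i in range(n)]

def hitting_times(target):
    """Mean hitting times E_i T_target via (I - Q) h = 1, Q = P restricted."""
    idx = [i for i in range(n) if i != target]
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in idx] for i in idx]
    h = solve(A, [1.0] * (n - 1))
    out = [0.0] * n
    for k, i in enumerate(idx):
        out[i] = h[k]
    return out

ET = [hitting_times(j) for j in range(n)]   # ET[j][i] = E_i T_j

cycle = [0, 1, 2, 3, 0]
forward = sum(ET[cycle[t + 1]][cycle[t]] for t in range(len(cycle) - 1))
backward = sum(ET[cycle[t]][cycle[t + 1]] for t in range(len(cycle) - 1))
print("forward tour:", round(forward, 6), " backward tour:", round(backward, 6))
assert abs(forward - backward) < 1e-8
```

For a non-reversible chain the two tour sums generally differ; reversibility is what makes the tour time direction-independent.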