10 Some Graph Theory and Randomized Algorithms (September 1 1999)

10.4 Miscellaneous graph algorithms

10.4.1 Amplification of randomness

In practice, Monte Carlo simulations are done using deterministic pseudo-random number generators. Ideally one would prefer some physical device which generated “truly random” bits. Presumably any such physical random number generator would be rather slow compared to the speed of arithmetical calculations. This thinking has led to an area of theory in which the cost of a randomized algorithm is taken to be the number of truly random bits used.

Recall the Solovay-Strassen test of primality in Example 10.8. Philosophically, there is something unsettling about using a deterministic pseudo-random number generator in this context, so we regard this as a prototype example where one might want to use a hypothetical source of truly random bits. To pick a uniform random integer from $\{1,2,\ldots,n\}$ requires about $\log_2 n$ random bits, so the cost of the algorithm as presented above is about $k\log_2 n = (\log_2 1/\varepsilon)\,(\log_2 n)$ bits, where $\varepsilon$ is the prescribed allowable error probability. But one can use the existence of explicit expanders and results like Lemma 10.12 to devise an algorithm which requires fewer truly random bits. Suppose we have an $n$-vertex $r$-regular expander, and label the vertices $\{1,2,\ldots,n\}$. To simulate a uniform random starting vertex and $t$ steps of the random walk requires about $\log_2 n + t\log_2 r$ bits. The chance that such a walk never hits the set $A$ of witnesses is, by Lemma 10.12, at most $\exp\left(-\frac{t}{2\tau_2}\right)$. To make this chance $\leq\varepsilon$ we take $t = 2\tau_2\log(1/\varepsilon)$, and the cost becomes $\log_2 n + 2\tau_2(\log_2 r)\log(1/\varepsilon)$. Thus, granted the existence of expanders on which we can efficiently list the neighbors of any specified vertex in order to simulate the random walk, the method of simulating the (dependent) integers $(w_i)$ via the random walk, instead of independently, reduces the number of truly random bits required from $O((\log n)\times(\log 1/\varepsilon))$ to $O(\max(\log n,\ \log 1/\varepsilon))$.
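
In code the scheme is straightforward. Here is a minimal sketch, assuming a hypothetical oracle `neighbors(v, j)` returning the $j$-th neighbor ($0\leq j<r$) of vertex $v$ in an explicit $r$-regular expander on $\{1,\ldots,n\}$, and a predicate `is_witness` implementing the underlying test (e.g. the Solovay-Strassen witness check); both names are placeholders, not library functions.

```python
import random

def expander_walk_test(is_witness, n, r, neighbors, t):
    """One-sided Monte Carlo test driven by an expander walk.

    Returns True as soon as a witness is found among the t+1 vertices
    visited by a walk from a uniform random start.  Consumes about
    log2(n) + t*log2(r) truly random bits, versus t*log2(n) bits
    for t independent uniform samples.
    """
    v = random.randrange(1, n + 1)             # ~log2(n) random bits
    for _ in range(t):
        if is_witness(v):
            return True
        v = neighbors(v, random.randrange(r))  # ~log2(r) bits per step
    return is_witness(v)
```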

The idea of using random walks on expanders for such algorithmic purposes is due to Ajtai et al [4]. Following Impagliazzo and Zuckerman [188] one can abstract the idea to rather general randomized algorithms. Suppose we are given a randomized algorithm, intended to determine whether an object $x\in{\cal X}$ has a property ${\cal P}$ by outputting "Yes" or "No", and that for each $x$ the algorithm is correct with probability $\geq 2/3$ and uses at most $b$ random bits. Formally, the algorithm is a function $A:{\cal X}\times\{0,1\}^b\rightarrow\{\mbox{Yes, No}\}$ such that

$$\mbox{if } x\in{\cal P} \mbox{ then } 2^{-b}|\{{\bf i}\in\{0,1\}^b : A(x,{\bf i})=\mbox{Yes}\}| \geq 2/3$$
$$\mbox{if } x\not\in{\cal P} \mbox{ then } 2^{-b}|\{{\bf i}\in\{0,1\}^b : A(x,{\bf i})=\mbox{Yes}\}| \leq 1/3$$

where ${\cal P}\subset{\cal X}$ is the subset of all objects possessing the property. To make the probability of incorrect classification be $\leq\varepsilon$ we may simply repeat the algorithm $m$ times, where $m=\Theta(\log 1/\varepsilon)$ is chosen to make

$$P(\mbox{Binomial}(m,2/3)\leq m/2)\leq\varepsilon,$$

and output Yes or No according to the majority of the $m$ individual outputs. This requires $bm=\Theta(b\log 1/\varepsilon)$ random bits. But instead we may take $\{0,1\}^b$ as the vertices of a degree-$r$ expander, and simulate a uniform random starting vertex and $m$ steps of random walk on the expander, using about $b+m\log_2 r$ random bits. For each of the $m+1$ vertices of $\{0,1\}^b$ visited by the walk $(Y_i,\ 0\leq i\leq m)$, compute $A(x,Y_i)$, and output Yes or No according to the majority of the $m+1$ individual outputs. The error probability is at most

$$\max_B P_\pi\left(\frac{N_{m+1}(B)}{m+1}-\pi(B)\geq\frac{1}{3}\right)$$

where $N_{m+1}(B)$ is the number of visits to $B$ by the walk $(Y_i,\ 0\leq i\leq m)$. By the large deviation bound for occupation measures (Theorem 10.11, yyy to be moved to other chapter) this error probability is at most

$$(1+c_1 m/\tau_2)\,\exp(-c_2 m/\tau_2)$$

for constants $c_1$ and $c_2$. To reduce this below $\varepsilon$ requires $m=O(\tau_2\log(1/\varepsilon))$. Thus the existence of (bounded-degree) expanders implies that the number of random bits required is only

$$b+m\log_2 r = O(\max(b,\ \log 1/\varepsilon))$$

compared to $O(b\log(1/\varepsilon))$ using independent sampling.
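
A sketch of this amplification scheme follows; the algorithm `A` and the oracle `expander_step(bits, j)`, returning the $j$-th neighbor of a $b$-bit string in a degree-$r$ expander on $\{0,1\}^b$, are placeholder assumptions.

```python
import random
from collections import Counter

def amplify(A, x, b, r, expander_step, m):
    """Majority vote of A(x, .) over the m+1 strings visited by an
    expander walk: about b + m*log2(r) random bits, versus
    b*m = Theta(b log(1/eps)) bits for m independent repetitions."""
    y = tuple(random.getrandbits(1) for _ in range(b))   # b random bits
    votes = Counter([A(x, y)])
    for _ in range(m):
        y = expander_step(y, random.randrange(r))        # log2(r) bits
        votes[A(x, y)] += 1
    return votes.most_common(1)[0][0]                    # "Yes" or "No"
```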

10.4.2 Using random walk to define an objective function

In Chapter 6 section 8.2 (yyy currently at end of this Chapter; to be moved) we gave a standard use of the probabilistic method. Here is a less standard use, from Aldous [13], where we use the sample path of a random walk to make a construction.

Consider a function $h$ defined on the vertices of an $n$-vertex graph $G$. Constrain $h$ to have no local minima except the global minimum (for simplicity, suppose the values of $h$ are distinct). We seek algorithms to find the vertex $v$ at which $h(v)$ is minimized. Any deterministic "descent" algorithm will work, but it might work slowly. Could there be some more sophisticated algorithm which always works quickly? One idea is multi-start descent. Pick $n^{1/2}$ vertices uniformly at random; from these, choose the vertex with minimum $h$-value, and follow the greedy descent algorithm. On a degree-$d$ graph, the mean time is $O(dn^{1/2})$. Now specialize to the case where $G$ is the $d$-cube. One can give examples where single-start (from a uniform random start) descent has mean time $\Omega(2^{(1-\varepsilon)d})$, so from a worst-case mean-time viewpoint, multi-start is better. The next theorem shows that (again from a worst-case mean-time viewpoint) one cannot essentially improve on multi-start descent. Consider random walk on the $d$-cube started at a uniform random vertex $U$ and let $H(v)$ be the first hitting time on $v$. Then $H$ is a random function satisfying the constraint, minimized at $v=U$, but

Theorem 10.9 ([12])

Every algorithm for locating $U$ by examining values $H(v)$ requires examining a mean number $\Omega(2^{d/2-\varepsilon})$ of vertices.

The argument is simple in outline. As a preliminary calculation, consider random walk on the $d$-cube of length $t_0 = O(2^{d/2-\varepsilon})$, started at ${\bf 0}$, and let $L_v$ be the time of the last visit to $v$, with $L_v = 0$ if $v$ is not visited. Then

$$E L_v \leq \sum_{t=1}^{t_0} t\,P_{\bf 0}(X(t)=v) = O(1) \qquad (10.13)$$

where the $O(1)$ bound holds because the worst-case $v$ for the sum is $v={\bf 0}$ and, switching to continuous time,

$$\int_0^{2^{d/2}} t\,P_{\bf 0}(\tilde X(t)={\bf 0})\,dt = \int_0^{2^{d/2}} t\left(\frac{1+e^{-2t/d}}{2}\right)^d dt = O(1).$$
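
As a quick numerical sanity check (our own addition, not part of the argument), one can evaluate this integral for moderate $d$ and watch it stay bounded:

```python
import math

def I(d, steps=200000):
    """Trapezoidal estimate of int_0^{2^(d/2)} t ((1+e^{-2t/d})/2)^d dt."""
    T = 2.0 ** (d / 2)
    h = T / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        f = t * ((1.0 + math.exp(-2.0 * t / d)) / 2.0) ** d
        total += f / 2 if k in (0, steps) else f
    return total * h

for d in (8, 12, 16, 20):
    print(d, round(I(d), 3))   # values stay O(1) as d grows
```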

Now consider an algorithm which has evaluated $H(v_1),\ldots,H(v_m)$, and write $t_0 = \min_{i\leq m} H(v_i) = H(v^*)$, say. It does no harm to suppose $t_0 = O(2^{d/2-\varepsilon})$. Conditional on the information revealed by $H(v_1),\ldots,H(v_m)$, the distribution of the walk $(X(t);\ 0\leq t\leq t_0)$ is specified by

(a) take a random walk from a uniform random start $U$, and condition on $X(t_0)=v^*$;

(b) condition further on the walk not hitting $\{v_i\}$ before time $t_0$.

The key point, which of course is technically hard to deal with, is that the conditioning in (b) has little effect. If we ignore the conditioning in (b), then by reversing time we see that the random variables $(H(v^*)-H(v))^+$ have the same distribution as the random variables $L_v$ (up to vertex relabeling). So whatever vertex $v$ the algorithm chooses to evaluate next, inequality (10.13) shows that the mean improvement $E(H(v^*)-H(v))^+$ in objective value is $O(1)$, and so it takes $\Omega(2^{d/2-\varepsilon})$ steps to reduce the objective value from $2^{d/2-\varepsilon}$ to $0$.
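
The construction itself is easy to simulate on a small cube, and the simulation makes concrete the structural fact underlying the theorem: just before first hitting $v$, the walk sits at a neighbor $w$ with $H(w)<H(v)$, so $H$ has no local minima except at $U$. A sketch:

```python
import random

def hitting_time_objective(d, seed=0):
    """Walk on the d-cube from a uniform start U until every vertex is
    visited; return (H, U) where H[v] is the first hitting time of v."""
    rng = random.Random(seed)
    n = 1 << d
    H = [None] * n
    v = rng.randrange(n)                 # uniform random start U
    U, t, H[v], unvisited = v, 0, 0, n - 1
    while unvisited:
        v ^= 1 << rng.randrange(d)       # flip a uniform random coordinate
        t += 1
        if H[v] is None:
            H[v], unvisited = t, unvisited - 1
    return H, U

d = 8
H, U = hitting_time_objective(d)
assert min(range(1 << d), key=H.__getitem__) == U    # global minimum at U
for v in range(1 << d):                  # every v != U has a lower neighbor
    assert v == U or any(H[v ^ (1 << i)] < H[v] for i in range(d))
print("H has a unique local minimum, at U =", U)
```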

10.4.3 Embedding trees into the dd-cube

Consider again the $d$-cube $I=\{0,1\}^d$ with Hamming distance $d({\bf i},{\bf j})$. Let ${\cal B}$ be the vertices of an $M$-vertex binary tree. For an embedding, i.e. an arbitrary function $\rho:{\cal B}\rightarrow I$, define

$$\mbox{load} = \max_{{\bf i}\in I}\ |\{v\in{\cal B} : \rho(v)={\bf i}\}|$$
$$\mbox{dilation} = \max_{\mbox{edges $(v,w)$ of ${\cal B}$}}\ d(\rho(v),\rho(w)).$$

How can we choose an embedding which makes both load and dilation small? This was studied by Bhatt and Cai [47] as a toy model for parallel computation. In the model, $I$ represents the set of processors and ${\cal B}$ represents the set of tasks being done at a particular time, the tree structure indicating tasks being split into sub-tasks. To assign tasks to processors, we desire no one processor to have many tasks (small load), and we desire processors working on tasks and their sub-tasks to be close (small dilation), to facilitate communication. As the computation proceeds the tree will undergo local changes, as tasks are completed and new tasks started and split into sub-tasks, and we desire to be able to update the embedding "locally" in response to local changes in the tree. Bhatt and Cai [47] investigated the natural random walk embedding, where the root of ${\cal B}$ is embedded at ${\bf 0}$, and recursively each child $w$ of $v$ is embedded at the vertex $\rho(w)$ found at step $L$ (for even $L$) of a random walk started at $\rho(v)$. So by construction, dilation $\leq L$, and the mathematical issue is to estimate load. As before, the details are technically complicated, but let us outline one calculation. Clearly load $= \Omega(\max(1,\ M/2^d))$, so we would like the mean number of vertices of ${\cal B}$ embedded at any particular vertex ${\bf i}$ to be $O(\max(1,\ M/2^d))$. In bounding this mean, because $p_{{\bf 0}{\bf i}}(t)\leq p_{\bf 00}(t)$ for even $t$ (Chapter 7 Corollary 3) (yyy 1/31/94 version) we see that the worst-case ${\bf i}$ is ${\bf 0}$, and then because $p_{\bf 00}(t)$ is decreasing in $t$ we see that the worst-case $M$-vertex binary tree is a maximally balanced tree. Thus we want

$$\sum_{k=0}^{\log_2 M} 2^k p_{\bf 00}(kL) = O(\max(1,\ M/2^d)). \qquad (10.14)$$

From the analysis of random walk on the $d$-cube (Chapter 5 Example 15) (yyy 4/23/96 version) one can show

$$p_{\bf 00}(k\log d) = O(\max(d^{-k},\ 2^{-d})), \mbox{ uniformly in } k, d\geq 1.$$

It follows that (10.14) holds if we take $L=\lceil\log d\rceil$.
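
A sketch of the random walk embedding for a maximally balanced tree with $M = 2^d$ vertices, reporting the resulting load (the heap indexing of the tree is our own bookkeeping choice):

```python
import math
import random
from collections import Counter

def embed_tree(d, M, L, seed=0):
    """Random walk embedding of a balanced binary tree into the d-cube:
    the root maps to 0, and each child maps to the endpoint of an
    L-step walk from its parent's image.  Dilation <= L by construction;
    returns the load (tree vertices per cube vertex)."""
    rng = random.Random(seed)
    place = [0] * M                      # tree vertex k has parent (k-1)//2
    load = Counter({0: 1})
    for k in range(1, M):
        x = place[(k - 1) // 2]
        for _ in range(L):
            x ^= 1 << rng.randrange(d)   # one step of the cube walk
        place[k] = x
        load[x] += 1
    return load

d = 10
L = 2 * math.ceil(math.log(d) / 2)       # even, roughly ceil(log d)
load = embed_tree(d, 1 << d, L)
print("dilation <=", L, " max load:", max(load.values()))
```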

Of course to bound the load we need to consider the maximally-loaded vertex, rather than a typical vertex. Considering $M=2^d$ for definiteness, if the $M$ vertices were assigned independently and uniformly, the mean load at a typical vertex would be $1$ and classical arguments show the maximal load would be $\Theta(\frac{d}{\log d})$. We have shown that with tree-embedding the mean load at a typical vertex is $O(1)$, so analogously one can show the maximal load is $O(d/\log d)$. However, [47] shows that by locally redistributing tasks assigned to the same processor, one can reduce the maximal load to $O(1)$ while maintaining the dilation at $O(\log d)$.

10.4.4 Comparing on-line and off-line algorithms

Here we describe work of Coppersmith et al [99]. As in Chapter 3 section 2 (yyy 9/2/94 version) consider a weighted graph on $n$ vertices, but now write the edge-weights as $c_{ij}=c_{ji}>0$ and regard them as a matrix ${\bf C}$ of costs. Let ${\bf P}$ be the transition matrix of an irreducible Markov chain whose only transitions are along edges of the graph. For each $i$ and $j$ let $m(i,j)$ be the mean cost of the random walk from $i$ to $j$, when traversing an edge $(v,w)$ incurs cost $c_{vw}$. Define the stretch $s({\bf P},{\bf C})$ to be the smallest $s$ such that there exists $a<\infty$ such that, for arbitrary $v_0,v_1,\ldots,v_k$,

$$\sum_{i=0}^{k-1} m(v_i,v_{i+1}) \leq a + s\sum_{i=0}^{k-1} c_{v_i v_{i+1}}. \qquad (10.15)$$

Note that $s({\bf P},{\bf C})$ is invariant under scaling of ${\bf C}$.
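
To make the definition concrete, here is a small sketch (our own illustration, not from [99]): it computes the mean costs $m(i,j)$ by solving one linear system per target state, then evaluates the cycle ratio from the equivalent form (10.18) below for a few random cycles; the chain $p_{vw}\propto 1/c_{vw}$ is an arbitrary choice.

```python
import numpy as np

def mean_costs(P, C):
    """m[i, j] = mean cost of the P-walk from i to j, where traversing
    (v, w) costs C[v, w]:  m(i,j) = sum_k P[i,k] (C[i,k] + m(k,j))."""
    n = len(P)
    g = (P * C).sum(axis=1)              # mean cost of one step from each state
    m = np.zeros((n, n))
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        A = np.eye(n - 1) - P[np.ix_(keep, keep)]
        m[keep, j] = np.linalg.solve(A, g[keep])
    return m

rng = np.random.default_rng(1)
n = 5
C = rng.uniform(0.5, 2.0, (n, n)); C = (C + C.T) / 2   # symmetric costs
W = 1.0 / C
np.fill_diagonal(W, 0.0)                 # no self-loops
P = W / W.sum(axis=1, keepdims=True)     # a reversible chain on the graph
m = mean_costs(P, C)
for _ in range(3):                       # each cycle ratio is <= s(P, C)
    cyc = rng.permutation(n)
    num = sum(m[cyc[i], cyc[(i + 1) % n]] for i in range(n))
    den = sum(C[cyc[i], cyc[(i + 1) % n]] for i in range(n))
    print(round(num / den, 3))
```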

Proposition 10.10 ([99])

(a) $s({\bf P},{\bf C})\geq n-1$.

(b) If ${\bf P}$ is reversible and ${\bf C}$ is the matrix of mean commute times $E_iT_j + E_jT_i$ then $s({\bf P},{\bf C}) = n-1$.

(c) For any cost matrix $\tilde{\bf C}$ there exists a reversible transition matrix ${\bf P}$ with matrix ${\bf C}$ of mean commute times such that, for some constant $\alpha$,

$$c_{ij} \leq \alpha\tilde c_{ij}$$
$$c_{ij} = \alpha\tilde c_{ij} \mbox{ when } p_{ij}>0.$$

So from (b) and invariance under scaling, $s({\bf P},\tilde{\bf C}) = n-1$.

We shall prove (a) and (b), which are just variations of the standard theory of mean hitting times developed in Chapters 2 and 3. The proof of part (c) involves "convex programming" and is rather outside our scope. The algorithmic interpretations are also rather too lengthy to give in detail, but are easy to say in outline. Imagine a problem where it is required to pick a minimum-cost path, where the cost of a path consists of costs of traversing edges, together with extra costs and constraints. There is some optimal off-line solution, which may be hard to calculate. In such a problem, one may be able to use Proposition 10.10(c) to show that the algorithm which simply picks a random sample path (with transition matrix ${\bf P}$ from (c)) has mean cost not more than $n-1$ times the cost of the optimal path.

Proof. Write $\pi$ for the stationary distribution of ${\bf P}$. Write $m^+(v,v)$ for the mean cost of an excursion from $v$ back to $v$, and write $\bar c = \sum_v\sum_w \pi_v p_{vw} c_{vw}$. Then $m^+(v,v) = \bar c/\pi_v$ by the ergodic argument (Chapter 2 Lemma 30) (yyy 8/18/94 version), and so

$$n\bar c = \sum_v \pi_v m^+(v,v) = \sum_v \pi_v \sum_w p_{vw}\left(c_{vw} + m(w,v)\right) = \bar c + \sum_v\sum_w \pi_v p_{vw}\, m(w,v).$$

In other words,

$$\sum_v\sum_w \pi_w p_{wv}\, m(v,w) = (n-1)\bar c. \qquad (10.16)$$

Now apply the definition (10.15) of $s({\bf P},{\bf C})$ to the sequence of states visited by the stationary time-reversed chain ${\bf P}^*$; by considering the mean of each step,

$$\sum_v\sum_w \pi_v p^*_{vw}\, m(v,w) \leq s({\bf P},{\bf C}) \sum_v\sum_w \pi_v p^*_{vw} c_{vw}. \qquad (10.17)$$

But the left sides of (10.16) and (10.17) are equal by definition of ${\bf P}^*$, and the sum on the right of (10.17) equals $\bar c$ by symmetry of ${\bf C}$, establishing (a). For (b), first note that the definition (10.15) of stretch is equivalent to

$$s({\bf P},{\bf C}) = \max_\sigma \frac{\sum_i m(v_i,v_{i+1})}{\sum_i c_{v_i,v_{i+1}}} \qquad (10.18)$$

where $\sigma$ denotes a cycle $(v_1,v_2,\ldots,v_m,v_1)$. Write $t(v,w) = E_vT_w$. Fix a cycle $\sigma$ and write $\mu = \sum_i t(v_i,v_{i+1})$ for the mean time to complete the cyclic tour. By the ergodic argument (Chapter 2 Lemma 30) (yyy 8/18/94 version), the mean number of traversals of an edge $(v,w)$ during the tour is $\mu\pi_v p_{vw}$, and hence the ratio in (10.18) can be written as

$$\frac{\sum_i t(v_i,v_{i+1})}{\sum_i c_{v_i,v_{i+1}}} \times \sum_v\sum_w \pi_v p_{vw} c_{vw}. \qquad (10.19)$$

Now the hypothesis of (b) is that ${\bf P}$ is reversible and $c_{vw} = t(v,w) + t(w,v)$. So the second term of (10.19) equals $2(n-1)$ by Chapter 3 Lemma 6 (yyy 9/2/94 version), and the first term equals $1/2$ by the cyclic tour property, Chapter 3 Lemma 1 (yyy 9/2/94 version). So for each cycle $\sigma$ the ratio in (10.18) equals $n-1$, establishing (b).
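
Part (b) is easy to check numerically: take any reversible chain, let ${\bf C}$ be its commute-time matrix, and every cycle ratio in (10.18) should come out as exactly $n-1$. A self-contained sketch (the weighted graph is arbitrary):

```python
import numpy as np

def solve_costs(P, g):
    """t[i, j] = E_i (accumulated g-cost until first hitting j), where
    g[i] is the mean cost of one step from i:  t = g + P t off the target."""
    n = len(P)
    t = np.zeros((n, n))
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        A = np.eye(n - 1) - P[np.ix_(keep, keep)]
        t[keep, j] = np.linalg.solve(A, g[keep])
    return t

rng = np.random.default_rng(2)
n = 6
W = rng.uniform(0.5, 2.0, (n, n)); W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)       # reversible chain

T = solve_costs(P, np.ones(n))             # mean hitting times E_i T_j
C = T + T.T                                # commute-time costs
m = solve_costs(P, (P * C).sum(axis=1))    # mean C-costs m(i, j)

for _ in range(3):                         # each cycle ratio equals n-1 = 5
    cyc = rng.permutation(n)
    num = sum(m[cyc[i], cyc[(i + 1) % n]] for i in range(n))
    den = sum(C[cyc[i], cyc[(i + 1) % n]] for i in range(n))
    print(round(num / den, 6))
```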