In practice, Monte Carlo simulations are done using deterministic pseudo-random number generators. Ideally one would prefer some physical device which generated “truly random” bits. Presumably any such physical random number generator would be rather slow compared to the speed of arithmetical calculations. This thinking has led to an area of theory in which the cost of a randomized algorithm is taken to be the number of truly random bits used.
Recall the Solovay-Strassen test of primality in Example 10.8. Philosophically, there is something unsettling about using a deterministic pseudo-random number generator in this context, so we regard this as a prototype example where one might want to use a hypothetical source of truly random bits. To pick a uniform random integer from $\{1,\ldots,n-1\}$ requires about $\log_2 n$ random bits, so the cost of the algorithm as presented above is about $\log_2 n \, \log_2(1/\delta)$ bits, where $\delta$ is the prescribed allowable error probability. But one can use the existence of explicit expanders and results like Lemma 10.12 to devise an algorithm which requires fewer truly random bits. Suppose we have an $(n-1)$-vertex $r$-regular expander, and label the vertices $1,\ldots,n-1$. To simulate a uniform random starting vertex and $k$ steps of the random walk requires about $\log_2 n + k\log_2 r$ bits. The chance that such a walk never hits the set $W$ of witnesses is, by Lemma 10.12, exponentially small in $k$. To make this chance at most $\delta$ we take $k = O(\log(1/\delta))$, and the cost becomes $\log_2 n + O(\log(1/\delta))$ bits. Thus, granted the existence of expanders on which we can efficiently list the neighbors of any specified vertex in order to simulate the random walk, the method of simulating (dependent) integers via the random walk (instead of independently) reduces the number of truly random bits required from $O(\log n \, \log(1/\delta))$ to $O(\log n + \log(1/\delta))$.
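To make the bit-accounting concrete, here is a small simulation sketch of the walk-based scheme. Everything specific in it is an assumption for illustration: the 4-regular graph built from two random permutations is only a stand-in for an explicit expander, and the "witness" set W is just a random half of the vertices.

```python
import math
import random

random.seed(1)
n, k, trials = 512, 10, 20000

# Stand-in "expander": a random 4-regular multigraph built from two random
# permutations (each contributes an out-edge and its inverse edge).  A real
# application would use an explicit expander family.
perms = [random.sample(range(n), n) for _ in range(2)]
inv = [[0] * n for _ in range(2)]
for t in range(2):
    for i, j in enumerate(perms[t]):
        inv[t][j] = i

def neighbors(v):
    return [perms[0][v], inv[0][v], perms[1][v], inv[1][v]]

# "Witnesses": an arbitrary half of the vertices (in the primality test, at
# least half the candidates are witnesses; here we simply pick a random half).
W = set(random.sample(range(n), n // 2))

misses = 0
for _ in range(trials):
    v = random.randrange(n)               # uniform random starting vertex
    hit = v in W
    for _ in range(k):
        v = random.choice(neighbors(v))   # one step of the walk: 2 random bits
        hit = hit or (v in W)
    misses += not hit
print("P(walk avoids W) ~", misses / trials)
print("random bits per trial: walk", round(math.log2(n)) + 2 * k,
      " independent", (k + 1) * round(math.log2(n)))
```

With $k+1$ independent uniform samples the avoidance probability would be $2^{-(k+1)}$; the walk achieves a comparably small failure probability while consuming far fewer bits (29 versus 99 with these parameters).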
The idea of using random walks on expanders for such algorithmic purposes is due to Ajtai et al [4]. Following Impagliazzo and Zuckerman [188], one can abstract the idea to rather general randomized algorithms. Suppose we are given a randomized algorithm, intended to decide whether an object $x$ has a property by outputting "Yes" or "No", and that for each $x$ the algorithm is correct with probability at least $1-\theta$, for some fixed $\theta < 1/2$, and uses at most $b$ random bits. Formally, the algorithm is a function $A: \{0,1\}^b \times \mathcal{X} \to \{\mbox{Yes},\mbox{No}\}$ such that
$$P\{A(\xi,x) \mbox{ is correct}\} \ge 1 - \theta, \qquad x \in \mathcal{X},$$
where $\xi$ is uniform on $\{0,1\}^b$ and the correct answer is "Yes" iff $x$ lies in the subset $\mathcal{X}_0$ of all objects possessing the property. To make the probability of incorrect classification be at most $\delta$ we may simply repeat the algorithm $m$ times with independent bits, where $m$ is chosen to make
$$P\{\mbox{Binomial}(m,\theta) \ge m/2\} \le \delta,$$
and output Yes or No according to the majority of the individual outputs; this forces $m = O(\log(1/\delta))$. This requires $mb$ random bits. But instead we may take $\{0,1\}^b$ as the vertices of a degree-$r$ expander, and simulate a uniform random starting vertex and $m$ steps of random walk on the expander, using about $b + m\log_2 r$ random bits. For each of the $m+1$ vertices $\xi_0,\ldots,\xi_m$ visited by the walk, compute $A(\xi_i,x)$, and output Yes or No according to the majority of the individual outputs. The error probability is at most
$$P\{N_m \ge (m+1)/2\}$$
where $N_m$ is the number of visits to the "bad" set $B_x = \{\xi : A(\xi,x) \mbox{ is incorrect}\}$ by the walk $(\xi_i)$; note that the stationary (uniform) measure of $B_x$ is at most $\theta < 1/2$. By the large deviation bound for occupation measures (Theorem 10.11, yyy to be moved to other chapter) this error probability is at most
$$K e^{-cm}$$
for constants $K$ and $c > 0$ depending on $\theta$ and the expander. To reduce this below $\delta$ requires $m = O(\log(1/\delta))$. Thus the existence of (bounded-degree) expanders implies that the number of random bits required is only
$$b + O(\log(1/\delta)),$$
compared to $O(b\log(1/\delta))$ using independent sampling.
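A small simulation sketch of this majority-vote scheme follows. All specifics here are assumptions for illustration: the random-permutation graph stands in for an explicit expander, and the base algorithm's error probability is taken to be theta = 0.1, encoded as a random bad set B of that density.

```python
import random
from math import comb

random.seed(4)
b, m, trials = 9, 6, 50000
n = 1 << b            # the vertex set stands for {0,1}^b
theta = 0.1           # base algorithm's error probability (assumed)

# Stand-in degree-4 "expander" from two random permutations; a real
# application would use an explicit expander construction.
perms = [random.sample(range(n), n) for _ in range(2)]
inv = [[0] * n for _ in range(2)]
for t in range(2):
    for i, j in enumerate(perms[t]):
        inv[t][j] = i

# Bad set B_x: the seeds on which the base algorithm answers incorrectly.
B = set(random.sample(range(n), int(theta * n)))

errors = 0
for _ in range(trials):
    v = random.randrange(n)
    bad = int(v in B)
    for _ in range(m):
        v = random.choice([perms[0][v], inv[0][v], perms[1][v], inv[1][v]])
        bad += v in B
    errors += 2 * bad > m + 1        # majority of the m+1 outputs is wrong
print("walk majority error ~", errors / trials)

# Independent repetition for comparison: P(Binomial(m+1, theta) > (m+1)/2).
indep = sum(comb(m + 1, j) * theta ** j * (1 - theta) ** (m + 1 - j)
            for j in range((m + 1) // 2 + 1, m + 2))
print("independent majority error:", round(indep, 6))
print("random bits: walk", b + 2 * m, " independent", (m + 1) * b)
```

Both error probabilities are comparably small, but the walk uses 21 truly random bits against 63 for independent repetition; for fixed theta the gap grows like $b\log(1/\delta)$ versus $b + \log(1/\delta)$.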
In Chapter 6 section 8.2 (yyy currently at end of this Chapter; to be moved) we gave a standard use of the probabilistic method. Here is a less standard use, from Aldous [13], in which we use the sample path of a random walk to make a construction.
Consider a function $f$ defined on the vertices of an $n$-vertex graph $G$. Constrain $f$ to have no local minima except the global minimum (for simplicity, suppose the values of $f$ are distinct). We seek algorithms to find the vertex at which $f$ is minimized. Any deterministic "descent" algorithm will work, but it might work slowly. Could there be some more sophisticated algorithm which always works quickly? One idea is multi-start descent: pick $m$ vertices uniformly at random; from these, choose the vertex with minimum $f$-value, and follow the greedy descent algorithm. On a degree-$r$ graph the mean time is $O(m + rn/m)$, since the best of the $m$ starting values has mean rank about $n/m$ and each descent step, costing $r$ evaluations, strictly lowers the rank. Now specialize to the case where $G$ is the $d$-cube, so that $n = 2^d$ and $r = d$. One can give examples where single-start descent (from a uniform random start) has mean time of strictly larger exponential order, so from a worst-case mean-time viewpoint, multi-start is better. The next theorem shows that (again from a worst-case mean-time viewpoint) one cannot essentially improve on multi-start descent. Consider random walk $(X_t)$ on the $d$-cube started at a uniform random vertex $X_0$, and let $H(v)$ be the first hitting time on $v$. Then $f = H$ is a random function satisfying the constraint, minimized at $X_0$, but
Every algorithm for locating the minimizing vertex $X_0$ by examining values of $f$ requires examining a mean number $2^{(1/2 - o(1))d}$ of vertices.
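For small $d$ the random function $f = H$ in the theorem can be generated and checked directly. The sketch below (with $d = 8$, an arbitrary small choice) simulates the walk until the cube is covered, verifies that $H$ has no local minima other than the starting vertex, and runs greedy descent; it illustrates only the construction, not the lower-bound argument.

```python
import random

random.seed(0)
d = 8                 # small cube so the walk covers quickly (assumed choice)
n = 1 << d

# Run the walk from a uniform start until every vertex is hit;
# f(v) = H(v), the first hitting time on v.
v0 = random.randrange(n)
f = {v0: 0}
v, t = v0, 0
while len(f) < n:
    t += 1
    v ^= 1 << random.randrange(d)     # flip a uniform random coordinate
    if v not in f:
        f[v] = t

# Constraint check: every vertex except the start has a neighbor with a
# strictly smaller f-value (the walk arrived from such a neighbor), so
# greedy descent never gets stuck at a local minimum.
def has_better_neighbor(u):
    return any(f[u ^ (1 << i)] < f[u] for i in range(d))

assert all(has_better_neighbor(u) for u in range(n) if u != v0)

# Greedy descent from a uniform random start always reaches the minimum v0.
u = random.randrange(n)
steps = 0
while has_better_neighbor(u):
    u = min((u ^ (1 << i) for i in range(d)), key=f.get)
    steps += 1
u_final = u
print("descent reached the global minimum in", steps, "steps")
```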
The argument is simple in outline. As a preliminary calculation, consider random walk on the -cube of length , started at , and let be the time of the last visit to , with if is not visited. Then
(10.13)
where the bound holds because the worst-case for the sum is and, switching to continuous time,
Now consider an algorithm which has evaluated and write say. It does no harm to suppose . Conditional on the information revealed by , the distribution of the walk is specified by:
(a) take a random walk from a uniform random start , and condition on ;
(b) condition further on the walk not hitting before time .
The key point, which of course is technically hard to deal with, is that the conditioning in (b) has little effect. If we ignore the conditioning in (b), then by reversing time we see that the random variables have the same distribution as the random variables (up to vertex relabeling). So whatever vertex the algorithm chooses to evaluate next, inequality (10.13) shows that the mean improvement in objective value is , and so it takes steps to reduce the objective value from to .
Consider again the $d$-cube $I = \{0,1\}^d$ with Hamming distance $d(\cdot,\cdot)$. Let $T$ be the vertex-set of an $M$-vertex binary tree. For an embedding, i.e. an arbitrary function $\phi: T \to I$, define
$$\mbox{load}(\phi) = \max_{i \in I} |\phi^{-1}(i)|, \qquad \mbox{dilation}(\phi) = \max\{d(\phi(t_1),\phi(t_2)) : (t_1,t_2) \mbox{ an edge of } T\}.$$
How can we choose an embedding which makes both load and dilation small? This was studied by Bhatt and Cai [47] as a toy model for parallel computation. In the model, the cube represents the set of processors and the tree represents the set of tasks being done at a particular time, the tree structure indicating tasks being split into sub-tasks. To assign tasks to processors, we desire no one processor to have many tasks (small load), and we desire processors working on tasks and their sub-tasks to be close (small dilation) to facilitate communication. As the computation proceeds the tree will undergo local changes, as tasks are completed and new tasks started and split into sub-tasks, and we desire to be able to update the embedding "locally" in response to local changes in the tree. Bhatt and Cai [47] investigated the natural random walk embedding, where the root of $T$ is embedded at a fixed vertex, and recursively each child $t'$ of a tree-vertex $t$ is embedded at the vertex found at step $k_0$ (for some fixed even $k_0$) of a random walk started at $\phi(t)$. So by construction dilation$(\phi) \le k_0$, and the mathematical issue is to estimate the load. As before, the details are technically complicated, but let us outline one calculation. Clearly the load is at least $M2^{-d}$, so we would like the mean number of vertices of $T$ embedded at any particular vertex of the cube to be $O(M2^{-d})$. In bounding this mean, because $p^{(k)}_{vw} \le p^{(k)}_{vv}$ for even $k$ (Chapter 7 Corollary 3) (yyy 1/31/94 version) we see that the worst case is $w = v$, and then because $p^{(k)}_{vv}$ is decreasing in even $k$ we see that the worst-case $M$-vertex binary tree is a maximally balanced tree. Thus we want
(10.14)
From the analysis of random walk on the -cube (Chapter 5 Example 15) (yyy 4/23/96 version) one can show
It follows that (10.14) holds if we take .
Of course to bound the load we need to consider the maximally-loaded vertex, rather than a typical vertex. Considering $M = 2^d$ for definiteness, if the $M$ tree-vertices were assigned independently and uniformly, the mean load at a typical vertex would be 1, and classical "balls in bins" arguments show the maximal load would be $\Theta(d/\log d)$. We have shown that with the tree-embedding the mean load at a typical vertex is $O(1)$, so analogously one can show the maximal load is $O(d/\log d)$. However, [47] shows that by locally redistributing tasks assigned to the same processor, one can reduce the maximal load to $O(1)$ while maintaining the dilation at $O(1)$.
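A quick simulation of the random walk embedding makes the definitions concrete. The parameters here are assumptions for illustration: a complete binary tree on 255 nodes embedded in the 8-cube, with walk length $k_0 = 2$ per tree edge. The dilation bound holds by construction, and the empirical load can be inspected.

```python
import random
from collections import Counter

random.seed(2)
d = 8
n = 1 << d
k0 = 2                    # even walk length per tree edge (assumed parameter)
tree_size = n - 1         # complete binary tree on 255 nodes, heap-indexed

pos = {0: random.randrange(n)}          # embed the root at a random vertex
for node in range(1, tree_size):
    parent = (node - 1) // 2
    v = pos[parent]
    for _ in range(k0):                 # k0-step walk from the parent's image
        v ^= 1 << random.randrange(d)   # flip a uniform random coordinate
    pos[node] = v

load = Counter(pos.values())
dilation = max(bin(pos[(node - 1) // 2] ^ pos[node]).count("1")
               for node in range(1, tree_size))
print("mean load:", tree_size / n)
print("max load:", max(load.values()))
print("dilation:", dilation)
```

Dilation is at most $k_0$ because a child's image differs from its parent's in at most $k_0$ coordinates; the interesting quantity is how far the maximal load exceeds the mean load of about 1.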
Here we describe work of Coppersmith et al [99]. As in Chapter 3 section 2 (yyy 9/2/94 version) consider a weighted graph on $n$ vertices, but now write the edge-weights as $(c_{ij})$ and regard them as a matrix of costs. Let $P = (p_{ij})$ be the transition matrix of an irreducible Markov chain whose only transitions are along edges of the graph. For each $i$ and $j$ let $\bar{c}_{ij}$ be the mean cost of the random walk from $i$ to $j$, when traversing an edge $(v,w)$ incurs cost $c_{vw}$. Define the stretch $\sigma(P,c)$ to be the smallest $\sigma$ such that there exists a constant $K$ such that, for arbitrary vertex sequences $j_0, j_1, \ldots, j_m$,
(10.15)
Note that the stretch is invariant under scaling of the cost matrix $c$.
(a) .
(b) If is reversible and is the matrix of mean commute times then .
(c) For any cost matrix there exists a reversible transition matrix with matrix of mean commute times such that, for some constant ,
So from (b) and invariance under scaling, .
We shall prove (a) and (b), which are just variations of the standard theory of mean hitting times developed in Chapters 2 and 3. The proof of part (c) involves “convex programming” and is rather outside our scope. The algorithmic interpretations are also rather too lengthy to give in detail, but are easy to say in outline. Imagine a problem where it is required to pick a minimum-cost path, where the cost of a path consists of costs of traversing edges, together with extra costs and constraints. There is some optimal off-line solution, which may be hard to calculate. In such a problem, one may be able to use Proposition 10.10(c) to show that the algorithm which simply picks a random sample path (with transition matrix from (c)) has mean cost not more than times the cost of the optimal path.
Proof. Write $\pi$ for the stationary distribution of $P$. Write for the mean cost of an excursion from to , and write . Then by the ergodic argument (Chapter 2 Lemma 30) (yyy 8/18/94 version), and so
In other words,
(10.16)
Now apply the definition (10.15) of the stretch to the sequence of states visited by the stationary time-reversed chain; by considering the mean cost of each step,
(10.17)
But the left sides of (10.16) and (10.17) are equal by definition of , and the sum in the right of (10.17) equals by symmetry of , establishing (a). For (b), first note that the definition (10.15) of stretch is equivalent to
(10.18)
where denotes a cycle . Write . Fix a cycle and write for the mean time to complete the cyclic tour. By the ergodic argument (Chapter 2 Lemma 30) (yyy 8/18/94 version), the mean number of traversals of an edge during the tour is , and hence the ratio in (10.18) can be written as
(10.19)
Now the hypothesis of (b) is that $P$ is reversible and . So the second term of (10.19) equals by Chapter 3 Lemma 6 (yyy 9/2/94 version), and the first term equals by the cyclic tour property (Chapter 3 Lemma 1) (yyy 9/2/94 version). So for each cycle the ratio in (10.18) equals , establishing (b).
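The cyclic tour property invoked in the last step is easy to verify numerically. The sketch below builds a small reversible chain (random walk on a complete graph with symmetric random edge-weights, an arbitrary choice), computes mean hitting times by solving the standard linear systems, and checks that the mean tour time around a cycle is the same in both directions.

```python
import random

def solve(A, rhs):
    """Tiny Gauss-Jordan elimination with partial pivoting (small systems)."""
    m = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][m] / M[i][i] for i in range(m)]

random.seed(3)
n = 5
# Symmetric edge-weights w give a reversible chain P[i][j] = w[i][j] / w_i.
w = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = random.uniform(0.5, 2.0)
P = [[w[i][j] / sum(w[i]) for j in range(n)] for i in range(n)]

def hitting_times(target):
    """Mean hitting times E_i T_target via (I - Q) h = 1, Q = P restricted."""
    idx = [i for i in range(n) if i != target]
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in idx] for i in idx]
    h = solve(A, [1.0] * (n - 1))
    out = [0.0] * n
    for k, i in enumerate(idx):
        out[i] = h[k]
    return out

ET = [hitting_times(j) for j in range(n)]   # ET[j][i] = E_i T_j

cycle = [0, 1, 2, 3, 0]
forward = sum(ET[cycle[t + 1]][cycle[t]] for t in range(len(cycle) - 1))
backward = sum(ET[cycle[t]][cycle[t + 1]] for t in range(len(cycle) - 1))
print("forward tour:", round(forward, 6), " backward tour:", round(backward, 6))
assert abs(forward - backward) < 1e-8
```

For a non-reversible chain the two tour sums generally differ; reversibility is what makes the tour time direction-independent.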