Rainer sketched a proof of the Karlin-Altschul theorem (stated here in (1)) in class, citing Cramér's theorem, Erdos-Renyi theorem, generating function theory etc etc. Recently I have been looking at David Aldous' book Probability Approximations via the Poisson Clumping Heuristic. This method leads to a heuristic proof of the theorem. It's not guaranteed error-free, but hopefully the idea is right.
Suppose we have two protein sequences and with and in . We are interested in whether there is a local alignment between them. A local alignment of length l starting at is a hypothesized evolutionary correspondence between the subsequences and . We ignore the possibility of gaps, and have a null model that the sites are iid. We are given a similarity function with , for , and under the null model. The similarity score of the alignment is . The total alignment score is , and we would like to know its tail distribution under the null model.
Introducing some more convenient and familiar looking notation, suppose is a sequence of iid random variables with . Let . Set . A one-dimensional analog of the above problem is to find and we do this by invoking the Poisson clumping heuristic (Aldous p. 5) which thinks of as being generated by a Poisson process of rate with associated iid clumps. Suppose the size of the clumps has the same distribution as . We can then say, heuristically,
The key to the Poisson heuristic is the heuristic identity
Now introduce such that . Then , which comes from noticing that is a martingale, so the chance that ever hits is approximately , since otherwise it must eventually get absorbed at 0. The approximation would be exact if were continuous. This is basically the argument on Aldous p. 70.
Now we need to work out the expected length of the clusters. The simplest estimate comes from analogy to the continuous case. If is large then so in reverse the process looks locally like a random walk with negative drift. Looking forwards is not so clear, because although we have a similar relationship , the distribution of conditional on being large is more complicated. For a Brownian motion started at 0 with drift and variance , calculation of the distribution of total time in the positive half line (Aldous p. 72) gives
This suggests that for and we estimate the cluster size above b conditional on hitting b by . I can get a similar estimate for the clump size for the discrete case based on the central limit theorem and a few more approximations, but the calculations are quite long and I havn't worked out when/if the approximations are valid so I'll spare you the details.
The sequence alignment problem is essentially identical to the one-dimensional case already considered. We notice that since gaps are unallowed, potential alignments can only overlap if they are on the same diagonal line, . Thus clump size and the rate of the underlying Poisson process are the same, and we get
where and are the quantities described above, putting with the same distribution as under the iid model.
A special case of this to data-base searching is when a is some fixed protein sequence of interest and b is one of many proteins pulled out of a database. (1) then can be used to give a p-value for the observed alignment score. Although the iid model is not all that appropriate, looking at the proteins from the database with the smallest p-values for their alignment scores is a feasible way to go searching for homologies. The gap-free assumption is a bit dodgy, but is a common assumption used e.g. by BLAST to make the computations sufficiently quick that searching through a whole protein database is a matter of minutes. The advantage of comparing p-values rather than alignment scores is that it helps control for the variable length of b -- when people just looked at scores they used to get too many spurious matches from long proteins.
The limit theorem given in Karlin and Dembo (1992) has the same form as our approximation (1), namely
with the same as above and
where . We can compare this with the approximation in (1) in the example of Karlin and Altschul (1990) where takes value 1 with probability p, 0 with probability r and -1 with probability q where q>p. Here and Karlin and Altshul's formula gets modified for the case that takes values on a lattice, giving only the bounds
where the bounds come from by muliplying the value of in (2) by and with the lattice size here. Our estimate is
which is a little wide of the mark. For example if and then and .