Ed Ionides
24th May
Rainer sketched a proof of the Karlin-Altschul theorem (stated here in (1)) in class, citing Cramér's theorem, Erdos-Renyi theorem, generating function theory etc etc. Recently I have been looking at David Aldous' book Probability Approximations via the Poisson Clumping Heuristic. This method leads to a heuristic proof of the theorem. It's not guaranteed error-free, but hopefully the idea is right.
Suppose we have two protein sequences
and
with
and
in
. We
are interested in whether there is a local alignment between them. A
local alignment of length l starting at
is a hypothesized
evolutionary correspondence between the subsequences
and
. We ignore the
possibility of gaps, and have a null model that the sites are iid. We
are given a similarity function
with
,
for
, and
under the null model. The
similarity score of the alignment
is
. The total alignment
score is
, and we would like to know its
tail distribution under the null model.
Introducing some more convenient and familiar looking notation,
suppose
is a sequence of iid
random variables with
. Let
. Set
.
A one-dimensional analog of the above problem is to find
and we do this by invoking the Poisson
clumping heuristic (Aldous p. 5) which thinks of
as being
generated by a Poisson process of rate
with associated
iid clumps. Suppose the size of the clumps has the same distribution
as
. We can then say, heuristically,

The key to the Poisson heuristic is the heuristic identity

Now introduce
such that
. Then
, which comes from noticing that
is a martingale, so the chance
that
ever hits
is approximately
, since otherwise it must eventually get absorbed at 0. The
approximation would be exact if
were continuous. This is basically the
argument on Aldous p. 70.
Now we need to work out the expected length of the clusters. The
simplest estimate comes from analogy to the continuous case. If
is large then
so in reverse the process looks
locally like a random walk with negative drift. Looking forwards is
not so clear, because although we have a similar relationship
, the distribution of
conditional on
being large is more complicated. For a Brownian motion started at 0
with drift
and variance
, calculation of the
distribution of total time in the positive half line (Aldous p. 72)
gives

This suggests that for
and
we
estimate the cluster size above b conditional on hitting b by
. I can get a similar estimate for the clump size for
the discrete case based on the central limit theorem and a few more
approximations, but the calculations are quite long and I havn't worked out
when/if the approximations are valid so I'll spare you the details.
The sequence alignment problem is essentially identical to the
one-dimensional case already considered. We notice that since gaps are
unallowed, potential alignments can only overlap if they are on the
same diagonal line,
. Thus clump size
and the rate of the underlying Poisson process are the same, and we
get
where
and
are the quantities described above, putting
with the same distribution as
under the iid model.
A special case of this to data-base searching is when a is some fixed protein sequence of interest and b is one of many proteins pulled out of a database. (1) then can be used to give a p-value for the observed alignment score. Although the iid model is not all that appropriate, looking at the proteins from the database with the smallest p-values for their alignment scores is a feasible way to go searching for homologies. The gap-free assumption is a bit dodgy, but is a common assumption used e.g. by BLAST to make the computations sufficiently quick that searching through a whole protein database is a matter of minutes. The advantage of comparing p-values rather than alignment scores is that it helps control for the variable length of b -- when people just looked at scores they used to get too many spurious matches from long proteins.
The limit theorem given in Karlin and Dembo (1992) has the same form as our approximation (1), namely

with
the same as above and
where
. We can compare this with the
approximation in (1) in the example of Karlin and
Altschul (1990) where
takes value 1 with probability p, 0 with
probability r and -1 with probability q where q>p. Here
and Karlin and Altshul's
formula gets modified for the case that
takes values on a lattice, giving
only the bounds

where the bounds come from by muliplying the value of
in
(2) by
and
with the lattice size
here. Our estimate is

which is a little wide of the mark. For example if
and
then
and
.