**Ed Ionides**

**24th May**

Rainer sketched a proof of the Karlin-Altschul theorem (stated here in (1)) in class, citing Cramér's theorem, Erdos-Renyi theorem, generating function theory etc etc. Recently I have been looking at David Aldous' book Probability Approximations via the Poisson Clumping Heuristic. This method leads to a heuristic proof of the theorem. It's not guaranteed error-free, but hopefully the idea is right.

Suppose we have two protein sequences and
with and in . We
are interested in whether there is a local alignment between them. A
local alignment of length **l** starting at is a hypothesized
evolutionary correspondence between the subsequences
and . We ignore the
possibility of gaps, and have a null model that the sites are iid. We
are given a similarity function with ,
for , and under the null model. The
similarity score of the alignment is
. The total alignment
score is , and we would like to know its
tail distribution under the null model.

Introducing some more convenient and familiar looking notation, suppose is a sequence of iid random variables with . Let . Set . A one-dimensional analog of the above problem is to find and we do this by invoking the Poisson clumping heuristic (Aldous p. 5) which thinks of as being generated by a Poisson process of rate with associated iid clumps. Suppose the size of the clumps has the same distribution as . We can then say, heuristically,

The key to the Poisson heuristic is the heuristic identity

Now introduce such that . Then
, which comes from noticing that
is a martingale, so the chance
that ever hits is approximately , since otherwise it must eventually get absorbed at **0**. The
approximation would be exact if were continuous. This is basically the
argument on Aldous p. 70.

Now we need to work out the expected length of the clusters. The simplest estimate comes from analogy to the continuous case. If is large then so in reverse the process looks locally like a random walk with negative drift. Looking forwards is not so clear, because although we have a similar relationship , the distribution of conditional on being large is more complicated. For a Brownian motion started at 0 with drift and variance , calculation of the distribution of total time in the positive half line (Aldous p. 72) gives

This suggests that for and we
estimate the cluster size above **b** conditional on hitting **b** by . I can get a similar estimate for the clump size for
the discrete case based on the central limit theorem and a few more
approximations, but the calculations are quite long and I havn't worked out
when/if the approximations are valid so I'll spare you the details.

The sequence alignment problem is essentially identical to the one-dimensional case already considered. We notice that since gaps are unallowed, potential alignments can only overlap if they are on the same diagonal line, . Thus clump size and the rate of the underlying Poisson process are the same, and we get

where and are the quantities described above, putting with the same distribution as under the iid model.

A special case of this to data-base searching is when **a** is some
fixed protein sequence of interest and **b** is one of many proteins
pulled out of a database. (1) then can be used to give a
p-value for the observed alignment score. Although the iid model is
not all that appropriate, looking at the proteins from the database with the
smallest p-values for their alignment scores is a feasible way to
go searching for homologies. The gap-free assumption is a bit dodgy,
but is a common assumption used e.g. by BLAST to make the computations
sufficiently quick that searching through a whole protein database is
a matter of minutes. The advantage of comparing p-values rather than
alignment scores is that it helps control for the variable length of
**b** -- when people just looked at scores they used to get too many spurious
matches from long proteins.

The limit theorem given in Karlin and Dembo (1992) has the same form as our approximation (1), namely

with the same as above and

where . We can compare this with the
approximation in (1) in the example of Karlin and
Altschul (1990) where takes value 1 with probability **p**, 0 with
probability **r** and **-1** with probability **q** where **q>p**. Here
and Karlin and Altshul's
formula gets modified for the case that takes values on a lattice, giving
only the bounds

where the bounds come from by muliplying the value of in (2) by and with the lattice size here. Our estimate is

which is a little wide of the mark. For example if and then and .

- Aldous, D. (1989)
*Probability Approximations via the Poisson Clumping Heuristic*, Springer-Verlag. - Karlin, S., and Altschul, S. F. (1990) ``Method for assessing
the statistical significance of molecular sequence features by using
general scoring schemes,''
*Proceedings of the National Academy of Science, USA***87**, 2264-2268. - Karlin, S., and Dembo, A. (1992) ``Limit distributions of
maximal segmental score among markov-dependent partial sums,''
*Advances in Applied Probability***24**, 113-140.

Tue May 26 09:21:57 PDT 1998