As mentioned before, splice sites are boundaries between exons and introns. There
are two varieties: the border going from exon to intron is called a donor site or
a 5' site, and the border going from intron to exon is called an acceptor
site or a 3' site. We need to identify these as part of the gene-finding process.
By looking at known splice sites, we have data available to us such as that in Table 3
(taken from [8]).
Table 3: Base counts in positions using a large dataset of known donor sites. The last
base in the exon is position -1, so position 0 corresponds to the first base of the intron.
From this table it is clear, for example, that the first two bases in
an intron are almost always GT, and the last base of the exon is often G. We can make use
of such data to score how likely a given region is to contain a donor site. Say
we have a candidate sequence. Then we need to calculate the probability of that
sequence given that it is a donor site.
This must be calculated relative to the background to determine what is big and what is
small. For example, the probability of a given sequence might be small given a
splice site, but it might also be small given the background.
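One common way to make "relative to the background" concrete is a log-odds score. The sketch below is an illustration rather than the formula from the original notes. It assumes that positions contribute independently. The symbols are introduced here: s_1 ... s_k is the candidate window, f_i(b) is the frequency of base b at position i estimated from counts like those in Table 3, and q(b) is the background frequency of base b.

```latex
\[
\log \frac{P(s_1 \dots s_k \mid \text{donor})}{P(s_1 \dots s_k \mid \text{background})}
  = \sum_{i=1}^{k} \log \frac{f_i(s_i)}{q(s_i)}
\]
```

Under these assumptions, a positive score means the window is better explained by the donor-site model than by the background, and a negative score means the opposite.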
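So to look for a donor site in a longer sequence, we might calculate, for the window
starting at each position t, the ratio of its probability given a donor site to its
probability given the background, and plot this ratio (or its logarithm)
for varying t. A donor site should appear as a sharp peak in the plot.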
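The scan over t can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the original notes. The aligned example sites are made-up toy data (not the counts from Table 3), positions are treated as independent, the background is taken to be uniform, and the function names are hypothetical.

```python
import math

BASES = "ACGT"

def position_frequencies(sites, pseudocount=1.0):
    """Per-position base frequencies from aligned, equal-length example sites."""
    width = len(sites[0])
    freqs = []
    for i in range(width):
        # The pseudocount avoids zero frequencies (and log(0)) for unseen bases.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs

def log_odds(window, freqs, background):
    """Log of P(window | donor model) / P(window | background), positions independent."""
    return sum(math.log(freqs[i][b] / background[b]) for i, b in enumerate(window))

def scan(sequence, freqs, background):
    """Log-odds score for the window starting at each position t."""
    width = len(freqs)
    return [log_odds(sequence[t:t + width], freqs, background)
            for t in range(len(sequence) - width + 1)]

# Toy, made-up aligned windows (2 exon bases followed by 4 intron bases).
# These are NOT real data; they only mimic the "G then GT" pattern described above.
toy_sites = ["AGGTAA", "AGGTGA", "TGGTAA", "AGGTAT", "CGGTGA"]

# Assumption: a uniform background. In practice it would be estimated from
# sequence that includes non-splice sites.
uniform_background = {b: 0.25 for b in BASES}

freqs = position_frequencies(toy_sites)
sequence = "CCATCAGGTAAGTCCTC"
scores = scan(sequence, freqs, uniform_background)

# The peak of the score profile marks the best candidate window.
best_t = max(range(len(scores)), key=scores.__getitem__)
print(best_t, sequence[best_t:best_t + len(freqs)])
```

Plotting `scores` against t gives the kind of profile described above. With real data, the frequencies would come from a table of counts over known donor sites, and the background from a local or global estimate.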
The "background" used here could be local or global, but whichever it is, it must include
non-splice sites.