
Locating splice sites

As mentioned before, splice sites are the boundaries between exons and introns, and they come in two varieties: the border going from exon to intron is called a donor site or a 5' site, and the border separating intron from exon is called an acceptor site or a 3' site. We need to identify these as part of the gene-finding process. By looking at known splice sites, we have available to us data such as that in Table 3 (taken from [8]).

  
Table 3: Base counts at each position, using a large dataset of known donor sites. The last base in the exon is position -1, so position 0 corresponds to the first base of the intron.

From this table it is clear, for example, that the first two bases in an intron are almost always GT, and the last base of the exon is often G. We can make use of such data to score how likely a given region is to contain a donor site. Say we have a sequence $x = x_1 x_2 \ldots x_n$. Then we need to calculate $P(x \mid \text{donor site})$. This must be calculated relative to the background to determine what is big and what is small: for example, the probability of a particular sequence might be small given a splice site, but it would also be small given the background.
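A minimal sketch of this kind of scoring, assuming that positions are independent of one another and that the background is uniform over the four bases. The counts are made-up placeholders, not the values of Table 3, and the names `COUNTS`, `position_probs` and `log_odds_score` are hypothetical.

```python
import math

# Hypothetical counts (NOT the values from Table 3): one dict per position,
# covering positions -1, 0, 1 relative to the exon/intron boundary.
COUNTS = [
    {"A": 10, "C": 5, "G": 80, "T": 5},   # position -1: G is common
    {"A": 0, "C": 0, "G": 100, "T": 0},   # position 0: almost always G
    {"A": 0, "C": 1, "G": 0, "T": 99},    # position 1: almost always T
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform


def position_probs(counts, pseudocount=1.0):
    """Turn per-position base counts into probabilities, adding a
    pseudocount so that unseen bases do not get probability zero."""
    probs = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        probs.append({b: (c + pseudocount) / total for b, c in column.items()})
    return probs


def log_odds_score(window, probs, background=BACKGROUND):
    """log P(window | donor) - log P(window | background),
    assuming positions are independent."""
    if len(window) != len(probs):
        raise ValueError("window length must match the number of positions")
    return sum(math.log(p[b] / background[b]) for b, p in zip(window, probs))


PROBS = position_probs(COUNTS)
print(log_odds_score("GGT", PROBS))  # positive: resembles a donor site
print(log_odds_score("ACA", PROBS))  # negative: resembles background
```

Working with the log of the ratio turns the product over positions into a sum, and a score above zero means the window is better explained by the donor model than by the background.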

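So to look for a donor site in the sequence $x$ we might calculate, for a window of length $k$ starting at position $t$,

$$ S(t) = \log \frac{P(x_t x_{t+1} \ldots x_{t+k-1} \mid \text{donor site})}{P(x_t x_{t+1} \ldots x_{t+k-1} \mid \text{background})} $$

and plot $S(t)$ for varying $t$. A donor site should appear as a sharp peak in the plot. The ``background'' used here could be local or global, but whichever it is, it must include non-splice sites.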
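The scan over $t$ can be sketched as follows, assuming a log-odds score, independent positions, and a global background estimated from the base frequencies of the whole sequence. The probabilities in `DONOR` are illustrative values only, not derived from Table 3, and the function names are hypothetical.

```python
import math

# Hypothetical position-specific probabilities for a 3-base donor model
# (positions -1, 0, 1); illustrative values only.
DONOR = [
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
]


def global_background(seq):
    """Base frequencies over the whole sequence (a 'global' background),
    smoothed with a pseudocount of 1 per base."""
    return {b: (seq.count(b) + 1) / (len(seq) + 4) for b in "ACGT"}


def scan(seq, donor=DONOR):
    """Return the log-odds score for every window start t."""
    bg = global_background(seq)
    k = len(donor)
    scores = []
    for t in range(len(seq) - k + 1):
        window = seq[t:t + k]
        scores.append(sum(math.log(donor[i][b] / bg[b])
                          for i, b in enumerate(window)))
    return scores


seq = "ACCATGCAAGGTAAGTCCATTACA"
scores = scan(seq)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, seq[best:best + 3])  # start of the highest-scoring window
```

Plotting `scores` against `t` gives the curve described above. A local background could be substituted by estimating the frequencies from a neighbourhood of each window rather than from the whole sequence.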



Simon Cawley
Fri May 1 15:50:13 PDT 1998