There are quite a few steps which need to be completed between original DNA sequence and decoded sequence. We will briefly describe what is involved in each of them.The following is a schematic presentation of the main steps.
Model
Basically, at the end of the day, we want to end up with easy to
decode sequence. We would like the peaks to be evenly spaced, approximately
the same height, and the same baseline. We will see that it is
possible to get ``nice--looking'' peaks in the middle region
(100bps--500bp) but decoding becomes increasingly more difficult in
the later parts of the read, and in most cases it is hopeless after about
the base 800. There are also may be some problems in the first 50
bps. The following is an example of what the decoded signal looks like.
We can see that we can easily read off the peaks up to about scan 400,
but after that it becomes less clear where the real signal is.
What confuses the signal?
Cross-Talk
What's involved in basecalling? Decoding is the last step of it; btu, it has to be preceded by several data processing steps. These pre--processing steps vary from one basecalling algorithm to another, but the main features are the same, and they are necessary for successful basecalling.
In ABI machines, under standard conditions, the first base passes the read window 60 to 120 minutes after the start of electrophoresis. The exact time depends on the gel length configuration that is being used. Each full scan takes six seconds and the instrument performs 10 scans every minute. Therefore the first base should appear somewhere between scan numbers 600 and 1200. (ABI manual, [3])
The problems in determining the first peak may come when a salt front running ahead of the sequencing primer is misinterpreted as the first base. Also, excess dye-terminators not removed in post--reaction clean--up can cause basecalling to start too early.
The raw data file with n scans as is an n by 4 matrix where the i-th tuple corresponds to the intensity values at the four wavelengths detected (i.e. the strength of the signal at each wavelength). Let us denote these four intensities at each time point by a vector s in 'detector space'. For each vector s, we want to recover a vector f in 'dye--space' corresponding to the concentration of each dye at the that time point.
There are different strategies for finding an appropriate linear transformation. The following technique is straightforward and non--iterative, but it's not efficient in the sense that it is not using all the available information.
The transformation is performed with a matrix M. M
is 4 by 4 matrix in which each column represents the relative signal
intensity at each wavelength in detector space for one of the
dyes. The transformation from 'dye--space' to 'detector--space' is
given by
. For our purposes we will need to invert
the matrix, converting the 'detector space' vectors
to 'dye-space' vectors. M is inverted numerically and
used in the equation
. The components
of M may be determined by identifying a known peak in the raw
trace (detector space) for each of the four different
fluorophores. For each of the identified peaks, the relative signal
intensities at the four wavelengths are entered into a column of
the matrix M. M is the normalized, and is inverted to
obtain
.
The entries for M do not change between runs or samples unless different fluorophores are used or detection apparatus is modified. Thus, once a suitable matrix is found, it can be used without modification for subsequent analysis. M can be determined by finding four different dye-labeled peaks. The intensity values for each of these peaks are used as the entries for M.
Some other procedures use the entire set of peak intensities and find the transformation iteratively. The ABI algorithm, for instance, calculates the transformation matrix for every read rather than relying on pre--calculated matrix.
The dye--mobility shift is nearly linear over the large regions and thus can be compensated for by using a small constant offset in one or more of the channels to produce a correct alignment. The shift is determined experimentally for each of the fluorophores ahead of time.
One of the normalization techniques is a piecewise linear normalization. The data is divided into several windows of size N and for each window a scaling factor is found such that the data in the window would be normalized to the [0,1] range if scaled by this factor. To avoid discontinuities, the factors are interpolated.
Currently, there is work being done to develop a ``probabilistic'' basecaller, i.e. to try to model how data comes to be; however, it's still in a development stage.
The features of the trace that are frequently used to basecall include (but don't need to be limited to) relative peak heights, peak spacing relative to the spacings between peaks in the local region, and half-height width of the peaks.
For instance, Phred is the most widely used basecaller with the lowest miscalling rate. Phred uses a four--phase procedure to basecall. In the first phase, idealized peak locations (predicted peaks) are determined. The idea is to use the fact that fragments are locally relatively evenly spaced, on average, in most regions of the gel, to determine the correct number of bases and their idealized evenly spaced locations in regions where peaks are not well--resolved, noisy or displaced. In the second phase observed peaks are identified in the trace by finding maxima. In the third phase, observed peaks are matched to the predicted peak locations, omitting some peaks and splitting others (judging by the relative area of the peak). Each observed peak is associates with one of four bases, and hence the ordered list of matched observed peaks determines a base sequence for the trace. The third phase is the most computationally intensive and uses various alignment techniques. In the final phase, the uncalled (i.e. unmatched) observed peaks are checked for any peak that appears to represent a base but couldn't be assigned to a predicted peak in the third phase, and if necessary the peak is inserted into the sequence. The predicted peaks that could not be matched to any observed peak in the third phase are assigned a value N; however, this happens extremely rarely. (On the other hand, N's are very frequent when basecalling is done using ABI basecaller.) (B. Ewing et.al [7])
We had done a small comparative study on the four above mentioned basecallers,and Phred came out to be a clear winner. The study was done as following. Each of the reads was aligned against the sequence thought to be ``correct'' ( consensus sequence) and rate of discrepancies in alignment regions was compared among the basecallers. There was a penalty for having shorter alignment regions. Phred turned out to be have the longest ``usable'' (alignable) reads with the least overall discrepancy rate. The same held when the discrepancy rates were stratified by position. (Say, among bps number 1 though 100, 100 through 200 etc).
Phred assigns a score to each called base. The relationship between
a quality score q and a probability of error p is
or
. The look--up table based on which the score is
assigned, was derived by dividing the 4-dimensional
space into homogeneous subregions and calculating an empirical error
probability in each. The training set for the developing the table was
about
7 million bp and the test set was about 20 million of bps.
4 variables are employed in deriving the table. The quality is based on the local properties of the read where ``local'' is defined as either a window of 1 peak on both sides of a base of interest or of 3 peaks on both sides of a base of interest. (B. Ewing and P. Green, [8])
The quality scores were validated as following. Each of the bp was aligned against the ``correct'' sequence. The bp were grouped according their quality scores (i.e. all the bps with quality of, say, 20 were in one group, etc) and their predicted error rate was compared to empirical one.
The scores were approximately validated for dye--primer data, however they seemed to be a bit on conservative side a new Energy--Transfer dye--terminator data.
It would be interesting to develop a method for assigning a quality score which would be simpler and more robust than a look--up table.