The following is an excerpt from my research project into ovarian cancer at the Peter MacCallum Cancer Centre.
SOLiD sequencing uses “colour space”, in which fluorescently labelled probes interrogate the DNA two nucleotides at a time. For example, a “blue” call on the first di-nucleotide (bases 1 & 2) corresponds to a repeat of any nucleotide (AA, CC, GG or TT). The next di-nucleotide to be read (bases 2 & 3) may then be “red” (AT, TA, CG or GC). If we know that nucleotide 1 is a “C”, then the “blue” call can only correspond to nucleotide 2 being a C. Since we now know that nucleotide 2 is a C, “red” can only correspond to nucleotide 3 being a G. Thus it is important that the true identity of the first nucleotide is known, as any mistake there will mean that the entire read is decoded incorrectly. The first base is generally known with confidence, as it is the last base of the adapter sequence (Applied Biosystems, 2008).
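The decoding chain described above can be sketched in a few lines. This is a minimal illustration, not the vendor's implementation: it assumes the commonly described two-base encoding in which each colour is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3), so that colour 0 is the "blue" repeated-base call. The function names `encode` and `decode` are hypothetical.

```python
# Assumed 2-bit base codes; under this scheme a colour is the XOR of the
# codes of two adjacent bases, so colour 0 always means a repeated base.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(bases):
    """Convert a base sequence into its list of colour calls (0-3)."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(bases, bases[1:])]


def decode(first_base, colours):
    """Decode colour calls into bases, starting from a known first base."""
    bases = [first_base]
    for colour in colours:
        # Each base is determined by the previous base and the colour call.
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ colour])
    return "".join(bases)


# Known first base C, a repeated-base call, then the call for "CG".
print(decode("C", encode("CCG")))
# The same colours decoded from a wrong first base give a different read.
print(decode("A", encode("CCG")))
```

Because every base depends on the one before it, decoding the same colours from the wrong first base changes every position in the read, which is why the known adapter base matters.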
When a reference genome is also available, the “colour space” measurements enable the detection of errors. Without additional error correction, a single erroneous colour call will affect the calling of all downstream nucleotides. However, where a reference is known, a single incorrect call can be detected by comparison with the reference: the read matches perfectly (in colour space) apart from one call, which identifies the misread colour. Additionally, since a SNP requires two adjacent colour changes, there is a clear distinction between a SNP and a single erroneous call.
(Applied Biosystems, 2008)
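The distinction between an isolated miscall and a SNP can be sketched as follows. This is an illustrative sketch only and is not how BioScope works internally: it assumes colours are integers 0 to 3 under the XOR encoding (A=0, C=1, G=2, T=3), that the read is already aligned to the reference in colour space, and that a single-base change leaves the XOR of the two affected colours unchanged. The function name and labels are hypothetical.

```python
def classify_mismatches(read_colours, ref_colours):
    """Label each colour mismatch between an aligned read and the reference.

    Returns (position, label) tuples, where position is the index of the
    first colour involved in the mismatch.
    """
    if len(read_colours) != len(ref_colours):
        raise ValueError("read and reference must have the same length")

    mismatches = [i for i, (r, f) in enumerate(zip(read_colours, ref_colours))
                  if r != f]
    labels = []
    i = 0
    while i < len(mismatches):
        pos = mismatches[i]
        adjacent = i + 1 < len(mismatches) and mismatches[i + 1] == pos + 1
        # A single changed base alters two adjacent colours but keeps
        # their XOR equal to that of the reference colours.
        if adjacent and (read_colours[pos] ^ read_colours[pos + 1]
                         == ref_colours[pos] ^ ref_colours[pos + 1]):
            labels.append((pos, "snp_candidate"))
            i += 2
        else:
            labels.append((pos, "likely_colour_error"))
            i += 1
    return labels


reference = [1, 3, 1, 3, 1]       # colours for the bases ACGTAC
with_snp = [1, 1, 3, 3, 1]        # colours for ACATAC (G changed to A)
with_miscall = [1, 3, 1, 0, 1]    # one colour misread
print(classify_mismatches(with_snp, reference))
print(classify_mismatches(with_miscall, reference))
```

Two adjacent mismatches that do not preserve the XOR are treated here as separate miscalls, which is a simplifying choice rather than an established rule.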
The BioScope alignment software supplied by Applied Biosystems uses all this information to produce the alignment that we will be utilizing.
Long mate pairs enable the sequencing of the two ends of a single DNA fragment of known length. The distance between the two reads is known as the insert size, and its expected value is known prior to alignment. By comparing the expected insert size with the insert size observed after alignment, it is possible to detect structural variations such as tandem duplications.
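As a rough illustration of the comparison described above, the sketch below labels a mate pair by how far its aligned insert size departs from the expected value. The function name, the labels and the tolerance are hypothetical choices for this example; a real analysis would derive the tolerance from the library and would also consider read orientation.

```python
def classify_pair(observed_insert, expected_insert, tolerance):
    """Compare an aligned insert size with the expected library insert size.

    All values are in base pairs. `tolerance` is the deviation still
    treated as normal library variation (an assumed, user-chosen value).
    """
    deviation = observed_insert - expected_insert
    if abs(deviation) <= tolerance:
        return "concordant"
    if deviation > 0:
        return "longer_than_expected"
    return "shorter_than_expected"


# Hypothetical pairs from a library with an expected insert of 1500 bp.
for observed in (1480, 4200, 300):
    print(observed, classify_pair(observed, expected_insert=1500, tolerance=300))
```

Pairs labelled as longer or shorter than expected are only candidates: they indicate that the sample and the reference may differ between the two reads, not what kind of variation is present.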
In order to sequence DNA in this way, the DNA is randomly sheared, creating a distribution of DNA fragments of different lengths. Fragments are then selected by size (in our case 1500 bp) and capped with adaptors by ligation. The capped fragments are then circularized, binding the two ends of the fragment together. A nick translation reaction enables the circularized fragment to be cut on both sides of the point where the two ends are joined. By controlling the time and temperature of the reaction, the distance of the cuts from the join can be set, which controls the length of DNA retained at each end (e.g. 50 bp). The now bare ends of the fragment are then ligated with adaptors and are suitable for sequencing. This process is shown in the following diagram:
(Applied Biosystems, 2010)
Despite the error-correcting techniques employed by this sequencing technology, the data is far from perfect. Detecting sequencing errors by relying on the reference sequence is not infallible, as it may not be possible to know whether a read fails to map because of a single colour space read error or because the read actually covers a structural variation. The statistics strongly favour a read error (the error rate is claimed to be 0.1%) (Applied Biosystems, 2008) over the read being correct and there being a structural variation, but when all such reads are corrected in this manner, it may be difficult to find “true” structural variations.
The library preparation process is difficult and needs to be exacting. The selection of fragments by size requires meticulous wet-lab work, with the difference between selecting for 600 bp fragments and 6 kb fragments being the difference between a 1% agarose solution and a 0.8% agarose solution (Applied Biosystems, 2010). This process will always lead to significant variability in the length of the fragments. Given that the insert size is the primary signal we are looking for, this variability needs to be corrected for. However, the orientation of the pairs should not suffer from this variability to the same degree, due to the simpler chemistry of how the different pair adaptors are joined to the fragment.
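One possible way to account for this variability is to estimate the insert size distribution from the aligned pairs themselves rather than trusting the nominal size. The sketch below is an assumption-laden illustration, not the method used in the project: it uses the median and the median absolute deviation (MAD) because they are not distorted by the few very large inserts that real structural variations produce. The function names, the cut-off of three scaled deviations and the sample values are all hypothetical.

```python
from statistics import median


def robust_insert_model(insert_sizes):
    """Estimate the centre and spread of the observed insert sizes.

    The MAD is scaled by 1.4826 so that it is comparable to a standard
    deviation when the bulk of the data is roughly normal.
    """
    centre = median(insert_sizes)
    mad = median(abs(size - centre) for size in insert_sizes)
    return centre, 1.4826 * mad


def flag_discordant(insert_sizes, n_sd=3.0):
    """Return the insert sizes lying more than n_sd spreads from the centre."""
    centre, spread = robust_insert_model(insert_sizes)
    if spread == 0:
        return []  # no measurable variability, so nothing can be flagged
    return [size for size in insert_sizes if abs(size - centre) > n_sd * spread]


# Hypothetical aligned insert sizes (bp) for a nominal 1500 bp library.
sizes = [1400, 1450, 1480, 1500, 1500, 1520, 1550, 1600, 5200]
print(robust_insert_model(sizes))
print(flag_discordant(sizes))
```

With this approach the threshold for calling a pair discordant adapts to how tightly the size selection actually worked for a given library.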