Structural variation detection: Second generation sequencing

The following is an excerpt from my research project into ovarian cancer at the Peter MacCallum Cancer Centre.

There are four primary methods of detecting structural variation (SV) in second-generation sequencing data: read pair discordance, read depth analysis, split read analysis and sequence assembly.  Sequence assembly is not yet feasible with short-read whole-genome human data, so it will not be discussed.

Read pair discordance

As mentioned previously, both SOLiD and Illumina sequencing technologies provide the opportunity to read two ends of the same strand of DNA.  The sequenced strands are of a known length, so there is an expectation that the two reads should be within a known distance of each other on the genome.

Paired end discordance methods take the reads which have been aligned to the reference genome and look at how far apart the two reads of each pair align.  If the first read is found at the expected distance from the second read (and in the correct orientation) then the pair is said to be concordant.  If not, the pair is discordant and may provide evidence that the sample genome differs structurally from the reference genome, that is, evidence of a structural variation.
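To make the concordant/discordant distinction concrete, here is a minimal sketch of the test described above. The `ReadPair` fields, the three-standard-deviation cut-off and the forward/reverse orientation rule (the usual Illumina paired-end layout) are my own assumptions for illustration; real tools estimate the insert size distribution from the data and handle other library layouts, such as SOLiD mate pairs.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int        # leftmost aligned position of read 1
    forward1: bool   # True if read 1 aligned to the forward strand
    chrom2: str
    pos2: int        # leftmost aligned position of read 2
    forward2: bool   # True if read 2 aligned to the forward strand
    read_len: int


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Return 'concordant' or a short reason why the pair is discordant."""
    if pair.chrom1 != pair.chrom2:
        return "discordant: different chromosomes"
    # Order the two reads by their position on the reference.
    left, right = sorted(
        [(pair.pos1, pair.forward1), (pair.pos2, pair.forward2)]
    )
    # Assumed layout: left read on the forward strand, right read on the reverse.
    if not (left[1] and not right[1]):
        return "discordant: orientation"
    # Distance from the start of the left read to the end of the right read.
    span = right[0] + pair.read_len - left[0]
    if span > mean_insert + n_sd * sd_insert:
        return "discordant: span too large"
    if span < mean_insert - n_sd * sd_insert:
        return "discordant: span too small"
    return "concordant"
```

A span that is too large is the pattern expected for a deletion in the sample, and one that is too small for an insertion, which is why the reason is kept rather than a simple yes/no.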

There are a large number of tools that use this basic methodology for the detection of structural variation.  One rough indicator of how widely each tool is used is the number of citations it has received, which are summarized below.

 

Tool             Author         Journal          Year  Citations (Oct-2011)
BreakDancer      Chen           Nature Methods   2009  68
VariationHunter  Hormozdiari    Genome Research  2009  52
PEMer            Korbel         Genome Biology   2009  38
MoDIL            Lee            Nature Methods   2009  34
GASV             Sindi          Bioinformatics   2009  18
NovelSeq         Hajirasouliha  Bioinformatics   2010  14
SVDetect         Zeitouni       Bioinformatics   2010  7
SLOPE            Abel           Bioinformatics   2010  6

 

 

Given the number of tools available and the word limit of this piece, I will only discuss BreakDancer in detail.

BreakDancer

BreakDancer takes the aligned reads, supplied as SAM or BAM files, and looks for regions of the sample genome that contain more discordant pairs than would be expected by chance.  The read pairs in these regions are classified into six types: normal, deletion, insertion, inversion, intra-chromosomal translocation and inter-chromosomal translocation.  The classification depends on the size of the insert-size discordance and the orientation of the reads.  Regions with two or more discordant pairs are considered for further analysis using a Poisson model that takes into account the number of supporting reads, the size of the anchoring region and the coverage of the genome.  The type of each structural variation call is then the type supported by the most anchored reads.
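The idea behind a Poisson model of this kind can be sketched as follows. This is not BreakDancer's actual scoring function. It is a simplified illustration that assumes discordant pairs are scattered uniformly across the genome and asks how surprising it is to see a given number of them anchored in one region. The function names and parameters are hypothetical.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X < k)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def region_p_value(n_supporting, region_size, n_discordant_genome_wide, genome_size):
    """Chance of seeing at least n_supporting discordant pairs in the region."""
    # Expected number of discordant pairs anchored in a region of this size
    # if discordant pairs were spread uniformly across the genome.
    lam = region_size * n_discordant_genome_wide / genome_size
    return poisson_tail(n_supporting, lam)
```

Because the tail is computed as a complement, very small probabilities bottom out at floating-point precision, which is adequate for ranking regions in a sketch but not for reporting exact scores.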

A key difficulty with using BreakDancer is the number of false positives that the tool produces, as seen in the following table:

 

 

SV type                           Normal  Tumour
Inversions                        8,187   1,304
Deletions                         129     8
Intra-chromosomal translocations  1,709   4
Total                             10,025  1,315

 

 

It should be noted that, in the configuration used for this run, BreakDancer was not set to look for inter-chromosomal translocations.

These results imply that the normal sample has more structural variation than the tumour sample, which is known not to be the case.  Interpreting these results will be a key aspect of this research project.



Read depth techniques

A key advantage of read depth techniques is their ability to detect SV within highly repetitive regions of the genome, as “paired-end mapping frequently cannot unambiguously assign end sequences in duplicated regions, making it impossible to distinguish allelic and paralogous variation” (Alkan, Kidd et al. 2009).

MrFAST

MrFAST (micro-read fast alignment search tool) is a tool designed to detect CNV using second generation sequence data.  It is also the most popular SV tool described in this review, having been cited by 103 papers as of October 2011.  There is also an additional version called “DrFast” which is designed for SOLiD color-space data.

Operation
MrFAST attempts to align the raw reads to a reference genome, much like aligners such as BioScope or MAQ.  However, MrFAST differs in two key details.  Firstly, it does not attempt to map the full reads, instead breaking a read up into k-mers (with a default length of 12), which are then aligned.  Secondly, most aligners, when faced with a read that matches multiple loci within the genome, will select a locus at random.  MrFAST, on the other hand, will map the read to all matching loci in order to reduce variability.  Additionally, MrFAST tracks the “edit distance” for each read at each locus, both to reduce the impact of sequencing error and to enable the calling of SNPs, which is important in determining whether a called copy is actually functional or not (Alkan, Kidd et al. 2009).
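The seed-and-report-everything idea can be sketched in a few lines. This is a toy illustration, not MrFAST's implementation. It indexes every k-mer in a small reference string, uses non-overlapping k-mers from the read as seeds, and reports every locus where the full read fits within a mismatch budget. Hamming distance (mismatches only) stands in for edit distance, and all function names are my own.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read_all_loci(read, reference, index, k, max_mismatches=2):
    """Return every (position, mismatches) where the read fits the reference."""
    candidates = set()
    # Use non-overlapping k-mers of the read as seeds.
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    # Verify each candidate locus against the full read.
    for start in sorted(candidates):
        d = hamming(read, reference[start:start + len(read)])
        if d <= max_mismatches:
            hits.append((start, d))
    return hits
```

For a read drawn from a duplicated region, the result contains one entry per copy, each with its own mismatch count, rather than a single randomly chosen placement.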

Validation
The results called by MrFAST were validated using Array-CGH and FISH analysis.  It is difficult to quantify the success rate of the Array-CGH validation, as the authors only sought to validate those called duplication intervals that were not shared across all three of their samples; for these they found a validation rate of 68%.  They also used FISH analysis to validate 11 duplicated loci that differed between two of their samples, finding the FISH results to be “highly consistent with the absolute copy number predicted by MrFAST” (Alkan, Kidd et al. 2009).

 

 

CNV-Seq

CNV-Seq uses a different model to MrFAST: rather than seeking to detect the absolute number of copies, CNV-Seq uses a comparative technique that detects differences between two samples.  This is of particular interest for cancer data, where we have two samples, tumour and normal, and the primary interest is in the differences between them.

Operation
Both samples are aligned to a reference genome.  Using a sliding window, the read depth of each window is calculated for each sample.  The read depth distributions are compared using a Poisson model (using a normal approximation) that enables the calculation of a probability that the difference is the result of random chance (Xie and Tammi 2009).
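A simplified version of this comparison is sketched below. It is not CNV-Seq's exact statistic. It assumes fixed, non-overlapping windows rather than a true sliding window, and it tests each window with a normal approximation to the binomial, which is what two independent Poisson counts reduce to once their sum is fixed. The function names are hypothetical.

```python
import math


def window_counts(read_starts, genome_length, window_size):
    """Count reads whose start position falls in each fixed-size window."""
    n_windows = (genome_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window_size] += 1
    return counts


def window_z_scores(tumour_counts, normal_counts):
    """z-score per window for 'tumour has its expected share of the reads'."""
    total_t = sum(tumour_counts)
    total_n = sum(normal_counts)
    # Expected tumour share of the reads in any window, given overall depth.
    p = total_t / (total_t + total_n)
    scores = []
    for t, n in zip(tumour_counts, normal_counts):
        m = t + n
        if m == 0 or p in (0.0, 1.0):
            scores.append(0.0)  # no information in this window
            continue
        # Normal approximation to Binomial(m, p).
        scores.append((t - m * p) / math.sqrt(m * p * (1 - p)))
    return scores


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))
```

Normalising by the overall tumour share of reads matters because the two samples are rarely sequenced to exactly the same depth.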

Validation
Calls made by CNV-Seq were validated against the low-coverage genomes of Dr Craig Venter and Dr James Watson (at 7.5x and 7.4x coverage), with the calls achieving a 50% overlap with known regions.  Results were also compared to a-CGH micro-array experiments.  The majority of calls were not validated; however, this was seen as evidence of the superior sensitivity of the CNV-Seq technique.

 


 

Split read techniques
Split read techniques attempt to locate the exact junction of a break point by looking for reads which have “hard clips”, which is to say that a large portion of the read maps to the reference genome, but the remaining section does not.  One possible explanation for a hard clip is that there is a structural variation in the region that the read spans.  Thus part of the read may map to one locus, and the remaining hard clipped region may map to another - providing evidence of a structural variation.
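Clipping is recorded in the CIGAR string of each SAM record, so candidate break points can be pulled out with a small parser. The sketch below is an illustration under my own assumptions, not how any particular tool does it. It counts both clipping operations defined by the SAM format (`S` for soft clips, where the clipped bases stay in the record, and `H` for hard clips, where they are removed), and `min_clip` is an arbitrary threshold.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def clipped_ends(cigar):
    """Return (left_clip, right_clip) lengths in bases for a CIGAR string."""
    ops = parse_cigar(cigar)
    left = right = 0
    # Clipping operations can only appear at the ends of a CIGAR string.
    i = 0
    while i < len(ops) and ops[i][1] in "SH":
        left += ops[i][0]
        i += 1
    j = len(ops) - 1
    while j >= i and ops[j][1] in "SH":
        right += ops[j][0]
        j -= 1
    return left, right


def breakpoint_candidates(pos, cigar, min_clip=10):
    """1-based reference positions where the aligned part of a read stops."""
    left, right = clipped_ends(cigar)
    # Reference bases consumed by the aligned portion of the read.
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    candidates = []
    if left >= min_clip:
        candidates.append(pos)                # first aligned base
    if right >= min_clip:
        candidates.append(pos + ref_len - 1)  # last aligned base
    return candidates
```

Many reads clipped at the same reference position are much stronger evidence than a single one, so in practice these candidates would be tallied per position before anything is called.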

This technique requires long reads to be effective (the CREST algorithm requires reads longer than 75 bp (Wang, Mullighan et al. 2011)).  With shorter reads, the hard-clipped region is typically very short, meaning that it may map to a very large number of regions within the genome, making it impossible to detect the secondary locus.
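A back-of-the-envelope calculation shows why short clipped regions are uninformative. The sketch assumes a genome of roughly 3 billion bases with uniformly random base composition and counts both strands. Real genomes are far more repetitive, so the true number of spurious matches for a short sequence will be higher still.

```python
def expected_random_hits(clip_length, genome_size=3_000_000_000):
    """Expected chance matches of a clip_length-base sequence on either strand."""
    return 2 * genome_size / 4 ** clip_length


for length in (8, 12, 16, 20, 25):
    print(f"{length:>2} bp clip: ~{expected_random_hits(length):.3g} chance matches")
```

Under these assumptions a clip of around 10 bases is expected to match thousands of places by chance, while one of 20 bases or more is expected to match essentially nowhere other than its true origin.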