Overcoming the data deluge of Next Generation Sequencing with Compact Data Structures

Introduction

Large scale sequencing projects such as the 1000 Genomes Project, The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) seek to sequence the genomes of thousands of samples.  A single TCGA study into breast cancer alone sequenced the whole exomes of 507 samples1.  As the cost of sequencing a whole genome drops towards $1,000, this type of research will become even more accessible to researchers.  However, the data generated by these experiments poses unique problems due to its volume and the complexity of the required analysis.

The majority of analysis conducted within re-sequencing projects lies in determining how the sequenced sample differs from a reference genome.  As such, most analysis is conducted on the mapped sequence files, which contain information on how each read generated by the sequencer maps back to a reference genome.  The mapping files enable the detection of different types of mutations, including single nucleotide polymorphisms (SNPs), copy number variations (CNVs) and structural variations.  These files use the Sequence Alignment/Map2 (SAM) format, which represents the information as text; a more space efficient binary representation of this format (BAM) is also available.

Because next generation sequencing machinery often generates hundreds of millions or billions of reads for a single sample, these files are usually on the order of hundreds of gigabytes in size.  They are therefore extremely difficult to work with without significant computing resources.  Simply copying the data from a single sample can take hours, let alone common operations such as merging, indexing and sorting.  This problem is even more serious in cancer studies, where the standard practice is to sequence not only the genome of the tumour tissue but also that of surrounding healthy tissue to provide a baseline for comparison – effectively doubling the amount of data generated.

Thankfully, sequencing data is static, meaning that once it is written it does not change.  This makes it well suited to compression and to data structures that are optimized for space rather than for updates.  Succinct and compact data structures store data at close to the minimum entropy bound (the theoretical minimum amount of space required to express the information), whilst still allowing efficient access to the underlying data3.  Succinct data structures have already been used within genomics, with the short-read de-novo assembler Gossamer relying on them heavily4.  If the information stored in BAM files can be represented using data structures optimized for the types of analysis typically required, then it may be possible to significantly reduce the cost and turnaround time of analysing genomic and transcriptomic material.


 

Use cases for analysing genomic material

When identifying mutations in next generation sequencing data, the different classes of mutations are identified in different ways and require different subsets of the data stored within the alignment file.  By extracting only the information useful for a specific type of analysis, a much smaller amount of data needs to be analysed.  For example, the individual base quality ratings may be useful for SNP calling, but are much less important for calling structural variations.  Since one of the easiest ways to make a computation faster is to reduce the amount of data to be processed, extracting only the data required to identify each class of mutation can dramatically reduce the computational requirements.

It may be argued that, since extracting the information for each type of analysis requires reading the entire file, and reading the file is a major overhead, you might as well do the analysis as you read.  The problem with this line of argument is that it does not reflect how bioinformatic analysis is conducted.  Most bioinformatic tools have many parameters which need to be tuned to the specific data set at hand.  For example, the read-depth of a sample may affect how many reads are discarded as duplicates.  The reality is that, for most analyses, tools need to be run many times over the same data set before producing useful results.  Doing a single pass over the original file to produce smaller data structures suited to specific analysis tasks therefore makes sense, as the initial overhead of generating the new structures will be more than made up for over repeated runs of downstream tools.

The data structures used to store the relevant information for each use case can also be optimized for query performance and size.  Even though these structures will house only a subset of the data contained in the BAM, they may still need to store billions of values.  It is important that these structures support efficient query access so that unnecessary work is avoided (for example, if the values are sorted, a binary search may remove the need to check every value).

In order to assess the benefit of smaller data structures, it is important to consider the architecture of modern computers.  Modern CPU architectures are very complex, but the following table illustrates the approximate latency of accessing each level of the memory hierarchy for the Intel Core i7/Xeon range of processors5.

Type           Latency (ns)     Size
L1 cache       0.5              64KB
L2 cache       7.0              256KB
L3 cache       40.0             10MB
Main memory    100.0            Gigabytes
Disk           10,000,000.0     Terabytes

Here we see that the latency of access varies widely and that the memory with the smallest capacity is the fastest by many orders of magnitude.  Therefore the more compact our data structures can be, the better the performance of our analysis as they become more likely to fit into a better performing level of cache.

Succinct, Implicit and Compact data structures

There is a lower bound on the number of bits required to store information.  This is known as the “information-theoretical lower bound”6 and depends on the number of distinct values each item can take and on the number of items to be stored.  For example, to encode the outcome of a series of coin tosses the number of possible values to record is two, heads and tails.  Each toss of the coin can therefore be represented using 1 bit, using “1” to represent a head and “0” to represent a tail.  A sequence of ten coin tosses could then be represented as 0010011001, using ten bits.  This representation is compact, in that it uses only as much memory as is absolutely required7.  We could also encode each of the coin tosses using the 8-bit ASCII characters ‘H’ and ‘T’, but in doing so we would need 8 bits for each head and 8 bits for each tail.  The sequence of 10 coin tosses would then require 80 bits rather than the 10 bits of the more efficient encoding.
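
As an illustration only (this code is not part of the research tool), the following C sketch packs the ten-toss sequence above into single bits rather than storing one 8-bit character per toss:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Ten coin tosses: 1 = head, 0 = tail */
    const int tosses[10] = {0, 0, 1, 0, 0, 1, 1, 0, 0, 1};
    uint16_t packed = 0;

    /* Pack each toss into one bit of a 16-bit word (only 10 bits are used) */
    for (int i = 0; i < 10; i++)
        packed |= (uint16_t)((tosses[i] & 1) << i);

    /* 10 bits of information instead of 80 bits of ASCII 'H'/'T' characters */
    printf("packed value: 0x%03x\n", packed);
    return 0;
}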

In computer science, the information-theoretical minimum number of bits is commonly denoted Z.  An implicit data structure uses Z + O(1) bits of space, which is to say that it uses Z bits to store the data plus only a constant number of additional bits.  A succinct data structure uses Z + o(Z) bits, meaning it may use the information-theoretical minimum plus an amount of extra space that grows more slowly than Z itself.  A compact data structure uses O(Z) bits, that is, it stays within a constant factor of the information-theoretical minimum.  Common succinct data structures include the rank/select dictionaries which underpin the Gossamer de-novo assembler4.

Information theory can also be applied to bioinformatic data sets.  Take for example a collection of next generation sequencing reads which have been aligned to a reference genome.  Each of these reads is associated with a mapping quality score, MAPQ, an 8-bit integer value equal to -10 log10(Pr(mapping position is wrong)), rounded to the nearest integer2.  A MAPQ of 30, for example, corresponds to a 1-in-1,000 probability that the mapping position is wrong.  While this score may take any value from 0 to 255, it rarely uses all of those values.  We therefore have a mismatch between the number of bits being used to store this value and the number of bits actually required to store it.

Take for example the following list of ten MAPQ scores:

100, 20, 30, 100, 30, 20, 100, 100, 30, 30

We could encode the list of numbers as 8 bit integers and require 80 bits to store them.  Alternatively, we could note that there are only 3 distinct values in the list (20, 30, 100).  If we store these distinct values in a separate list (the keys list), the keys themselves only require 3 bytes (24 bits).

Thus we can create a mapping between the actual values (20, 30, 100) and their indexes.  Since the indexes start at 0 and skip no numbers, each index can be encoded with fewer bits (in this case two bits).

Keys     20    30    100
Index    0     1     2

Using this table, it is possible to link each value in our list to its Key by using its index.  This creates the following table:

Value         100   20   30   100   30   20   100   100   30   30
Key Position  2     0    1    2     1    0    2     2     1    1

Since the key positions are limited to the range 0 to 2, we can encode each link to a key in 2 bits.  This means that we can store the 3 keys using 8 bits each (24 bits in total), plus store the links in 2 bits each (2 * 10 = 20 bits in total).  Rather than storing the sequence of 10 MAPQ scores in 80 bits, we therefore only need 44 bits, a saving of 45% of the space required.
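
The following C sketch illustrates this dictionary-style encoding.  It is a minimal standalone example rather than the implementation used in this research, and the array and variable names are chosen purely for illustration:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* The ten MAPQ scores and the three distinct keys they map onto */
    const uint8_t scores[10] = {100, 20, 30, 100, 30, 20, 100, 100, 30, 30};
    const uint8_t keys[3]    = {20, 30, 100};
    uint32_t packed = 0;                      /* 10 indexes * 2 bits = 20 bits */

    for (int i = 0; i < 10; i++) {
        uint32_t index = 0;
        while (keys[index] != scores[i])      /* find the key's index (0, 1 or 2) */
            index++;
        packed |= index << (2 * i);           /* store the 2-bit index            */
    }

    /* Decode the first score: extract its 2-bit index and look up the key */
    uint32_t idx0 = (packed >> 0) & 0x3;
    printf("first score decodes to %u\n", keys[idx0]);   /* prints 100 */
    return 0;
}

Together the 24 bits of keys and the 20 bits of packed indexes make up the 44 bits described above.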


Vector versus Row Orientation

The previous example demonstrates how a series of values (a vector) can be encoded in a space efficient manner.  The data generated from sequence alignments is actually tabular, with the alignment of each read forming a row and its quality metrics, position, insert size and so on being fields belonging to that row.  One way of looking at this problem is to decompose the structure into a vector of vectors, or a list of lists.  There are two ways this can be done: a table can be treated as a list of columns or as a list of rows.  The following table demonstrates this concept.

Position   MAPQ   PHRED_AVG
10001      100    25
10002      20     25
10003      30     25
10004      20     25

This data can either be stored in a row oriented manner, where the data is stored:

10001, 100, 25, 10002, 20, 25, 10003, 30, 25, 10004, 20, 25

Or it can be stored in a column oriented manner, where the data is stored:

10001, 10002, 10003, 10004, 100, 20, 30, 20, 25, 25, 25, 25

Both storage layouts require the same amount of space when stored in a compact manner (such as a vector of vectors), but they have differing access patterns.  The key issue with the row based approach is that the entire set of data generally needs to be scanned even if you are only looking for values from one or two columns.  Take the example where we want to return all positions where the MAPQ is 30.  To find these we need to test the second field of all four rows, returning the positions where the condition is true.  Testing the second field of each row means skipping over the other fields at each row, which, while simple from a programming perspective, is less efficient for the computer because its caches are poorly utilized: only one of the three fields loaded into each cache line is actually used.  Modern processors are fast enough that cache misses, which cause the processor to “stall”, are a major source of latency as the processor waits for data to be retrieved from main memory8.

Alternatively, storing the values in a column oriented layout means that testing each value only requires stepping to the next element of the same column, which is far more likely to already be in the cache (since data is loaded into the caches from main memory in sequential chunks).  This means that the CPU’s caches are utilized roughly X times better (where X is the number of columns), making a column oriented approach faster when scanning values sequentially.
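
A minimal C sketch of the column-oriented scan described above (the names and values are illustrative only, based on the four-row example table):

#include <stdint.h>
#include <stdio.h>

#define N 4

int main(void) {
    /* Column-oriented layout: one contiguous array per field */
    const uint32_t position[N] = {10001, 10002, 10003, 10004};
    const uint8_t  mapq[N]     = {100, 20, 30, 20};

    /* Find all positions where MAPQ == 30.  Only the mapq array is scanned,
     * so the traversal stays within a small, cache-friendly region of memory
     * and the position array is touched only for matching rows. */
    for (int i = 0; i < N; i++)
        if (mapq[i] == 30)
            printf("match at position %u\n", position[i]);
    return 0;
}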

A column oriented approach is more complex when querying across multiple columns.  Suppose, for example, that we want to find the positions of the reads with a MAPQ score of 20 and a PHRED_AVG of 25.  Answering this against a column oriented data structure means scanning the MAPQ column, then scanning the PHRED_AVG column, and taking the intersection of the two result sets.  The two columns match at the following (zero-based) row indexes:

MAPQ:       (1, 3)
PHRED_AVG:  (0, 1, 2, 3)

The intersection of the two is (1, 3), which corresponds to positions 10002 and 10004.  Another way to handle this process is not to build a list of the indexes where the values are found, but rather to encode the matches in a bitmap, where a 0 represents no match and a 1 represents a match.  The bitmaps representing the same data now look like:

MAPQ:       (0, 1, 0, 1)
PHRED_AVG:  (1, 1, 1, 1)

Calculating the rows where both conditions are true now becomes a simple logical AND between the two bitmaps.  On modern processors, bitwise operations are extremely fast: using Intel’s AVX extensions, a “Sandy Bridge” processor can calculate a bitwise AND over 256 bits in a single clock cycle7.  With processor clock speeds in the gigahertz, this means that a modern processor could theoretically process hundreds of billions of bits per second.  Here the processor’s ability to move data from main memory into its registers is the clear bottleneck, which illustrates how important predictable memory access is and how important it is to keep the processor’s various caches full.
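
A minimal sketch of this bitmap intersection using ordinary 64-bit words (an AVX version would simply operate on wider registers); the bitmap values correspond to the four-row example above, the least significant bit is row 0, and the function name is illustrative:

#include <stdint.h>
#include <stdio.h>

#define WORDS 1   /* one 64-bit word is enough for the four-row example */

/* AND two match bitmaps together: a 1 bit means the row satisfied the
 * predicate for that column, so the AND yields the rows satisfying both. */
static void bitmap_and(const uint64_t *a, const uint64_t *b, uint64_t *out) {
    for (int w = 0; w < WORDS; w++)
        out[w] = a[w] & b[w];
}

int main(void) {
    uint64_t mapq_matches[WORDS]  = {0xA};   /* binary 1010: rows 1 and 3 */
    uint64_t phred_matches[WORDS] = {0xF};   /* binary 1111: rows 0 to 3  */
    uint64_t both[WORDS];

    bitmap_and(mapq_matches, phred_matches, both);
    printf("rows matching both predicates: 0x%llx\n",
           (unsigned long long)both[0]);      /* prints 0xa (rows 1 and 3) */
    return 0;
}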

Due to the speed at which modern processors can manipulate bitmaps using logical operations, we want to employ them as much as possible.  This leads to a novel structure for storing the vectors for each column.

Recall that we can encode the MAPQ scores in 2 bits given that there are only 3 possible values that are seen in the data.  This means that the following are all equivalent representations:

Values   Lookup   Lookup (Bits)
100      2        10
20       0        00
30       1        01
20       0        00

Here “Lookup (Bits)” is simply the Lookup index written in binary.  This also means that the following representations are equivalent:

 

             Row 0   Row 1   Row 2   Row 3
Lookup bits  10      00      01      00

             Row 0   Row 1   Row 2   Row 3
First bits   1       0       0       0
Last bits    0       0       1       0

Here we break each of the lookup values into its most significant bit and its least significant bit.  Rather than encoding all the data as [10 00 01 00], it is encoded as [1 0 0 0] (First bits) and [0 0 1 0] (Last bits).  The reason for this is to allow comparisons to use the same fast intersection technique described above.  To do this, we first find the corresponding value in our keys list and use its index to determine its lookup bits (e.g. looking up “30” tells us that the lookup bits are [01]).  We then interrogate each bit within the lookup bits, finding that the first bit is a zero.  This means that we need to calculate the bitwise NOT of the corresponding bit vector (First bits in this case), which gives us all the positions where the first bit of the stored values matches that of the lookup value.  We then repeat the process for the Last bits vector (skipping the NOT, since the second bit of the lookup key is a “1”) and take the bitwise AND of both vectors.

If, for example, we are looking for all the positions where the value is 30 (which corresponds to [01]), we can find the bitmap encoding the positions where this condition is true by taking the logical NOT of the First bits bitmap (since the first bit of the search value is 0) and then ANDing it with the Last bits bitmap.  This is defined more rigorously in the following piece of C code used in this research:

u64 * tny_page_seek_and(tny_page *page, int keyIndex, u64 * restrict result) { 
    // Define an array of whether the key is a 0 or 1 at each bit position
    int keyBitMask[page->depth];
  
    // Fill the key bit mask
    for (int j = 0; j < page->depth; j++) {
        keyBitMask[j] = (keyIndex & (1 << j));
    }

    // Each bit level has its own array (page->data) of words corresponding to
    // the bits of the lookup keys at that level.  MSB is the first, LSB the last
    for (int d = 0; d < page->depth; d++) {
        u64 * restrict cur = page->data[d];
        if (keyBitMask[d] == 0) {
            // If it's a zero, we need to do a NOT before the AND
            // * NOTE that COLPAGE_LENGTH_WORDS is the number of values in this
            // vector (data is arranged in pages of a fixed number of values)
            for (int w = 0; w < COLPAGE_LENGTH_WORDS; w++)
                result[w] &= ~cur[w];
        } else {
            // If it's a one, we can just do the AND
            for (int w = 0; w < COLPAGE_LENGTH_WORDS; w++)
                result[w] &= cur[w];
        }
    }
    return result;
}

 

The result is a bitmap containing the positions where the lookup key matches the stored values.  This bitmap is then ready to be combined with the results of lookups on other columns.  In this way, searching for values decomposes into a long series of logical AND operations which can be completed as quickly as the memory can be transferred to the CPU, on the order of gigabytes per second.
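
As a usage sketch only — the tny_page structure and COLPAGE_LENGTH_WORDS constant below are assumptions inferred from the function above rather than the project's actual definitions, and the example assumes a completely filled page — a caller would initialize the result bitmap to all ones and then count the surviving bits:

#include <stdint.h>

typedef uint64_t u64;
#define COLPAGE_LENGTH_WORDS 15625   /* assumed: 1,000,000 values / 64 bits per word */

typedef struct {
    int  depth;   /* number of bit levels (2 for the MAPQ example) */
    u64 **data;   /* one bit vector per level                      */
} tny_page;

u64 *tny_page_seek_and(tny_page *page, int keyIndex, u64 * restrict result);

/* Count the rows in one (full) page whose stored key index equals keyIndex */
long count_matches(tny_page *page, int keyIndex) {
    static u64 result[COLPAGE_LENGTH_WORDS];
    long matches = 0;

    /* Start from all ones so the ANDs inside the seek can only clear bits */
    for (int w = 0; w < COLPAGE_LENGTH_WORDS; w++)
        result[w] = ~0ULL;

    tny_page_seek_and(page, keyIndex, result);

    /* Popcount the surviving bits (GCC/Clang builtin) to count matching rows */
    for (int w = 0; w < COLPAGE_LENGTH_WORDS; w++)
        matches += __builtin_popcountll(result[w]);
    return matches;
}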

This technique of data storage is “compact” as, excluding the list of keys, the storage of the key positions attains the information-theoretical lower bound.  However, this reduced size means that no indexing or sorting is performed, so data must be scanned sequentially rather than being searched with hashing or binary-search-style algorithms.  The technique also has the drawback of needing to manage long bitmaps which grow with the length of the input.  To manage this, we break the bitmaps up into “pages” of a fixed number of values (in this research, each page contains 1 million values).  As pages are independent, they are well suited to the use of multiple processors and even multiple computers, although this has not been explored in this research.

Methods

The implementation of succinct data structures to support genomic analysis breaks down into two distinct categories:

- Determining what data is required to detect each type of mutation

- Determining the optimal data structure for the data to support finding the specific class of mutation.

To limit the scope of this work (so that a SNP caller, SV caller and small indel caller do not all need to be implemented), this paper presents a proof of concept around calling copy number using read-depth (the number of NGS reads covering a particular base of the genome).  A similar process could be used for calling the other classes of mutation.

The experiment will have two phases. 

The first phase will compare calculating copy number directly from the BAM file against first extracting only the required information into a naïve secondary data structure (a text file) and analysing that.

The second phase will compare calculating copy number from the secondary data structure (text file) to calculating copy number from a tertiary data structure (succinct file).  The time to generate and space used to hold both the text file and succinct file will also be calculated. 

All tests will be done on chromosome 22 of a 1000 Genomes sample sequenced using SOLiD sequencing.  The resultant BAM file is approximately 1.2GB in size and contains 9,497,641 reads.

For the purpose of this experiment, the following fields from the BAM file were deemed to be important:

Position         The position in the genome the read maps to
Length           The length of the read mapping to the reference
Quality_Read     The average Phred quality of the bases in the read
Quality_Map      The MAPQ score (quality of the alignment)
Cigar_Match      The number of bases in the read matching the reference
Cigar_Insert     The number of bases inserted relative to the reference
Cigar_Deletion   The number of bases deleted relative to the reference
Cigar_SoftClip   The number of soft-clipped bases (clipped from the ends of the read but retained in the stored sequence)
Cigar_HardClip   The number of hard-clipped bases (clipped from the ends of the read and removed from the stored sequence)

The key here is that quality metrics need to be included within downstream analysis so that bioinformaticians can experiment with different thresholds depending on the profile of the data and nature of the analysis.

Copy Number Calculation Algorithm

Where multiple copies of a segment of DNA exist, we expect that segment to generate more reads than the areas on either side of it when sequenced.  

As this research is focussed on testing data structures to handle the data deluge, we use a naïve algorithm that walks through the genome counting the number of reads that start inside a fixed-size window.  This is clearly not as sophisticated as other algorithms that exist, but it gives us a simple workload that can be used to determine the performance characteristics of the different data structures.  For a more statistically sound approach to copy number detection within NGS data, see Xie and Tammi9.
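
The following C sketch illustrates this windowed counting approach; it is an illustrative standalone example rather than the exact code used in this research, and the function and variable names are chosen for clarity only:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count the number of reads whose start position falls inside each
 * fixed-size window covering [region_start, region_end). */
static void count_read_starts(const uint32_t *positions, size_t n_reads,
                              uint32_t region_start, uint32_t region_end,
                              uint32_t window_size, uint32_t *counts,
                              size_t n_windows) {
    for (size_t w = 0; w < n_windows; w++)
        counts[w] = 0;

    for (size_t i = 0; i < n_reads; i++) {
        if (positions[i] < region_start || positions[i] >= region_end)
            continue;                              /* read starts outside the region */
        counts[(positions[i] - region_start) / window_size]++;
    }
}

int main(void) {
    /* Toy example: a few read start positions in a 50,000bp region */
    const uint32_t positions[] = {16000100, 16004999, 16010000, 16012345, 16049999};
    uint32_t counts[5] = {0};                      /* 5 windows of 10,000bp each */

    count_read_starts(positions, 5, 16000000, 16050000, 10000, counts, 5);
    for (int w = 0; w < 5; w++)
        printf("window %d: %u read starts\n", w, counts[w]);
    return 0;
}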


Results & Discussion

The test asks each implementation to count the number of reads falling in each 10,000bp window across the one million base pairs between positions 16,000,000 and 17,000,000.  A second test was also run over 10 million base pairs.

Space utilization

File        Size (MB)
BAM         1,260
Text file   259
Compact     106

Here we see that the raw BAM is about 1.2GB in size.  By extracting just the fields we are interested in and dumping the results to a text file, we remove about a gigabyte of data that is not required.  The compact structure reduces this further, meaning that less than one tenth of the original data volume now needs to be queried.

Structure generation

There is an obvious overhead associated with additional data structures in that they need to be built from the source data.

File        Time (s)
Text file   374
Compact     153

Surprisingly, the compact data structure is much faster to build than the text file.  This may be an anomaly caused by the fact that the compact version writes directly to its output files using file handles, whereas the text file version writes the read information to stdout, which is then redirected to a text file.

Load performance

Both the text file version and the compact version are designed to be loaded into memory once and queried multiple times with different parameters.  This metric is therefore less important, but it is included for completeness.

File        Time (s)
Text file   51.7
Compact     3.5

Here we see that the compact structure loads very much faster than the raw text file.  This is for two reasons.  Firstly, the text file is more than twice the size, so there is more disk IO to do.  Secondly, the values in the text file must be parsed as they are read, with all the overhead that entails.  This overhead is substantial because the implementation language, Go, is garbage collected: objects are only freed at certain points, so the application can hold far more memory than it strictly requires.  The compact structure is laid out on disk very similarly to how the data is structured in memory, so very little processing is required to get the data into memory ready for querying.

Query performance

The most important metric is the time taken by queries, as the goal is to improve the turnaround time for genomic analysis.

File                    1M (s)    10M (s)    Memory (MB)
BAM direct - no index   399.396   394.000    1.9
BAM direct - index      3.139     98.642     2.1
Text file               0.552     5.880      1,320.0
Compact                 0.132     1.475      123.0

There is an enormous difference between using the indexed BAM file and the non-indexed BAM file, as the index dramatically reduces the amount of data that needs to be read from disk and processed.  It is also worth noting how little memory is used by the raw BAM approach: everything is read directly from disk using memory structures that are re-used for each read, and only the counts for each window need to be retained from read to read.

Both the text file and the compact structure perform dramatically better than even the indexed BAM file.  Over the 10 million base pairs, the compact version actually completed 1,000 independent queries in the 1.475 seconds reported, meaning each query took approximately 1.5ms.  This time could be shortened further through additional optimization.

Of note is the text file version’s memory usage.  As mentioned earlier, this appears to be a garbage collection issue arising from the way the values are parsed from the text file.  As the tests were run on a virtual machine with only 2GB of memory, query performance fluctuated enormously depending on whether the runtime chose to run garbage collection during the query; when garbage collection ran too late, the VM would run out of memory and be forced to page virtual memory, blowing query times out to over 30 seconds.  The memory usage of the compact version was very low due to the efficient memory layout of its data structures and the minimal processing required to bring the data files into memory.

Conclusion

The use of memory efficient data structures has clear advantages over existing formats in terms of space utilization and query performance.  Given how much time can be wasted while bioinformaticians wait for results to be calculated, the use of more efficient data structures has the potential to dramatically improve data analysis workflows.  There is also much work that could be done to optimize the structures used, both in exploiting parallelism and in layering indexes on top of these compact structures.

The source-code for this project is available on request.


References

1. Comprehensive molecular portraits of human breast tumours. Nature. Sep 23 2012;490(7418):61-70.

2. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. Aug 15 2009;25(16):2078-2079.

3. González R, Grabowski S, Mäkinen V, Navarro G. Practical implementation of rank and select queries. Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA'05). 2005:27-38.

4. Conway T, Wazny J, Bromage A, Zobel J, Beresford-Smith B. Gossamer--a resource-efficient de novo assembler. Bioinformatics. Jul 15 2012;28(14):1937-1938.

5. Levinthal D. Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 Processors. 2009.

6. Vigna S. Broadword Implementation of Rank/Select Queries. Lecture Notes in Computer Science. 2008.

7. Intel. Intel® Architecture Instruction Set Extensions Programming Reference. August 2012.

8. Drepper U. What every programmer should know about memory. LWN. 2007.

9. Xie C, Tammi MT. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics. 2009;10:80.