Overcoming the data deluge of Next Generation Sequencing with Compact Data Structures

Overcoming the data deluge of Next Generation Sequencing with Compact Data Structures


Large scale sequencing projects such as the 1000 Genomes project, Cancer Genome Atlas Project (TCGA) and the International Cancer Genome Consortium (ICGC) seek to sequence the genomes of thousands of samples.  A single study into breast cancer by the TCGA itself sequenced the whole exomes of 507 samples1.  As the cost of sequencing a whole genome drops to near $1,000, this same type of research is going to become even more accessible to researchers.  However the data generated by these experiments pose unique problems for researchers due to the volume and the complexity of the required analysis. 

The majority of analysis conducted within re-sequencing projects is in determining how the sequenced sample differs from a reference genome.  As such, the majority of analysis is conducted on the mapped sequence files which contain information on how each read generated by the sequencer maps back to a reference genome. The mapping files enable the detection of different types of mutations, including single nucleotide polymorphisms (SNP’s), Copy Number Variation (CNV) and structural variations.  These files utilize a file format called Sequence Alignment/Map2 (SAM) which provides a text format for the information.  A more space efficient binary representation of this format is also available (BAM).

Due to the often hundreds of millions or billions of reads generated by next generation sequencing machinery for a single sample, these files are usually in order of hundreds of gigabytes in size.  They are therefore extremely difficult to work with without significant computing resources.  Simply copying the data from a single sample can take hours, let alone common operations such as merging, indexing and sorting.  This problem is even more serious in cancer studies where the standard is to sequence not only the genomes of tumour tissue, but also to sequence surrounding healthy tissue so as to provide a baseline for further comparison – effectively doubling the amount of data generated.

Thankfully, sequencing data is static, meaning that once it is written it does not change.  This makes it well suited to compression and data structures that are optimized for space rather than for updates and changes.  Succinct and compact data structures store data at near to the minimum entropy bounds (the theoretical minimum amount of data required to express the information), whilst still allowing for efficient access to the underlying data3.  Succinct data structures have already been utilized within genomics, with a short-read de-novo assembler, Gossamer utilizing them heavily4.  If it is possible to represent the information stored in BAM files using data structures optimized for the types of analysis typically required, then it may be possible to significantly improve the ease and cost at which analysis on genomic and transcriptomic material can be conducted.


Use cases for analysing genomic material

When identifying mutations in next generation sequencing data, the different classes of mutations are identified in different ways and require different subsets of the data stored within the alignment file. Therefore by extracting only the information that is useful for these specific types of analysis, a much smaller amount of data can be analysed, thus enabling much faster and cost effective data analysis.  For example, the individual base quality ratings may be useful for SNP calling, but are much less important for calling structural variations.  Since one of the easiest ways to make a computation faster is to reduce the amount of data that to be processed, extracting only the data required to identify each class of mutation, the computation requirements can be dramatically reduced.

It may be argued that by extracting the information for each type of analysis you are reading the entire file and therefore, since reading the file is a major overhead, you might as well do the analysis as you read.  The problem with this line of argument is that it does not reflect how bioinformatic analysis is conducted.  Most bioinformatic tools have many, many parameters which need to be tuned to the specific data set available.  For example, the read-depth of a sample may affect how many reads are discarded due to being considered duplicates.  The reality is that for most analysis, tools need to be run many times over the same data set before producing results that are useful.  To this end, doing a single read of the original file to produce smaller data structures suited to specific analysis tasks makes sense, as the initial overhead in generating the new structures will more than be made up for over repeated runs of downstream tools.

The data structures utilized to store the relevant information for each use case may also be optimized for query performance and size.  Even though these structures will house only a subset of the data contained in the BAM, the structures may still need to store billions of values.  It is important that these structures support efficient query access to ensure that unnecessary work is not being computed (e.g. checking each value may not be required if the results are sorted where a binary search may be possible).

In order to assess the benefit of smaller data structures, it is important to consider the architecture of modern computers.  Unfortunately modern CPU architectures are very complex, but the following table illustrates the approximate different levels of latency for accessing the different levels of cache within the Intel i7 Xeon range of processors 5.













Main memory







Here we see that the latency of access varies widely and that the memory with the smallest capacity is the fastest by many orders of magnitude.  Therefore the more compact our data structures can be, the better the performance of our analysis as they become more likely to fit into a better performing level of cache.

Succinct, Implicit and Compact data structures

There is a lower bound on the number of bits required to be able to store information.  This is known as the “information-theoretical lower bound”6 and is dependent on the possible number of values a value can take and the number of values that are to be stored.  For example, to encode the outcome of the tosses of a coin the possible number of values to record is two, heads and tails.  Therefore, each toss of the coin can be represented using 1 bit, using “1” to represent a head and “0” to represent a tail.  A sequence of ten head tosses could therefore be represented as 00 10 01 10 01, using ten bits.  This representation is known as “Compact”, meaning that it only uses as much memory as is absolutely required7. It may also be possible to encode each of the coin tosses using 8 bit ASCII characters, ‘H’ and ‘T’, but in doing so we need to use 8 bits to represent each head and 8 more bits to represent each tail.  Thus the sequence of 10 coin tosses now requires 80 bits rather than the 10 bits using the more efficient encoding.

In computer science, this “Compact” representation is denoted by the letter Z, therefore a Compact data structure is one that uses O(Z) bits of space. An implicit data structure uses Z + O(1) bits of space, which is to say that it uses Z bits to store the data plus a constant number of additional bits.  A succinct data structure uses Z + o(Z) bits, meaning that may use the information theoretical minimum number of bits plus some extra data that is less than the information theoretical minimum number of bits.  That is, for a data structure to be succinct, it must not use more than two times the information theoretic minimum number of bits.  Common Succinct data structures are rank / select dictionaries which underpin the Gossamer de-novo assembler4.

Information theory can also be applied to bioinformatic data sets.  Take for example a collection of next generation sequencing reads which have been aligned to a reference genome.  Each one of these reads will be associated with a mapping quality score, MAPQ, which is an 8-bit integer value corresponding to the -10 log10( Pr(mapping position is wrong)).2  While this score may take on any value from 0-255, it is rare for it to actually utilize all of those values.  Therefore we have a mismatch between the number of bits that are being used to store this value and the number of bits required to store this value.

Take for example the following list of ten MAPQ scores:

100, 20, 30, 100, 30, 20, 100, 100, 30, 30

We could encode the list of numbers as 8 bit integers and require 80 bits to store them.  Alternatively we could note that there are only 3 distinct values in the list (20, 30, 100).  Therefore, if we store the distinct values in a separate list (the keys list), we only need to store 3 bytes (24 bits).

Thus we can create a mapping between the actual values (20, 30, 100) and the index of the values.  Since the indexes start at 0 and skips no numbers, we can encode it with fewer bits (in this case two bits).











Using this table, it is possible to link each value in our list to its Key by using its index.  This creates the following table:












Key Position











Since the key positions are limited to being from 0 to 2, we can encode each link to a key in 2 bits.  This means that we can store the 3 keys using 8 bits each (24 bits in total), plus store the links in 2 bits each (2 * 10 = 20 bits in total).  That means that rather than storing the sequence of 10 MAPQ scores in 80 bits, we only need 44 bits to store this sequence of numbers, saving nearly 50% of the space required.

Vector versus Row Orientation

The previous example demonstrates how a series of values (a vector) can be encoded in a space efficient manner.  The data generated from sequence alignments is actually tabular, with the alignment of each read being a row and its quality metrics, position, and insert size all being fields belonging to the row.  One way of looking at this problem is to decompose the structure into being a vector of vectors, or a list of lists.  There are two ways this can be decomposed, such that a table is a list of columns or a list of rows.  The following table demonstrates this concept.

















This data can either be stored in a row oriented manner, where the data is stored:

10001, 100, 25, 10002, 20, 25, 10003, 30, 25, 10004, 20, 25

Or it can be stored in a column oriented manned, where the data is stored:

10001, 10002, 10003, 10004, 100, 20, 30, 20, 25, 25, 25, 25

Both storage layouts require the same amount of space when stored in a compact manner (such as an vector of vectors), but have differing access patterns.  The key issue with the row based approach is that generally the entire set of data needs to be scanned even if you are only looking for values from one or two columns.  Take the example where we want to return all positions where the MAPQ is 30.  In order to find this we need to test the second array position of all three rows, returning the position where the condition is true.  Testing the second field of each row means jumping 2 fields at each row, which while simple from a programming perspective, is less efficient for a computer as it means its caches can be less well utilized as only 1 of the three fields stored in each cache line is actually being used.  Processors nowadays are fast enough that cache misses (also known as a “stall”) represent a major source of increased latency, as the processor waits for data to be retrieved from main memory8. Alternatively, storing the values in a column oriented approach means that testing each value only means jumping to the next record, which is more likely to be in the cache (since data is loaded into the caches from main memory in sequential chunks).  This means that the CPU’s caches will be X time better utilized (where X is the number of columns), making a column oriented approach faster when scanning values sequentially.

A column oriented approach is more complex when finding values across multiple columns.  For example, if we want to find the Position of the reads with a MAPQ score of 20 and a PHRED_AVG of 25.  Finding the values in a column oriented data structure means scanning the MAPQ column and then scanning the PHRED_AVG and taking the intersection of the two columns.  For example, the two columns have the matches at the following rows:

MAPQ:                         (1, 3)
PHRED_AVG:             (0, 1, 2, 3) 

The intersection of the two is (1, 3), and therefore which correspond to positions 10002 and 10004.  Another way to handle this process is not to build a list of the indexes where the values are found but rather to encode the matches in a bitmap, where a 0 represents no match, and a 1 represents a match.  Thus the bitmaps representing the same data as above now look like:

MAPQ:                         (0, 1, 0, 1)
PHRED_AVG:             (1, 1, 1, 1) 

Calculating the row where both the conditions are true now becomes a simple logical AND between the two bitmaps.  On modern processors, bitwise operations are incredibly quick, with modern processors able to calculate bitwise AND over 256 bits in a single clock cycle using Intel’s AVX extensions on “Sandy Bridge” processors7.  With processor clock speeds in the gigahertz, this enable means that theoretically a modern processor could process hundreds of billions of bits per second.  Here the processors ability to move data from main memory into registers is the clear bottleneck.  This illustrates how important predictable memory access is and how important it is to keep the processors various caches full.

Due to the speed at which modern processors can manipulate bitmaps using logical operations, we want to employ them as much as possible.  This leads to a novel structure for storing the vectors for each column.

Recall that we can encode the MAPQ scores in 2 bits given that there are only 3 possible values that are seen in the data.  This means that the following are all equivalent representations:



Lookup (Bits)













Here “Lookup (Bits)” is just a representation of Lookup in bits in binary code.  This also means that the following representations are equivalent:


Row 0

Row 1

Row 2

Row 3












Row 0

Row 1

Row 2

Row 3

First bits





Last bits






Where we break up each of the lookup bit’s into their most significant bits and least significant bits.  So rather than encoding all the data as: [10 00 01 00], it is coded as [1 0 0 0] (First bits) and [0 0 1 0] (Last bits). The reason for this is to enable comparisons to utilize the same fast intersection technique as previously mentioned. To do this, we first must find the corresponding value in our keys list, and use its index to calculate its Lookup Bitmap (e.g. Looking up “30” tells us that the Lookup Bits is [01]).  We then interrogate each bit within the lookup bits, finding that the first bit is a zero.  This means that we need to calculate the bitwise NOT of the corresponding bit vector (First bits in this case).  This gives us all the positions where the First bits of the lookup value is the same as that of the values vector.  We can then repeat the process for the Last bits bit vector (skipping the NOT since the second bit in the lookup key is a “1”) and take the bitwise AND of both vectors.

If for example, we are looking for all the positions where the Value is 30 (which corresponds to 0 1), we can find the bitmap which encodes the positions where this condition is true by taking the logical NOT of the First bits bitmap (since the first bit of the search value is 0) and then to AND it with the values in the Last bits bitmap. This is more rigorously defined in the following piece of C code used in this research:

u64 * tny_page_seek_and(tny_page *page, int keyIndex, u64 * restrict result) { 
    // Define an array of whether the key is a 0 or 1 at each bit position
    int keyBitMask[page->depth];
    // Fill the key bit mask
    for (int j = 0; j < page->depth; j++) {
        keyBitMask[j] = (keyIndex & (1 << j));

    // Each bit level has its own array (page->data) of words corresponding to
    // the bits of the lookup keys at that level.  MSB is the first, LSB the last
    for (int d = 0; d < page->depth; d++) {
        u64 * restrict cur = page->data[d];
        if (keyBitMask[d] == 0) {
            // If it’s a zero, we need to do a NOT before the AND
            // * NOTE that COLPAGE_LENGTH_WORDS is the number of values in this
            // vector (data is arranged in pages of a fixed number of values)
            for (int w = 0; w < COLPAGE_LENGTH_WORDS; w++)
                result[w] &= ~cur[w];
        } else {
            // If it’s a one, we can just do the AND
            for (int w = 0; w < COLPAGE_LENGTH_WORDS; w++)
                result[w] &= cur[w];
    return result;


The result is the bitmap containing the positions where the lookup key matches the stored values.  This bitmap is then ready for combining with the result of lookups on other columns. In this way, searching for values is decomposed into a massive series of logical AND operations which can be completed as quickly as the memory can be transferred to the CPU, which should be in the order of many gigabits per second. 

This technique of data storage is “compact” as, excluding the list of keys, the actual storage of the key positions attains the information theoretical lower bound.  However, this reduced size does mean that no indexing or sorting is done, meaning that data must be scanned sequentially rather than being able to utilize hashing or binary search like algorithms.  Of course this technique also has the draw-back of needing to manage long bitmaps which grow with the length of the input.  In order to manage this, we break up the bitmaps into “pages” of a set number of values (in this research, each page contains 1 million values). As pages are independent, they are very well suited to utilizing multiple processors and even multiple computers – although this has not been explored in this research.


The implementation of succinct data structures to support genomic analysis breaks down into two distinct categories:

·       Determining what data is required to detect each type of mutation

·       Determining the optimal data structure for the data to support finding the specific class of mutation.

In order to limit the scope of this work (so that there is not the requirement to implement a SNP caller, SV caller and small indel caller), this paper will present a proof of concept around calling copy number using read-depth (the number of NGS reads covering a particular base on the genome).  A similar process could be utilized for calling the other classes of mutation.

The experiment will have two phases. 

The first phase will be to compare calculating copy number directly from the bam file to first extracting only the required information into a naïve secondary data structure (text file) and analysing that.

The second phase will compare calculating copy number from the secondary data structure (text file) to calculating copy number from a tertiary data structure (succinct file).  The time to generate and space used to hold both the text file and succinct file will also be calculated. 

All tests will be done on chromosome 22 of a 1000 Genomes samples sequenced using SOLiD sequencing.  The resultant BAM file is approximately 1.2GB in length and contains 9,497,641 reads.

For the purpose of this experiment, the following fields from the BAM file were deemed to be important:

Position       The position in the genome the read maps to

Length         The length of the read mapping to the reference

Quality_Read   The Average phred quality of the bases in the read

Quality_Map    The MAPQ score (quality of the alignment)

Cigar_Match    The number of bases in the read matching the reference

Cigar_Insert   The number of bases that had to be inserted to map the alignment

Cigar_Deletion The number of bases that had to be deleted  to map the alignment

Cigar_SoftClip The number of bases that had to be clipped from the ends of the reads

Cigar_HardClip The number of bases that had to be clipped from the ends of the reads

The key here is that quality metrics need to be included within downstream analysis so that bioinformaticians can experiment with different thresholds depending on the profile of the data and nature of the analysis.

Copy Number Calculation Algorithm

Where multiple copies of a segment of DNA exist, we expect that segment to generate more reads than the areas on either side of it when sequenced.  

As this research is focussed on testing data structures to handle the data deluge, we will use a naïve algorithm that walks through the genome calculating the number of reads that start inside a set window.  This is clearly not as advanced as other algorithms that exist, but will give us a simple workload that can be utilized to determine the performance characteristics of the different data structures.  For a more statistically sound approach to copy number detection within NGS data, please see Xie, Tammy9.

Results & Discussion

The test conducted asks the algorithms to count the number of reads that fall in the million base pairs between positions 16,000,000bp and 17,000,000bp within a window size of 10,000bp.  A second test was also run over 10 million base pairs.

Space utilization





Text file





Here we see that the raw BAM is about 1.2GB in size.  By extracting just the fields we are interested in and dumping the results to a text file, we remove about a gigabyte of unrequired space.  The succinct structure further reduces the amount of data required, meaning now that less than one tenth of the data needs to be queried.

Structure generation

There is an obvious overhead associated with additional data structures in that they need to be built from the source data.


Time (s)

Text file




Surprisingly the succinct data structure is much faster to build than outputting the text file.  This may be an anomaly due to the fact that the succinct version writes directly to its output files using file handles, whereas the Text file version writes the read information to stdout which is then redirected to a text file.

 Load performance

Both the text file version and the succinct version are designed to be loaded into memory once and queried multiple times with different parameters.  Therefore this metric is not so important, but is included for completeness.


Time (s)

Text file




Here we see that the succinct structure is very much faster to load than the raw text file.  The is due to two reasons, firstly that the text file is more than twice the size and therefore there is more disk IO to be done.  Secondly, in reading in the text file, the values in the file must be parsed with the overhead that entails.  This overhead is quite large, as the language used is golang, which is garbage collected meaning that object handles are only freed at certain times, meaning that the application can be holding a lot more memory than it absolutely required.  The Compact structure in the file is laid out very similarly to how the data is structured in memory, so very little processing is required to get the data into memory ready for querying.

Query performance

The most important metric is the time taken by queries, as the goal is to improve the turnaround time for genomic analysis.


1M (s)

10M (s)

Memory (mb)

BAM direct - no index




BAM direct - index




Text file








There is an enormous difference between using the indexed BAM file and the non-indexed bam file, as the index dramatically reduces the amount of data that needs to be read from disk and processed.  It is also worthwhile noting how little memory is used by the raw BAM approach, as everything is read directly from disk using memory structures that are re-used for each read and only the counts for each window need to be retained from read to read.

Both the text file and succinct structure perform dramatically better than even the indexed BAM file.  Over the 10 million base pairs, the succinct data structure version actually completed 1,000 independent queries in the 1.475 seconds, meaning each query only requires 1.4ms.  This time could also be dramatically shortened through further optimization.

Of note is the Text File version’s memory usage.  As mentioned earlier, this appears to be a garbage collection issue due to the way that the values are parsed from the text file.  As the tests were run on a virtual machine with only 2GB of memory, there was a huge amount of fluctuation in the query performance, depending on whether or not the tool decided if it should run garbage collection during the query or not and when it was too late in running garbage collection the VM would run out of memory, requiring the paging of virtual memory and performance would blow out to over 30 seconds.  The memory usage of the succinct version was very low due to the efficient memory layout of its data structures and the lack of processing required to bring the data files into memory. 


The utilization of memory efficient data structures has clear advantages over existing formats in terms of space utilization and querying performance.  Given how much time can be wasted while bioinformatician’s wait for results to be calculated the usage of more efficient data structures has the ability to dramatically improve data analysis workflows.  There is also much work that can be done to optimize the structures used, both in terms of exploiting parallelism as well as in the utilization of indexes on top of these compact structures.

The source-code for this project is available on request.


1. Comprehensive molecular portraits of human breast tumours. Nature. Sep 23 2012;490(7418):61-70.

2. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. Aug 15 2009;25(16):2078-2079.

3. Gonzalez R, Grabowski, S., Makinen, V., and Navarro, G. Practical implementation of rank and select queries. Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA'05). 2005:27-38.

4. Conway T, Wazny J, Bromage A, Zobel J, Beresford-Smith B. Gossamer--a resource-efficient de novo assembler. Bioinformatics. Jul 15 2012;28(14):1937-1938.

5. Levinthal D. Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors 2009.

6. Vigna S. Broadword Implementation of Rank/Select Queries. Lecture Notes in Computer Science. 2008.

7. Intel. Intel® Architecture Instruction Set Extensions Programming Reference. August, 2012 2012.

8. Drepper U. What every programmer should know about memory. LWN. 2007.

9. Xie C, Tammi MT. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics. 2009;10:80.


Big Data: Correlation is no substitute for causation (or a decent model)

I have been working with “Big Data” for the past seven or so years, working with financial markets data, credit card transactions, retail transactions and cancer genomics.  I have never really liked the term “Big Data”, as when I first stumbled across it, I found it to be rather vacuous, but it seems like the last couple of years has seen the term settle on the definition that “Big Data is where the size of that data becomes part of the problem”.

It with great trepidation that I downloaded “Big Data: A revolution that will transform how we live, work and think” to my kindle, but with 106 reviews and a 4.3 star rating I felt I might as well give it a go.

I definitely agree with the book when it says that “At its core, big data is about predictions”.  In each of the areas I have worked in the key motivation for analysis has been to answer questions that can be acted upon.  For the most part, this means making predictions – what customer segment is most likely to respond to direct mail, what drug is a patient’s tumour most likely to respond to, etc. 

However, the book also makes an assertion that I do not agree with, that “In a big-data world, by contrast, we won’t have to be fixated on causality, instead we can discover patterns and correlations in the data that offer us novel and invaluable insights”. Later, an example is given: “If millions of electronic medical records reveal that cancer sufferers who take a certain combination of aspirin and orange juice see their disease go into remission, then the exact cause of their improvement in health may be less important than the fact they lived”.

I could not disagree with this statement more.

Nate Silver’s book, “The Signal and the Noise” and pretty much any statistician worth hiring make it pretty clear that this type of reasoning is wrong and dangerous.  What we are attempting to do when analyzing data is to build a model of how a system works so that we can then make accurate predictions. 

The most important aspect of this is determining what is signal and what is noise.  Just because you see a pattern or correlation in a set of data, does not mean that correlation or pattern will continue – it’s likely just to be an artifact of something else: noise or bias.  Only when you can build and test a model (e.g. cross validation, using an additional independent data set) can you discover if the relationship is likely to be real or not.

The problem is only compounded by big data, as large volumes of data mean more noise.  A phenomenon that only occurs 0.001% of the time will occur 1,000 times if you have a million samples.  This means that the more data you have, the more “patterns” and “correlations” you are likely to discover that are spurious.

This problem has long been known about, and is especially true for medical research involving micro-arrays.  A micro-array is a device with millions of probes that lets you determine if a frequently occurring DNA mutation exists within a particular person.  With 500,000 – 1 million probes, this lets you look for a lot of mutations.  So a study design, known as a GWAS was created which took large groups of people affected by a disease (e.g. 10,000 people) and another large group of controls (e.g. 20,000 people) and looked for what DNA mutation markers were more common in the affected population than the control population in an attempt to discover what areas of the genome may be playing a role in disease susceptibility and prognosis.

The problem is that there are so many probes, that it is trivial to find a probe (out of the say, 500,0000) that is more common in one group than the other.  In fact, you would expect to see 25,000 if your significance threshold is p < 0.05.  Even when you use bonferonni correction, and your p-value for significance drops to 0.0000001, you still end up with a lot of spurious results.  This happened so often, that journals such as Nature Genetics began demanding that all GWAS were required to be replicated with an independent study.  Even with these stringent checks, in areas such as blood pressure, GWAS have provided little insight into the effect of genetics on the disease as summarized by Manolio in his paper “Finding the missing heritability of complex diseases”:

Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases and traits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relatively small increments in risk, and explain only a small proportion of familial clustering, leading many to question how the remaining, ‘missing’ heritability can be explained

 Nature 461, 747-753 (8 October 2009) | doi:10.1038/nature08494; Received 25 June 2009; Accepted 11 September 2009

The failure of looking for simple correlations in medical research, and in particular within cancer research makes the tenet that somehow having more data means that causation can be replaced with correlation laughable.  That said, correlations are often a pretty good place to start looking (which was always the intention of GWAS), but making predictions based solely around correlations is unlikely to get you very far as an analyst.

Any serious work a data scientist, analyst or statistician is likely to be employed to work on is rooted in the real world: “Will this drug work better than this other drug?  Will a set of lost customer return if we offer them a discount?”  Thus a data scientist must be able to explain the answers generated by their model not only in terms of statistics, but also in terms of the real world.  Because no matter how good your model is, you are always going to have someone in a nicer suit than yours ask the question, “Why?” and discussing the intricacies of your Support Vector Machine / Regression / Classification Tree model is not going to impress them.

Simple recipes for predictive analysis of big data in R

For those of you not familiar with R, it is a scripting language designed for statisticians, written by statisticians.  It’s important to remember the fact that it’s primarily written by statisticians because it is not a beautiful language in the way that Python, Haskell or C may be.  R is a language of convenience, a haphazard set of commands that take a while to get your head around as there is no real consistency between libraries (as you will see!). 

It is R’s lack of consistency and the complexity of its functions means that it is easier to be productive in R when you have a few basic recipes for simple tasks, which is the reason I put together this document.  It is easy to get confused as to what functions require a data frame, versus a matrix (and should it be transposed?), a vector versus a factor and so on.

Of course, had this analysis found a sure fire way to make money, the last thing I would have done is to publish this analysis!  So if you are looking for an article on predicting stock price movements, this is only going to frustrate you.  However if you care about learning some basic statistical analysis in R, hopefully you find this interesting.

Predictive Analysis

This post will discuss a few statistical methods for predictive analysis using supervised learning techniques.  The techniques are supervised as they require a “training” data set and a test data set which we use to validate how good our models are.  Further, we are attempting to “classify” data into groups rather than trying to predict individual values.


These techniques are techniques used by Bioinformaticians to predict whether certain gene expressions levels are able to predict what disease state a patient may have.

This gets a bit complicated, so instead I will use a more accessible example, where we try to predict movements in Apple’s share price.  This is probably the worst way these techniques could be used, but it makes for a bit of fun.

Here we want to predict if Apple’s share price is going to go “down a lot” (down by more then -1%), be “unchanged” (up or down 1%), or up a lot (up more than 1%).  By defining these states, we can then use our classifier to predict what the stock price is going to do (initially I had 5 states, including “up a bit” and “down a bit”, but the results were even worse).

Our hypothesis is that there is some combination of stock changes on a Monday that will tell us something about how Apples price will move on Tuesday.  We hope to use statistics and machine learning to find that combination.

So where is the “big data”?  Well in order to build our classifier we are going to use the daily percentage change of every stock on the NASDAQ, which is about 2,000 different stocks.  We are going to do this over 200 days of trading data for our training set, then 20 days for our test set.  This is only 400,000 data points which is pretty small, but I wanted an example where I could distribute the data easily.  Bioinformaticians will use similar techniques where rather than 2,000 stocks we may have 1,000,000 different loci on the genome and rather than 200 days, we may have 20,000 samples (20,000,000,000 data points which I think classifies as “big data”).

You can download the data used in this analysis here [4mb].


Loading new libraries

The strength of R is in its libraries which we will make extensive use of.  To get them all installed, you will need to enter the following commands.


Reading in the data

Firstly download the files here, and extract them to a directory.  We then need to point R to the directory you extracted the files to (“File” > “Change dir”).  Enter the following commands:

X <- as.matrix(read.csv("training.csv"))
X.T <- t(X)
Y <- read.csv("training_changes.csv")[,]
zz <- as.vector(Y)
# Remove the "DOWN_SML" and "UP_SML" categories 
zz[zz=="DOWN_SML"] <- "UNCH"
zz[zz=="UP_SML"] <- "UNCH"
Z <- as.factor(zz)


X now contains our training data, which has a column for each stock and a row for each day.  Each cell is the percentage change in price for that day.  X.T contains the same data, but the rows and columns switched.  Z contains our classifications (“DOWN_BIG”, “UNCH”, “UP_BIG”) for Apple on the next day.  Y contains a higher level of accuracy, discriminating between small moves up and small moves down (less than 1% in each direction), which we will leave out as they ruined the downstream analysis!

Narrowing down the data

We have 2,580 stocks here that might have an impact on our model, but most of them probably won’t.  So what we want to do is to exclude all those that have no significance.  The way we do this is to calculate permutation adjusted p-values for step-down multiple testing procedures.  Essentially we doing an ANOVA regression for each of the different stocks to determine if the stock is significant in determining what Apple’s stock will do the next day. 

Let us assume that a stock is significant in predicting Apple’s stock if it has a p-value of 0.05 or smaller (the probability that the variance between classifications is the same as the variance within classifications).  Because we have 2,580 stocks, we would expect to find 129 stocks that are significant just due to random chance (0.05 * 2,580).  We therefore need to do some kind of “multiple testing” adjustment, which is exactly what the following command does.

maxT <- mt.maxT(X.T, Y, B=1000, test="f")
length( which(maxT[,4] < 0.05 ))
rownames(X.T)[maxT[ which(maxT[,4] < 0.05), ]$index]

This gives us 10 stocks that are significant at the 0.05 level when corrected for multiple testing ("PCO"  "CCBG" "ABAX" "ELRC" "FISI" "INTL" "HFWA" "WINA" "CDZI" "BMTC").  We will limit our downstream analysis to just these stocks, using the following code.

X.T.I <- X[,maxT[ which(maxT[,4] < 0.05), ]$index]
X.I <- t(X.T.I)

K-Nearest Neighbours Clustering

This is a simple classification system that classifies an unknown point by giving the point the same classification as the k nearest “known” points.  We can build this classifier and test it on our training data (self-validation) as follows:

X.KNN <- knn( X.T.I, X.T.I, Z, k=10)
sum(X.KNN == Z) / length(Z)

This gives us a success rate of 0.59, which is optimistic as we are using the same set to train as to validate our model.  We can use cross-validation (where we leave out a chunk of the training set to train, then test it on the remaining set of data) to get a more accurate estimate of success:

X.KNN.cv <- knn.cv(X.T.I,  Z, k=10)

Here we see that using cross validation, our success rate drops to 0.5, which is certainly not at a level you would feel comfortable investing your hard earned cash on.

Of course, the choice of K=10 is completely arbitrary, so we can attempt to find a better value of K by trying a large number.  We get our best value at K=7, but it’s still only 0.31.  However each time you run a cross-validation you will get a different number due to the random selection of which chunk it left out for validation.

Penalized Generalized Linear Models

Linear models are rather simple, fitting a straight line to a set of data based by minimizing the error (distance from the model).  Generalized Linear Models are a generalization of this technique that makes different [thanks LikeLeeHood of /r/statistics] assumptions about the data, enabling it to model non-normal distributions (any distribution that is a member of the exponential family of distributions, which includes binomial and Poisson).  The GLMNET package uses penalized linear models, rather than true linear models.

X.GLM <- cv.glmnet(X.T.I, Z, family="multinomial", type.measure="class")
X.GLM.M <- glmnet(X.T.I, Z, family="multinomial", lambda=X.GLM$lambda.min)
X.GLM.P <- predict(X.GLM.M, type="class", newx=X.T.I)
sum(X.GLM.P == Z) / length(Z)

Here we use cross validation to select the minimum lambda for our model (the model with the least error), then predict using the model.  When we run the model against our training data, we end up with a classification rate of 0.54, only slightly better than KNN.  One must keep in mind that these predictions are made with the same data that generated the model, which makes this poor performance even sadder.

Support Vector Machines

If you think of a linear model as trying to draw a line through a cloud of points, support vector machines are all about drawing lines to separate different clouds of points that belong to our different classifications.  In the linear case, these lines are straight, but we can step it up a notch by transforming these otherwise straight lines so that they form curves.  For where we have more than 2 dimensions, we create hyper-planes to separate our clouds out.  There is some pretty heavy math behind all this, but luckily there is an R library for that!

X.F <- data.frame(Z, X.T.I)
X.SVM.B <- best.svm(as.factor(Z)~., data=X.F, type="C-classification", kernel="radial")
X.SVM.M <- svm(as.factor(Z)~., data=X.F, type="C-classification", kernel="radial", cost=X.SVM.B$cost, gamma=X.SVM.B$gamma)
X.SVM.P <- predict(X.SVM.M, X.F)
sum(X.SVM.P == Z) / length(Z)

This yields a correct classification rate of 0.63, which is our best rate so far.

Classification Trees

Classification trees attempt to build a tree like structure where a threshold is created for the value of each stock whereby if it’s less than that value, we go down the left branch of the tree, and if its greater we go down the right branch.  After traversing down many levels of the tree, we end up at a node which should tell us the classification.  We will draw out the tree with the following commands:

X.CT.M <- rpart(Z~., data=X.F, method="class")
plot(X.CT.M, mar=c(0.1, 0.1, 0.1, 0.1))
text(X.CT.M, use.n=TRUE)

It looks pretty ugly, and is likely to have overfit the data, so we want to prune the tree to remove excess decision points to make our model more general.  This is done with the following command:

X.CT.Pruned <- prune(X.CT.M, cp=0.021739)

Where 0.021739 is the CP value that has the lowest cross validation error (in this case 1.1087).  We then run our classifier against our training set to get a feel for how good it is.

X.CT.P <- predict(X.CT.Pruned, type="class", newdata=X.F)
sum(as.vector(X.CT.P)==Z) / length(Z)

This yields a correct classification rate of 0.69, putting classification trees in the lead.


Up to this point we have only used the training data to validate our models, which is pretty much meaningless.  What we need to do is to test our models on a separate set of data.  Luckily we have a training set of data, which are the classifications over the next 20 days (4 days UP_BIG, 9 days UNCH and 7 days DOWN_BIG).  So let’s see how our models do against actual data.

X.Test <- as.matrix(read.csv("test.csv"))

# Select only the stocks we found significant earlier
X.Test.I <- X.Test[,maxT[ which(maxT[,4] < 0.05), ]$index]
Y.Test <- read.csv("test_changes.csv")[,]
zz <- as.vector(Y.Test)
zz[zz=="UP_SML"] <- "UNCH"
zz[zz=="DOWN_SML"] <- "UNCH"
Z.Test <- as.factor(zz)



X.Test.KNN <- knn( X.T.I, X.Test.I, Z, k=10)
sum(X.Test.Knn == Z.Test) / length(Z.Test)
[1] 0.4


X.Test.GLM.P <- predict(X.GLM.M, type="class", newx=X.Test.I)
sum(X.Test.GLM.P == Z.Test) / length(Z.Test)
[1] 0.45



> X.Test.F <- data.frame(Z.Test, X.Test.I)
> X.Test.SVM.P <- predict(X.SVM.M, X.Test.F)
> sum(X.Test.SVM.P == Z.Test) / length(Z.Test)
[1] 0.45



> X.Test.CT.P <- predict(X.CT.Pruned, type="class", newdata=X.Test.F)
> sum(as.vector(X.Test.CT.P)==Z.Test) / length(Z.Test)
[1] 0.3



So none of our techniques really inspire much confidence, with none of the strategies being correct more than 45% of the time, which is not much better than guessing.

What if we take the strategy that we only buy AAPL stock when our model tells us that the next day is going to be “UP_BIG”?  Well our KNN model suggested this would happen 4 times, with 2 of those times resulting in big losses (AAPL was down by more than 1%), and the other two being “UNCH”.  Classification Trees did a bit better with its 4 predictions, getting 1 correct and 3 “UNCH”.

The best advice came from SVM and GLM, as they predicted “UNCH” for each of the days, effectively telling us not to bother.

Interestingly, both GLM and SVM predicted “UNCH” for all of the test dataset, providing perhaps the best advice – not to invest using this strategy at all!

If you feel I have made any errors in this analysis, please let me know - there has been a lot of copying and pasting between R and Word!

Linking C libraries with go-lang and "-std=c99"

I have been playing around with Google's new Go language ("golang" or "go-lang" if you are trying to Google for it!), and have been most intrigued about its ability to call C libraries.

Unfortunately, I found it very difficult to get up and running with this.  The c code I was attempting to integrate with uses the "-std=c99" flag.  Therefore, my Go file looked like:

package main
import "fmt"

// #cgo CFLAGS: -std=std99 -msse4.1 -I../c/core/
// #cgo LDFLAGS: -L. -lgotest
// #include "go-test.h"
import "C"
func main() { 
   cs := int(C.quick_test())
   fmt.Println("Hello, ", cs )

Which when compiled would give the following output:


 go build tester.go
# command-line-arguments
tester.go:7:1: error: return type defaults to 'int' [-Werror]
tester.go: In function 'typeof':
tester.go:7:20: error: expected declaration specifiers before '*' token
tester.go:8:1: error: parameter '__cgodebug_data' is initialized
tester.go:11:1: error: excess elements in scalar initializer [-Werror]
tester.go:11:1: error: (near initialization for '__cgodebug_data') [-Werror]
tester.go:7:1: error: type of 'quick_test' defaults to 'int' [-Werror]
tester.go:8:11: error: declaration for parameter '__cgodebug_data' but no such parameter
tester.go:11:1: error: expected '{' at end of input
tester.go:11:1: error: control reaches end of non-void function [-Werror=return-type]
cc1: all warnings being treated as errors


Which is to say, nothing useful at all (although hopefully Google will find this, and someone searching for the errors will find this and not have to waste an afternoon debugging)!

Removing the "-std=std99" or replacing it with "-std=gnu99" will make everything work properly again.





Why there are not 86,000 men for 1.3 million women in Australia & why journalists can't be trusted with statistics


So I read the following paragraph in the Murdoch press today:

A church official told the Herald Sun newspaper there has been a massive decline in the number of available men, with statistics claiming there are just 86,000 Mr Rights for 1.3 million women aged between 25 and 34.



So that means that there a woman has only a 5% chance of landing a good man!  Oh my, what a problem!

The article then breaks it down for us, such that:

Demographer Bernard Salt has calculated that of the 1.343 million men in the same age bracket, only 86,000 single, heterosexual, well-off, young men were available after excluding those who were already married (485,000), in a de facto relationship (185,000), gay (7000), a single parent (12,000) or earning less than $60,000 a year.

So let’s think about that for a second.  Who are the 485,000 men married to?  Who are the 185,000 men in de-facto relationships with?   Well since they are not in the 7,000 gays, they are probably in relationships with women. That leaves only 630,000 women who are similarly available.

The 12,000 men who are single parents are also likely to have created that child with a woman and unless its one woman with 12,000 children to different fathers, one may assume that there are a similar, if not more women who are single parents.  So now we are down to 618,000 "eligible" women.

But wait, the article states that there are only 86,000 eligible men.  This seems to be because 568,000 men who are excluded from being "eligible" because they earn less than $60,000.

Aside from the fact that money shouldn't rule out love, this seems to be a rather high standard to be excluding men, especially when it excludes more than half a million otherwise eligible bachelors.  What are these men to do, are they doomed never to find love? 

How many "eligible" women would there be if we also applied the $60,000 cut off?

This number is hard to find, so we'll have to make some guesses.  According to the ABS, the average weekly earnings for a man 25-29 is $1,082 and 30-35 is $1,271 (so let’s split the difference and call that $61,100 per year).  For a woman, the average weekly earnings for a 25-29 year old is $904, and 30-35 year old is $923 which using the same technique works out at $47,502.


It is therefore fairly clear that fewer women are going to meet this wage criteria than men.

Therefore, rather than their being 15 woman for each "eligible" man, it seems fairly clear that there are in fact many more eligible men (using the ridiculous $60,000 criteria for what eligibility) than there are women.











Building data structures that are smaller than an Array AND faster (in C)

This is being discussed over at Hacker News, check it out.

 In developing software for large data sets (billions of records, terabytes in size) the way you store your data in memory is critical – and you want your data in memory if you want to be able to analyse it quickly (e.g. minutes not days).

Any data structure that relies on pointers for each data element quickly becomes unworkable due to the overhead of pointers.  On a 64 bit system, with one pointer for each data element across a billion records you have just blown near 8GB of memory just in pointers.

Thus there is a need for compact data structures that still have fast access characteristics.       

The Challenge

The challenge is to come up with the fastest data structure that meets the following requirements:

  • Use less memory than an array in all circumstances
  • Fast Seek is more important than Fast Access
  • Seek and Access must be better than O(N).

Where Seek and Access are defined as:

Access (int index): Return me the value at the specified index ( like array[idx] ). 

Seek (int value): Return me all the Indexes that match value.

(The actual return type of Seek is a little different, but logically the same.  What we need to return is a bitmap where a bit set to 1 at position X means that value was found at index X.  This allows us to combine results using logical ANDs rather than intersections as detailed here)


Possible solutions

Array (array) 

The first solution which we compare all others too is the humble integer array. 

Memory: 4 bytes for each integer, making space complexity O(N).

Access: Nice and fast, O(1).

Seek: Awful! We need to check every item in the list and is O(N).  Not good enough!


Wavelet trees (wtree)

This is a type of “succinct data structure”, which is pretty incredible.  It actually blew my mind when I first read about it here

The structure computes a list of all the available keys (distinct values, ordered of length K), then stores a bit array that is N bits long (the number of values).  It then needs Log2(K) such bit arrays, which I will refer to as rows.  Essentially the bitmap is a representation of a binary search tree.  

The data structure takes a bit of work to get your head around, so I will explain it with an example.  Say we have the data:

3 2 4 6 2 6 (Keys: 2, 3, 4, 6)

If we store the keys, we can re-write our data as:

            1 0 2 3 0 3

Simply by storing the indexes to our keys array (our first element in our data is 3, which can be found at keys[1]). Easy yeah?

Now we build out first bitmap row.  Looking at our new data set, our minimum is 0, our maximum is 3, so if we do (max – min) / 2 we get 1, which we will use as our middle value.  So to write out first row, we will write a 0 if the value in our index is less than or equal to our middle value (in this case 1) and 1 if it’s greater.

This means our first row is:

            0 0 1 1 0 1

For our next row we group all our ones and zeroes together, such that we have a) and b), where a) is made up of all the zeroes and b) is made up of all the ones.

Group a) is now made up of the following values 1 0 1 which we use to repeat the same process as the first row.  We now use the “middle” value from the last row as our “maximum”, and keep the same minimum.  We then calculate our new “middle” (max – min) / 2 = (0 – 1) / 2 = 0.  This means for Group a) we need to write a zero for each 0 and a one for each 1 (that’s worked out nicely hasn’t it!).  Therefore the bitmap for Group a) is:

            1 0 1

Group b) is made up of the following values: 2 2 3, so we can repeat the same process as above.  Here our max is 3, min is 2 and our middle value is 2.  This means our final bitmap for Group b) is:

            0 0 1

Meaning our representation of the 6 values is now:

            0 0 1 1 0 1 (first row)

            1 0 1 0 0 1 (Group a + Group b)

Awesome, our 6 values are now represented by 4 integers and 12 bits.  You can see how for each additional value we add (provided we don’t add more keys) only adds a pair of bits to our storage size, making this structure very efficient.


Accessing these structures is a little more complex than accessing an array. Essentially we walk down the “rows” checking whether the bit at the corresponding index is set.  The key here is that at each level our index needs to be modified to compensate for the fact we have “grouped” values at each level. 

So if we are looking for the value at index 3 (which we know to be 2), we first inspect the bit at index 3 of row 0.

            0 0 1 1 0 1

Since this is a 1, our index needs to change to be the first index of group B.  This is done by counting the number of Zeroes in the first  complete row, which gives us the point where our group starts.  By then also tracking how many “1” bits are set before my index, I can adjust the index to check for the next row.

The bad news is that this means counting a lot of bits.  The good news is than Nehalem processors have an SSE4.2 instruction that enables us to count 0’s and 1’s VERY quickly (popcnt).  The bad news is that the indexes are not necessarily aligned to words, so additional work is required in aligning bitmaps to word boundaries (which I have not done).


Seek is much more complex, but works by traversing our keys list as if it was a binary search tree.  Here we first take the middle value of our keys, and compare it to our value.  If our value is larger than the middle value, our bitmask tells us which values are also larger. For example you can think of the first row as telling us which numbers are bigger than 1 (where 1 is set), and which are smaller (where 0 is set).

The goal of the seek to return a bitmap the same length as our rows, with a 1 set wherever the value matches the corresponding index.

So if the value we are searching for is larger than the middle value, our temporary mask becomes the first row.  If it was smaller, then we flip the bits on the first row.

We are then directed to either group a) or group b) depending on the result of the first comparison (group a if its smaller, or group b if its larger.   At this point we find out what the “middle” value is for this group and compare it to the value we are seeking.  If our seek value is larger than this value, we can just use bits as is.  If its smaller or equal then we need to invert it.  We then use this as a “group mask” which we use to refine our temporary mask.

The refine operation simply looks at all the 1 bits in our temporary mask, and works out what number bit it is in the mask if there were no 0’s present. (thus in “0 0 1 0 1 0 ”, the index of the first 1 would become 0, and second 1 would be 1).  We then use the “group mask”, which tells us which of those set bits need to be set to one.  Take the following example:

0 0 1 1 0 1      (temp mask)

0 1 0               (group mask – note it is 3 bits long the same as the number of 1’s in temp mask)

0 0 0 1 0 0      (result, the first and third ones are set to 0).

I have implemented the code to do this, but it is far from efficient (if you like making C code fast, this may be a good project for you!).  In my implementation I also store the “rank indexes” (the start of each group) so as to prevent the entire bitmap from needing to be scanned all the time, which makes it faster but use more memory.

Hybrid (tmap)

Given the complexity of Wavelet tree’s and its disappointing performance, I developed a hybrid, which stores values in the same way as a wavelet tree, but does not group values.  This makes accesses much faster as the index doesn’t change.  It also makes seeks much faster than an array, as we just do a logical AND down the rows.  Since our bitmaps are implemented in 64bit integers, and we have only Log2(K) rows, we can test (64 / (Log2(K) -1)) values per operation. 


So how do they perform?




























Access Performance


























Seek Performance



























There is no doubt that on paper, the Wavelet tree structure should perform the best out of all the structures, but as we can see here, its performance is AWFUL.  I have no doubt that there is plenty of room for optimization (particularly in the “refine” and “access” methods, where utilizing the popcnt SSE4.2 instruction would no doubt have a massive impact on performance.

So the challenge remains open.  Can you make a faster, smaller array than those described here?

You can view the source code for this project at https://github.com/siganakis/


Why are column oriented databases so much faster than row oriented databases?

I have been playing around with Hybrid Word Aligned Bitmaps for a few weeks now, and they turn out to be a rather remarkable data structure.  I believe that they are utilized extensively in modern column oriented databases such as Vertica and MonetDB.

Essentially HWABs are a data structure that allows you to represent a sparse bitmap (series of 0's and 1's) really efficiently in memory.  The key trick here is the use of run length encoding to compress the bitmap into fewer bits while still allowing for lightening fast operations.  

They key operation from my perspective is the "AND" operation.  "Why is that useful?" I hear you ask.  Well, imagine we have the following query:

SELECT Animal, Colour, COUNT(*) 
FROM AnimalDetails 
WHERE City='Melbourne'
GROUP BY Animal, Colour

On a table that looks like:

Animal, Colour, City
Dog, White, Melbourne
Cat, Black, Sydney
Cat, White, Melbourne
Dog, Black, Melbourne

In order for me to process the query, I need first to calculate the WHERE condition, which is to say find all the rows that satisfy the condition "City='Melbourne'".  

Then, given those rows (a list of row ids) I need to group the results to Animal and then by Colour.

Imagine for a moment that our data for "Animal", "Colour" and "City" are all stored as simple arrays, with the array index 0 for all arrays creating row 0, so Animal[0] = Dog, Colour[0] = White, City[0] = Melbourne, forming the first row.

That would work well, but its a bit clunky since I have to visit every row in the table to satisfy the query.  What I really want to do is to index the data, so I change the structure of the data, such that each Array is now a map using the distinct values as keys and a list of the row_ids that the value is found in.  Then my "Animal" column can be represented as:

Animal["Dog"] = {1, 4}
Animal["Cat"] = {2, 3}

Indicating that for the column "Animal" the value "Dog" can be found at rows 1, 4.  This is repeated for all the other columns.

This means that satisfying my WHERE condition means I only need to seek to the "Melbourne" value in my City structure, and read off the list of row_ids {1, 3, 4}.  Awesome!

It then also means that for my group by conditions I can do the same thing, but a little differently.  Since I have a list of all the distinct values, I can just walk through the list of distinct values and collect the list of row ids, then Intersect it with the result of the WHERE condition.

So to find the Distinct values of "Animal" that satisfy the condition I end up with:

Seek for "Dog" in Animals, return {1, 4} then INTERSECT {1, 3, 4} = {1, 4}
Seek for "Cat" in Animals, return {2, 3} INSERSECT {1, 3, 4} = {3}

I can then take those results and intersect them with the results of "Colour":

"White" {1, 2} INSERSECT {1, 4} = {1} (White Dog)
"Black" {3, 4} INSERSECT {1, 4} = {4} (Black Dog)

"White" {1, 3} INTERSECT {3} = {3} (White Cat)
"Black" {2, 4} INSERSECT {3} = {} (No Black Cats)

The problem here is all these INTERSECT operations are really expensive, especially when the conditions end up producing a lot of rows.  

On my test data set (a million rows), a similar query to this took about 4000ms, with the vast majority of time being taken up in calculating the intersections.

Now, if you structure the data a little bit differently you can represent the values as bitmaps.

Animal["Dog"] = 1001
Animal["Cat"] = 0110

Here, "Dog" can be found at row id 1 and 4.  Therefore we set the value of the bit at rows 1 and 4 to 1, and 0 elsewhere.

Now my first index seek to satisfy my WHERE Condition is just a matter of looking up the Bitmap for "City = Melbourne", which I see is 1011.

I can then find my bitmaps for "Animal" and do the AND operation on the bitmaps rather than the intersection, giving me:

"Dog" 1001 AND 1011 = 1001 (rows {1,4})
"Cat" 0110 AND 1011 = 0010 (rows {3})

In my example, running the query using Bitmaps now takes in the order of 40ms, or about 100 times faster.

My next post will probably go into a bit more detail about how Hybrid Word Aligned Bitmaps actually work.  But if you are curious have a look at: 


Or a pretty good .Net implementation:







Compiling .Net (C#) on Red Hat Enterprise Linux without Mono installed

Our super computing cluster at the VLSCI is currently running Red Hat Enterprise Linux.  While we can simply send the helpdesk a request to get pretty much any free applications / libraries installed on the cluster, asking them to install Mono seemed embarrassing.  So I set about making a simple .Net assembly work on the cluster.

Mono includes a tool called "mkbundle" which allows for the bundling of .Net assemblies into single executables with no dependencies on mono.

I run Linux Mint for my normal Linux development, which doesn't seem to be fully compatible with RHEL, due to differences in glibc versions and countless other things.  Therefore, running the usual mkbundle run on my executable "mono-mem.exe" (which was compiled by VS2010 on windows):

mkbundle mono-mem.exe  -o mono-mem.broken  --deps --static -z

(where "mono-mem.broken" is our output and "-z" tells it to zip the libraries)

Will not result in an executable that will run on RHEL, instead you'll end up with output like:


[terences@merri mono-mem]$ ./mono-mem.broken 
./mono-mem.broken: /lib64/libc.so.6: version `GLIBC_2.9' not found (required by ./mono-mem.broken)
./mono-mem.broken: /lib64/libc.so.6: version `GLIBC_2.11' not found (required by ./mono-mem.broken)
./mono-mem.broken: /lib64/libc.so.6: version `GLIBC_2.8' not found (required by ./mono-mem.broken)

Which is not helpful at all.

Rather, what we need to do is to use mkbundle to generate a library with all the required mono stuff in it, and a c file containing our entry point to the app such that the c file can be compiled on our server (and thus get its dynamic libraries all sorted out).

Turns out this is a bit tricky, but doable.

Firstly, we use mkbundle to generate the c file and the bundle of libraries with this command:

/mkbundle -c -o mono-mem.c -oo bundle.o --deps mono-mem.exe reference_library.dll
(Where "-c" tells it to build the c file but not the final assembly "-oo" builds the bundle library and "reference_library.dll" is a dll that my app needs).

From this, we copy mono-mem.c to our cluster along with bundle.o.

But, we also need the header files and library files from mono for the RHEL / CentOS platform.  To get access to these, I needed to create a CentOS install with mono and then copied the contents of "lib64" and "include" from the mono directory.  For your convenience I have extracted these and created an archive that you can download here:

With those files extracted to your working directory, compiling is as simple as:

cc mono-mem.c bundles.o /PATH_TO_LIBS/mono-2.10/lib64/libMonoPosixHelper.so /PATH_TO_LIBS/mono-2.10/lib64/libmono-2.0.a -I/PATH_TO_LIBS/mono-2.10/include -L/PATH_TO_LIBS/mono-2.10/lib64 -lpthread -lc -lrt -lm -ldl -lz -omono-mem.o
(Where PATH_TO_LIBS is the path to where you extracted the libraries).

Then it is just a matter of updating your LD_LIBRARY_PATH to include your mono libs:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/vlsci/VR0002/terences/build/mono-2.10/lib64/

And the app should run perfectly. 



Assessing database engines for use in GQL (Genome Query Language)

In an ideal world, it would simply be a case of using SamTools to load a BAM file into a database that we could query directly using SQL.  Unfortunately, I have yet to be able to find an appropriate database that meets my following requirements:

Open Source
In order to run software on a super computing cluster, either the software must be open source, of I have to be able to convince the super computer administrators to buy a license.  Given the arduous process and expense associated with purchasing licenses for a 8,000+ cores, this pretty much rules out anything that is not open source.

The individual nodes as a part of the VLSCI cluster are not especially powerful, its the running many nodes at a time that should enable the fast processing of data.

The shared disk subsystem that the nodes use is extremely slow when compared to accessing memory directly.  In order to process these data volumes quickly to enable near real-time querying and modifications to queries, data must be stored in memory.

Column Oriented
The genomic data from BAM files suits a column oriented database better than a row oriented one due to improved compression, meaning that less memory is required.  Less memory means fewer nodes are required and thus it will be easier to actually run the software (smaller jobs move up the queue faster than larger jobs).

With this in mind, I have looked at the following database systems:

SQLite & Python
My first cut used SQLite, with a python layer on top to enable distributed processing of queries (with inter node communication happening over HTTP).  This has worked rather well on the cluster, as I can just qsub the head-node script, then qsub as many worker nodes as I want.  

A python script reads the BAM file and writes 100 files, with each file containing 1/100th of the total reads.  When worker nodes are created, the head-node instructs each worker to load a file in turn until all the 100 files are evenly distributed over the worker nodes.

The actual query needs to be broken down into two sub queries, with one being run on each of the worker nodes.  The output from each of the worker nodes is then sent to the head node.  The head-node then executes the second query against the combined output, providing the final output.

The primary issue here is the memory utilization.  Against the 300GB of normal/tumour paired data, about 110GB of memory would be required (e.g. 11 nodes with 10GB of memory each), and that is without any indexing at all.  

Due to its simplicity and the reliance on Python (which I am in love with almost as much as I am in love with C#), this should be very easy to implement.

By building a memory efficient data structure (before applying any compression at all), it is possible to reduce the memory requirements by a half, meaning the entire 300GB data set could fit into about 50GB, with the data being separated out into 10MB "pages".   By using MPI, it should then be possible to distribute pages amongst many different nodes.  During querying, each "page" could be queried separately in parallel, with the output from each page then being collected to form the final output.

The key disadvantage of this method is that you lose the SQLite query engine, meaning I have to build my own SQL parser and query engine.  This will be rather time consuming, especially since subqueries and joins are vital parts of this.  

I had hoped that I could replace SQLites internal storage engine ("b-tree") with my own column oriented one, but it appears that this is a lot of work (although the people at Oracle managed to do it with Berkeley DB, so there may be some hope).

Built as a proof of concept for column oriented stores, this looked like it would fit the bill perfectly.  However, it is no longer under active open source development, and has since transformed into the commercial product Vertica.  While the source is available, it requires old versions of Berkley DB and just seemed like a pain to get up and running.

An open source column oriented database system, but unfortunately it does not appear to do a very good job of compressing data, with a 610GB workload resulting in a 650GB database (without indexes).  That pretty much rules this one out for this workload.  Oh, and it has no distributed querying capacity.   

On the same test as MonetDB, the 610GB workload was compressed to 120GB (without indexes), which is very good.  Unfortunately though its query performance seems quite poor according to this article.  It also does not have any distributed querying capacity. 

HyperTable - Nope (see below)
HyperTable looks really promising, and is built up from the ground to be distributed.  Its HQL syntax is very similar to SQL  The only problem is that it runs on top of Hadoop, which makes it somewhat tricky to setup on a shared cluster like we have at the VLSCI.  I may revisit this later down the road.  

Turns out HQL doesn't support aggregate queries, which are central to this project, so it is not suitable. 


Anyone know of any other database's that may be worth looking at?


Genome Query Language

The problem

Whole genome data sets are rather large, with alignment files (.BAM) often topping 150GB or more. In cancer research this is even more difficult, as for each patient we often have a normal sample, a tumor sample and sometimes post-treatment samples.  Multiply this by the number of patients within your sample and all of a sudden we are talking about petabytes of data, all of which needs to be analysed.

This makes analysis really difficult, as many analysis tools will take days if not weeks to run on these types of data sets.  Even more challenging is that many of these tools are fragile, or have many settings that need to be tweaked to produce meaningful results.  

This leads to the situation where you run a tool (wait 5 days), realize you have messed up a setting, run it again (wait 5 days) realize your data has another peculiarity and needs to be further filtered, run it again (wait 5 days), and so on.  Before you know it, a month has passed and you still have meaningless data.

Even more annoying is that my specific project has access to super-computing resources at the VLSCI - but none of the tools are really built to make use of the high powered cluster that we have. 

The solution

A simple, SQL like language that enables you to interrogate genomic information in real time by leveraging high performance computing clusters.  By utilizing a query language, it will be possible change how data is aggregated and filtered in a simple, consistent manner. 

An example of a query

 Say you wanted to compare copy number differences between a Tumor sample and a Normal sample.  You decide the best way to do this is to find all the 2000bp regions where there are more than twice as many reads as there are in the normal sample.  A query may look like:




 TumorBuckets.Reads AS TumorReads,

 NormalBuckets.Reads AS NormalReads




 -- Number of reads in the bucket
COUNT(*) AS Reads,

-- The position, seperated into 2000 nucleotide buckets
Bucket(T.AlignedPosition, 2000) AS BucketPos,

FROM Tumour T

-- Filtering

-- No PCR Duplicates
T.IsDuplicate = 0

-- Must have atleast 49 matches (out of our 50BP reads)
AND T.Matches >= 49

-- Must not have any hard clips
AND T.HardClips = 0

-- The Probability of a wrong alignment must be less than 0.05
AND T.MapQ < 0.05

-- The average probability that a base call is wrong must be less than 0.05
AND T.AvgQuality < 0.05


Bucket(T.AlignedPosition, 2000)

) TumorBuckets




COUNT(*) AS Reads,

Bucket(N.AlignedPosition, 2000) AS BucketPos,

FROM Normal N

-- Filtering

N.IsDuplicate = 0

AND N.Matches >= 49

AND N.HardClips = 0

AND N.MapQ < 0.05

AND N.AvgQuality < 0.05


Bucket(T.AlignedPosition, 2000)

) NormalBuckets
ON TumorBuckets.BucketPos = NormalBuckets.BucketPos

-- We want more than twice as many reads in the tumor bucket than the normal one
AND TumorBuckets.Reads > NormalBuckets.Reads * 2


This way we have a simple definition of what we want to achieve.  There is no need to spend time researching the samtools interface, no hunting for different tools, modifying / recompiling tools.  Just a simple definition of what you want to find.

And, the goal is that such a query should return (on 300GB of data) in under a minute on a 10 node cluster.  That means you can experiment with filtering changes in near real time.

I plan on making this open-source when its ready, and it will be hosted on github.com