Assessing database engines for use in GQL (Genome Query Language)

In an ideal world, it would simply be a case of using SAMtools to load a BAM file into a database that we could query directly using SQL.  Unfortunately, I have yet to find an appropriate database that meets the following requirements:

Open Source
In order to run software on a supercomputing cluster, either the software must be open source, or I have to be able to convince the supercomputer administrators to buy a license.  Given the arduous process and expense associated with purchasing licenses for 8,000+ cores, this pretty much rules out anything that is not open source.

Distributed
The individual nodes in the VLSCI cluster are not especially powerful; it's running many nodes at a time that should enable the fast processing of data.

In-memory
The shared disk subsystem that the nodes use is extremely slow compared to accessing memory directly.  To process these data volumes quickly enough to enable near real-time querying and modification of queries, the data must be stored in memory.

Column Oriented
The genomic data from BAM files suits a column oriented database better than a row oriented one due to improved compression, meaning that less memory is required.  Less memory means fewer nodes are required and thus it will be easier to actually run the software (smaller jobs move up the queue faster than larger jobs).

With this in mind, I have looked at the following database systems:

SQLite & Python
My first cut used SQLite, with a Python layer on top to enable distributed processing of queries (with inter-node communication happening over HTTP).  This has worked rather well on the cluster, as I can just qsub the head-node script, then qsub as many worker nodes as I want.

A Python script reads the BAM file and writes 100 files, with each file containing 1/100th of the total reads.  When worker nodes are created, the head node instructs each worker to load a file in turn until all 100 files are evenly distributed across the worker nodes.
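
As a rough illustration of that splitting step, here is a minimal Python sketch.  It assumes the pysam library is used to read and write the BAM records and that each chunk is itself a small BAM file; the post does not specify either, so treat the details as hypothetical.

import pysam

NUM_CHUNKS = 100

def split_bam(bam_path, out_prefix):
    infile = pysam.AlignmentFile(bam_path, "rb")
    # One output file per chunk, all sharing the original BAM header.
    chunks = [
        pysam.AlignmentFile("%s.%03d.bam" % (out_prefix, i), "wb", template=infile)
        for i in range(NUM_CHUNKS)
    ]
    # Deal the reads out round-robin so each chunk ends up with ~1/100th of them.
    for i, read in enumerate(infile):
        chunks[i % NUM_CHUNKS].write(read)
    for chunk in chunks:
        chunk.close()
    infile.close()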

The actual query needs to be broken down into two sub-queries.  The first is run on each of the worker nodes, and the output from each worker is sent to the head node.  The head node then executes the second query against the combined output, producing the final result.
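
A rough sketch of that two-phase pattern, using Python's built-in sqlite3 module, is below; the table layout and the queries themselves are hypothetical stand-ins for whatever the real sub-queries would be.

import sqlite3

# Phase 1: runs on each worker node against its local shard of the data.
def run_worker_query(shard_db_path):
    conn = sqlite3.connect(shard_db_path)
    rows = conn.execute(
        "SELECT BucketPos, COUNT(*) AS Reads FROM Alignments GROUP BY BucketPos"
    ).fetchall()
    conn.close()
    return rows  # shipped back to the head node over HTTP

# Phase 2: runs on the head node against the combined worker output.
def run_head_query(worker_outputs):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PartialCounts (BucketPos INTEGER, Reads INTEGER)")
    for rows in worker_outputs:
        conn.executemany("INSERT INTO PartialCounts VALUES (?, ?)", rows)
    # The second query merges the per-worker partial counts into the final result.
    return conn.execute(
        "SELECT BucketPos, SUM(Reads) AS Reads FROM PartialCounts GROUP BY BucketPos"
    ).fetchall()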

The primary issue here is the memory utilization.  Against the 300GB of normal/tumour paired data, about 110GB of memory would be required (e.g. 11 nodes with 10GB of memory each), and that is without any indexing at all.  

Due to its simplicity and the reliance on Python (which I am in love with almost as much as I am in love with C#), this should be very easy to implement.

C
By building a memory-efficient data structure (before applying any compression at all), it is possible to roughly halve the memory requirements of the SQLite approach, meaning the entire 300GB data set could fit into about 50GB, with the data separated out into 10MB "pages".   By using MPI, it should then be possible to distribute pages amongst many different nodes.  During querying, each "page" could be queried separately in parallel, with the output from each page then being collected to form the final output.
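
The actual plan here is C with MPI, but purely to illustrate the page-parallel pattern, here is a toy sketch using Python's mpi4py bindings.  The "pages" and the filter are placeholders; in the real design a page would be a ~10MB packed column chunk.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Head node: a toy "page" is just a list of aligned positions here.
    all_pages = [list(range(i * 1000, (i + 1) * 1000)) for i in range(size * 4)]
    # Deal the pages out so that every node receives a batch.
    batches = [all_pages[i::size] for i in range(size)]
else:
    batches = None

# Each node receives its batch of pages and queries them independently...
my_pages = comm.scatter(batches, root=0)
partial = [sum(1 for pos in page if pos % 2000 < 100) for page in my_pages]

# ...and the per-page outputs are gathered on the head node to form the final result.
gathered = comm.gather(partial, root=0)
if rank == 0:
    print("matching positions:", sum(sum(counts) for counts in gathered))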

The key disadvantage of this method is that you lose the SQLite query engine, meaning I have to build my own SQL parser and query engine.  This will be rather time-consuming, especially since subqueries and joins are vital parts of this project.

I had hoped that I could replace SQLite's internal storage engine (the b-tree) with my own column oriented one, but it appears that this is a lot of work (although the people at Oracle managed to do it with Berkeley DB, so there may be some hope).

C-Store
Built as a proof of concept for column oriented stores, this looked like it would fit the bill perfectly.  However, it is no longer under active open source development, and has since transformed into the commercial product Vertica.  While the source is available, it requires old versions of Berkeley DB and just seemed like a pain to get up and running.

MonetDB
An open source column oriented database system, but unfortunately it does not appear to do a very good job of compressing data, with a 610GB workload resulting in a 650GB database (without indexes).  That pretty much rules this one out for this workload.  Oh, and it has no distributed querying capability.

LucidDB
On the same test as MonetDB, the 610GB workload was compressed to 120GB (without indexes), which is very good.  Unfortunately, though, its query performance seems quite poor according to this article.  It also does not have any distributed querying capability.

HyperTable - Nope (see below)
HyperTable looks really promising, and is built from the ground up to be distributed.  Its HQL syntax is very similar to SQL.  The only problem is that it runs on top of Hadoop, which makes it somewhat tricky to set up on a shared cluster like we have at the VLSCI.  I may revisit this later down the road.

Turns out HQL doesn't support aggregate queries, which are central to this project, so it is not suitable. 


Anyone know of any other databases that may be worth looking at?


Genome Query Language

The problem

Whole genome data sets are rather large, with alignment files (.BAM) often topping 150GB. In cancer research this is even more difficult, as for each patient we often have a normal sample, a tumour sample and sometimes post-treatment samples.  Multiply this by the number of patients in your study and all of a sudden we are talking about petabytes of data, all of which needs to be analysed.

This makes analysis really difficult, as many analysis tools will take days if not weeks to run on these types of data sets.  Even more challenging is that many of these tools are fragile, or have many settings that need to be tweaked to produce meaningful results.  

This leads to the situation where you run a tool (wait 5 days), realize you have messed up a setting, run it again (wait 5 days), realize your data has another peculiarity and needs to be further filtered, run it again (wait 5 days), and so on.  Before you know it, a month has passed and you still have meaningless data.

Even more annoying is that my specific project has access to super-computing resources at the VLSCI - but none of the tools are really built to make use of the high powered cluster that we have. 

The solution

A simple, SQL-like language that enables you to interrogate genomic information in real time by leveraging high performance computing clusters.  By utilizing a query language, it will be possible to change how data is aggregated and filtered in a simple, consistent manner.

An example of a query

Say you wanted to compare copy number differences between a tumour sample and a normal sample.  You decide the best way to do this is to find all the 2000bp regions where the tumour sample has more than twice as many reads as the normal sample.  A query may look like:


SELECT
    NormalBuckets.BucketPos,
    TumorBuckets.Reads AS TumorReads,
    NormalBuckets.Reads AS NormalReads
FROM
(
    SELECT
        -- Number of reads in the bucket
        COUNT(*) AS Reads,
        -- The position, separated into 2000 nucleotide buckets
        Bucket(T.AlignedPosition, 2000) AS BucketPos
    FROM Tumour T
    -- Filtering
    WHERE
        -- No PCR duplicates
        T.IsDuplicate = 0
        -- Must have at least 49 matches (out of our 50bp reads)
        AND T.Matches >= 49
        -- Must not have any hard clips
        AND T.HardClips = 0
        -- The probability of a wrong alignment must be less than 0.05
        AND T.MapQ < 0.05
        -- The average probability that a base call is wrong must be less than 0.05
        AND T.AvgQuality < 0.05
    GROUP BY
        Bucket(T.AlignedPosition, 2000)
) TumorBuckets
INNER JOIN
(
    SELECT
        COUNT(*) AS Reads,
        Bucket(N.AlignedPosition, 2000) AS BucketPos
    FROM Normal N
    -- Filtering
    WHERE
        N.IsDuplicate = 0
        AND N.Matches >= 49
        AND N.HardClips = 0
        AND N.MapQ < 0.05
        AND N.AvgQuality < 0.05
    GROUP BY
        Bucket(N.AlignedPosition, 2000)
) NormalBuckets
    ON TumorBuckets.BucketPos = NormalBuckets.BucketPos
    -- We want more than twice as many reads in the tumour bucket as in the normal one
    AND TumorBuckets.Reads > NormalBuckets.Reads * 2
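
For clarity, Bucket() above is not a standard SQL function; the intent, going by the comment in the query, is to round an aligned position down to the start of its 2000-nucleotide window.  A minimal Python sketch of that assumed behaviour:

def bucket(aligned_position, size=2000):
    # Round the position down to the start of its bucket,
    # e.g. bucket(4321) == 4000 and bucket(5999) == 4000.
    return (aligned_position // size) * size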


This way we have a simple definition of what we want to achieve.  There is no need to spend time researching the samtools interface, no hunting for different tools, and no modifying or recompiling of tools.  Just a simple definition of what you want to find.

And the goal is that such a query should return (on 300GB of data) in under a minute on a 10-node cluster.  That means you can experiment with filtering changes in near real time.

I plan on making this open source when it's ready, and it will be hosted on github.com.

Cheers!