In an ideal world, it would simply be a case of using SAMtools to load a BAM file into a database that we could query directly using SQL. Unfortunately, I have yet to find an appropriate database that meets the following requirements:
In order to run software on a supercomputing cluster, either the software must be open source, or I have to be able to convince the supercomputer administrators to buy a license. Given the arduous process and expense associated with purchasing licenses for an 8,000+ core cluster, this pretty much rules out anything that is not open source.
The individual nodes in the VLSCI cluster are not especially powerful; it is running many nodes at a time that should enable fast processing of the data.
The shared disk subsystem that the nodes use is extremely slow when compared to accessing memory directly. In order to process these data volumes quickly to enable near real-time querying and modifications to queries, data must be stored in memory.
The genomic data from BAM files suits a column oriented database better than a row oriented one due to improved compression, meaning that less memory is required. Less memory means fewer nodes are required and thus it will be easier to actually run the software (smaller jobs move up the queue faster than larger jobs).
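To illustrate why a column-oriented layout compresses genomic data so well, here is a minimal sketch of run-length encoding applied to a single column. The chromosome names and example counts are made up for illustration; the point is that a column from coordinate-sorted reads contains long runs of identical values, which collapse to a handful of (value, count) pairs.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column into a list of (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A chromosome column from sorted reads is highly repetitive: 8,000
# values collapse into just two pairs. A row-oriented store cannot
# exploit this, because adjacent values on disk belong to different columns.
chrom_column = ["chr1"] * 5000 + ["chr2"] * 3000
encoded = rle_encode(chrom_column)
print(encoded)  # [('chr1', 5000), ('chr2', 3000)]
```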
With this in mind, I have looked at the following database systems:
SQLite & Python
My first cut used SQLite, with a Python layer on top to enable distributed processing of queries (with inter-node communication happening over HTTP). This has worked rather well on the cluster, as I can just qsub the head-node script, then qsub as many worker nodes as I want.
A Python script reads the BAM file and writes 100 files, with each file containing 1/100th of the total reads. When worker nodes are created, the head node instructs each worker to load a file in turn until all 100 files are evenly distributed over the worker nodes.
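The sharding step can be sketched as follows. This is a simplified stand-in, not the actual script: it distributes plain text lines round-robin across 100 files, where the real version would iterate over BAM records via a parser such as pysam, and the `shard_NNN.txt` naming is invented for the example.

```python
import os
import tempfile

NUM_SHARDS = 100

def shard_reads(reads, out_dir, num_shards=NUM_SHARDS):
    """Distribute reads round-robin across num_shards files,
    so each file ends up with ~1/num_shards of the total."""
    files = [open(os.path.join(out_dir, f"shard_{i:03d}.txt"), "w")
             for i in range(num_shards)]
    try:
        for i, read in enumerate(reads):
            files[i % num_shards].write(read + "\n")
    finally:
        for f in files:
            f.close()

# With 1,000 fake reads, each of the 100 shard files gets 10 of them.
out_dir = tempfile.mkdtemp()
shard_reads((f"read_{i}" for i in range(1000)), out_dir)
```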
The actual query needs to be broken down into two sub-queries. The first is run on each of the worker nodes, and the output from each worker is sent to the head node. The head node then executes the second query against the combined output, producing the final result.
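The two-phase scheme can be sketched with in-memory SQLite databases standing in for the worker nodes (in the real system the partial results travel over HTTP rather than a Python list). The `reads` schema and the per-chromosome count query are invented for the example; the pattern is the interesting part: a partial aggregate on each shard, then a second aggregate over the combined partials.

```python
import sqlite3

def make_worker(rows):
    """Simulate a worker node holding one shard of reads."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE reads (chrom TEXT, pos INTEGER)")
    db.executemany("INSERT INTO reads VALUES (?, ?)", rows)
    return db

# Phase 1: each worker runs the first sub-query over its own shard.
workers = [
    make_worker([("chr1", 100), ("chr1", 200), ("chr2", 300)]),
    make_worker([("chr1", 400), ("chr2", 500)]),
]
partials = []
for db in workers:
    partials.extend(db.execute(
        "SELECT chrom, COUNT(*) FROM reads GROUP BY chrom"))

# Phase 2: the head node runs the second sub-query over the combined output.
head = sqlite3.connect(":memory:")
head.execute("CREATE TABLE partials (chrom TEXT, n INTEGER)")
head.executemany("INSERT INTO partials VALUES (?, ?)", partials)
final = dict(head.execute(
    "SELECT chrom, SUM(n) FROM partials GROUP BY chrom"))
print(final)  # {'chr1': 3, 'chr2': 2}
```

Note that this only works directly for aggregates that decompose cleanly (COUNT, SUM, MIN, MAX); something like a median needs more care in the combining step.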
The primary issue here is the memory utilization. Against the 300GB of normal/tumour paired data, about 110GB of memory would be required (e.g. 11 nodes with 10GB of memory each), and that is without any indexing at all.
Due to its simplicity and the reliance on Python (which I am in love with almost as much as I am in love with C#), this should be very easy to implement.
By building a memory efficient data structure (before applying any compression at all), it is possible to reduce the memory requirements by a half, meaning the entire 300GB data set could fit into about 50GB, with the data being separated out into 10MB "pages". By using MPI, it should then be possible to distribute pages amongst many different nodes. During querying, each "page" could be queried separately in parallel, with the output from each page then being collected to form the final output.
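The page idea can be sketched in miniature. This is a toy version under stated assumptions: a single column held in a compact typed array, split into fixed-size pages, with a tiny page size instead of 10MB, queried sequentially where the real design would spread pages over MPI ranks and query them in parallel. The positions and the range-count query are invented for illustration.

```python
from array import array

PAGE_ROWS = 4  # tiny for illustration; the post proposes ~10MB pages

# A column of read positions stored as a compact typed array rather than
# Python objects -- this is the kind of memory-efficient structure meant above.
positions = array("l", [100, 250, 270, 900, 1200, 1210, 1300, 2000, 2100])

# Split the column into fixed-size pages that can be queried independently.
pages = [positions[i:i + PAGE_ROWS]
         for i in range(0, len(positions), PAGE_ROWS)]

def query_page(page, lo, hi):
    """Per-page sub-query: count positions falling in [lo, hi)."""
    return sum(1 for p in page if lo <= p < hi)

# Each page is queried separately (in parallel, in the real design),
# then the per-page outputs are combined into the final result.
total = sum(query_page(page, 1000, 2050) for page in pages)
print(total)  # 4 -> positions 1200, 1210, 1300, 2000
```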
The key disadvantage of this method is that you lose the SQLite query engine, meaning I have to build my own SQL parser and query engine. This will be rather time consuming, especially since subqueries and joins are vital parts of this.
I had hoped that I could replace SQLite's internal storage engine ("b-tree") with my own column-oriented one, but it appears that this is a lot of work (although the people at Oracle managed to do it with Berkeley DB, so there may be some hope).
C-Store - Nope

Built as a proof of concept for column-oriented stores, C-Store looked like it would fit the bill perfectly. However, it is no longer under active open source development, having since transformed into the commercial product Vertica. While the source is available, it requires old versions of Berkeley DB and just seemed like a pain to get up and running.
MonetDB - Nope

An open source column-oriented database system, but unfortunately MonetDB does not appear to do a very good job of compressing data: a 610GB workload resulted in a 650GB database (without indexes). That pretty much rules it out for this workload. Oh, and it has no distributed querying capability.
On the same test as MonetDB, the 610GB workload was compressed to 120GB (without indexes), which is very good. Unfortunately, its query performance seems quite poor according to this article, and it also lacks any distributed querying capability.
HyperTable - Nope (see below)
HyperTable looks really promising, and is built from the ground up to be distributed. Its HQL syntax is very similar to SQL. The only problem is that it runs on top of Hadoop, which makes it somewhat tricky to set up on a shared cluster like we have at the VLSCI. I may revisit this later down the road.
Turns out HQL doesn't support aggregate queries, which are central to this project, so it is not suitable.
Anyone know of any other databases that may be worth looking at?