Genome Query Language

The problem

Whole genome data sets are rather large, with alignment files (.BAM) often topping 150GB or more. In cancer research this is even more difficult, as for each patient we often have a normal sample, a tumor sample and sometimes post-treatment samples.  Multiply this by the number of patients within your sample and all of a sudden we are talking about petabytes of data, all of which needs to be analysed.

This makes analysis really difficult, as many analysis tools will take days if not weeks to run on these types of data sets.  Even more challenging is that many of these tools are fragile, or have many settings that need to be tweaked to produce meaningful results.  

This leads to the situation where you run a tool (wait 5 days), realize you have messed up a setting, run it again (wait 5 days) realize your data has another peculiarity and needs to be further filtered, run it again (wait 5 days), and so on.  Before you know it, a month has passed and you still have meaningless data.

Even more annoying is that my specific project has access to super-computing resources at the VLSCI - but none of the tools are really built to make use of the high powered cluster that we have. 

The solution

A simple, SQL like language that enables you to interrogate genomic information in real time by leveraging high performance computing clusters.  By utilizing a query language, it will be possible change how data is aggregated and filtered in a simple, consistent manner. 

An example of a query

 Say you wanted to compare copy number differences between a Tumor sample and a Normal sample.  You decide the best way to do this is to find all the 2000bp regions where there are more than twice as many reads as there are in the normal sample.  A query may look like:

 

SELECT 

 NormalBuckets.BucketPos,

 TumorBuckets.Reads AS TumorReads,

 NormalBuckets.Reads AS NormalReads

 FROM

(

SELECT

 -- Number of reads in the bucket
COUNT(*) AS Reads,

-- The position, seperated into 2000 nucleotide buckets
Bucket(T.AlignedPosition, 2000) AS BucketPos,

FROM Tumour T

-- Filtering
WHERE 

-- No PCR Duplicates
T.IsDuplicate = 0

-- Must have atleast 49 matches (out of our 50BP reads)
AND T.Matches >= 49

-- Must not have any hard clips
AND T.HardClips = 0

-- The Probability of a wrong alignment must be less than 0.05
AND T.MapQ < 0.05

-- The average probability that a base call is wrong must be less than 0.05
AND T.AvgQuality < 0.05

GROUP BY 

Bucket(T.AlignedPosition, 2000)

) TumorBuckets

INNER JOIN 

(

SELECT

COUNT(*) AS Reads,

Bucket(N.AlignedPosition, 2000) AS BucketPos,

FROM Normal N

-- Filtering
WHERE 

N.IsDuplicate = 0

AND N.Matches >= 49

AND N.HardClips = 0

AND N.MapQ < 0.05

AND N.AvgQuality < 0.05

GROUP BY 

Bucket(T.AlignedPosition, 2000)

) NormalBuckets
ON TumorBuckets.BucketPos = NormalBuckets.BucketPos

-- We want more than twice as many reads in the tumor bucket than the normal one
AND TumorBuckets.Reads > NormalBuckets.Reads * 2

 

This way we have a simple definition of what we want to achieve.  There is no need to spend time researching the samtools interface, no hunting for different tools, modifying / recompiling tools.  Just a simple definition of what you want to find.

And, the goal is that such a query should return (on 300GB of data) in under a minute on a 10 node cluster.  That means you can experiment with filtering changes in near real time.

I plan on making this open-source when its ready, and it will be hosted on github.com

Cheers!