Login

Simple recipes for predictive analysis of big data in R

For those of you not familiar with R, it is a scripting language designed for statisticians, written by statisticians.  It’s important to remember the fact that it’s primarily written by statisticians because it is not a beautiful language in the way that Python, Haskell or C may be.  R is a language of convenience, a haphazard set of commands that take a while to get your head around as there is no real consistency between libraries (as you will see!). 

It is R’s lack of consistency and the complexity of its functions means that it is easier to be productive in R when you have a few basic recipes for simple tasks, which is the reason I put together this document.  It is easy to get confused as to what functions require a data frame, versus a matrix (and should it be transposed?), a vector versus a factor and so on.

Of course, had this analysis found a sure fire way to make money, the last thing I would have done is to publish this analysis!  So if you are looking for an article on predicting stock price movements, this is only going to frustrate you.  However if you care about learning some basic statistical analysis in R, hopefully you find this interesting.

Predictive Analysis

This post will discuss a few statistical methods for predictive analysis using supervised learning techniques.  The techniques are supervised as they require a “training” data set and a test data set which we use to validate how good our models are.  Further, we are attempting to “classify” data into groups rather than trying to predict individual values.

Data

These techniques are techniques used by Bioinformaticians to predict whether certain gene expressions levels are able to predict what disease state a patient may have.

This gets a bit complicated, so instead I will use a more accessible example, where we try to predict movements in Apple’s share price.  This is probably the worst way these techniques could be used, but it makes for a bit of fun.

Here we want to predict if Apple’s share price is going to go “down a lot” (down by more then -1%), be “unchanged” (up or down 1%), or up a lot (up more than 1%).  By defining these states, we can then use our classifier to predict what the stock price is going to do (initially I had 5 states, including “up a bit” and “down a bit”, but the results were even worse).

Our hypothesis is that there is some combination of stock changes on a Monday that will tell us something about how Apples price will move on Tuesday.  We hope to use statistics and machine learning to find that combination.

So where is the “big data”?  Well in order to build our classifier we are going to use the daily percentage change of every stock on the NASDAQ, which is about 2,000 different stocks.  We are going to do this over 200 days of trading data for our training set, then 20 days for our test set.  This is only 400,000 data points which is pretty small, but I wanted an example where I could distribute the data easily.  Bioinformaticians will use similar techniques where rather than 2,000 stocks we may have 1,000,000 different loci on the genome and rather than 200 days, we may have 20,000 samples (20,000,000,000 data points which I think classifies as “big data”).

You can download the data used in this analysis here [4mb].

 

Loading new libraries

The strength of R is in its libraries which we will make extensive use of.  To get them all installed, you will need to enter the following commands.

source("http://bioconductor.org/biocLite.R")
biocLite("limma")
biocLite("multtest")
library(rpart)
library(multtest);
install.package("glmnet");
library(glmnet);
library(class);
library(rpart);
library(e1071);

Reading in the data

Firstly download the files here, and extract them to a directory.  We then need to point R to the directory you extracted the files to (“File” > “Change dir”).  Enter the following commands:

X <- as.matrix(read.csv("training.csv"))
X.T <- t(X)
Y <- read.csv("training_changes.csv")[,]
zz <- as.vector(Y)
# Remove the "DOWN_SML" and "UP_SML" categories 
zz[zz=="DOWN_SML"] <- "UNCH"
zz[zz=="UP_SML"] <- "UNCH"
Z <- as.factor(zz)

 

X now contains our training data, which has a column for each stock and a row for each day.  Each cell is the percentage change in price for that day.  X.T contains the same data, but the rows and columns switched.  Z contains our classifications (“DOWN_BIG”, “UNCH”, “UP_BIG”) for Apple on the next day.  Y contains a higher level of accuracy, discriminating between small moves up and small moves down (less than 1% in each direction), which we will leave out as they ruined the downstream analysis!

Narrowing down the data

We have 2,580 stocks here that might have an impact on our model, but most of them probably won’t.  So what we want to do is to exclude all those that have no significance.  The way we do this is to calculate permutation adjusted p-values for step-down multiple testing procedures.  Essentially we doing an ANOVA regression for each of the different stocks to determine if the stock is significant in determining what Apple’s stock will do the next day. 

Let us assume that a stock is significant in predicting Apple’s stock if it has a p-value of 0.05 or smaller (the probability that the variance between classifications is the same as the variance within classifications).  Because we have 2,580 stocks, we would expect to find 129 stocks that are significant just due to random chance (0.05 * 2,580).  We therefore need to do some kind of “multiple testing” adjustment, which is exactly what the following command does.

maxT <- mt.maxT(X.T, Y, B=1000, test="f")
length( which(maxT[,4] < 0.05 ))
rownames(X.T)[maxT[ which(maxT[,4] < 0.05), ]$index]


This gives us 10 stocks that are significant at the 0.05 level when corrected for multiple testing ("PCO"  "CCBG" "ABAX" "ELRC" "FISI" "INTL" "HFWA" "WINA" "CDZI" "BMTC").  We will limit our downstream analysis to just these stocks, using the following code.

X.T.I <- X[,maxT[ which(maxT[,4] < 0.05), ]$index]
X.I <- t(X.T.I)


K-Nearest Neighbours Clustering

This is a simple classification system that classifies an unknown point by giving the point the same classification as the k nearest “known” points.  We can build this classifier and test it on our training data (self-validation) as follows:

X.KNN <- knn( X.T.I, X.T.I, Z, k=10)
sum(X.KNN == Z) / length(Z)


This gives us a success rate of 0.59, which is optimistic as we are using the same set to train as to validate our model.  We can use cross-validation (where we leave out a chunk of the training set to train, then test it on the remaining set of data) to get a more accurate estimate of success:

X.KNN.cv <- knn.cv(X.T.I,  Z, k=10)
sum(X.KNN.cv==Z)/length(Z)


Here we see that using cross validation, our success rate drops to 0.5, which is certainly not at a level you would feel comfortable investing your hard earned cash on.

Of course, the choice of K=10 is completely arbitrary, so we can attempt to find a better value of K by trying a large number.  We get our best value at K=7, but it’s still only 0.31.  However each time you run a cross-validation you will get a different number due to the random selection of which chunk it left out for validation.

Penalized Generalized Linear Models

Linear models are rather simple, fitting a straight line to a set of data based by minimizing the error (distance from the model).  Generalized Linear Models are a generalization of this technique that makes different [thanks LikeLeeHood of /r/statistics] assumptions about the data, enabling it to model non-normal distributions (any distribution that is a member of the exponential family of distributions, which includes binomial and Poisson).  The GLMNET package uses penalized linear models, rather than true linear models.

X.GLM <- cv.glmnet(X.T.I, Z, family="multinomial", type.measure="class")
X.GLM.M <- glmnet(X.T.I, Z, family="multinomial", lambda=X.GLM$lambda.min)
X.GLM.P <- predict(X.GLM.M, type="class", newx=X.T.I)
sum(X.GLM.P == Z) / length(Z)


Here we use cross validation to select the minimum lambda for our model (the model with the least error), then predict using the model.  When we run the model against our training data, we end up with a classification rate of 0.54, only slightly better than KNN.  One must keep in mind that these predictions are made with the same data that generated the model, which makes this poor performance even sadder.

Support Vector Machines

If you think of a linear model as trying to draw a line through a cloud of points, support vector machines are all about drawing lines to separate different clouds of points that belong to our different classifications.  In the linear case, these lines are straight, but we can step it up a notch by transforming these otherwise straight lines so that they form curves.  For where we have more than 2 dimensions, we create hyper-planes to separate our clouds out.  There is some pretty heavy math behind all this, but luckily there is an R library for that!

X.F <- data.frame(Z, X.T.I)
X.SVM.B <- best.svm(as.factor(Z)~., data=X.F, type="C-classification", kernel="radial")
X.SVM.M <- svm(as.factor(Z)~., data=X.F, type="C-classification", kernel="radial", cost=X.SVM.B$cost, gamma=X.SVM.B$gamma)
X.SVM.P <- predict(X.SVM.M, X.F)
sum(X.SVM.P == Z) / length(Z)


This yields a correct classification rate of 0.63, which is our best rate so far.

Classification Trees

Classification trees attempt to build a tree like structure where a threshold is created for the value of each stock whereby if it’s less than that value, we go down the left branch of the tree, and if its greater we go down the right branch.  After traversing down many levels of the tree, we end up at a node which should tell us the classification.  We will draw out the tree with the following commands:

X.CT.M <- rpart(Z~., data=X.F, method="class")
plot(X.CT.M, mar=c(0.1, 0.1, 0.1, 0.1))
text(X.CT.M, use.n=TRUE)


It looks pretty ugly, and is likely to have overfit the data, so we want to prune the tree to remove excess decision points to make our model more general.  This is done with the following command:

printcp(X.CT.M)
X.CT.Pruned <- prune(X.CT.M, cp=0.021739)


Where 0.021739 is the CP value that has the lowest cross validation error (in this case 1.1087).  We then run our classifier against our training set to get a feel for how good it is.

X.CT.P <- predict(X.CT.Pruned, type="class", newdata=X.F)
sum(as.vector(X.CT.P)==Z) / length(Z)


This yields a correct classification rate of 0.69, putting classification trees in the lead.

Validation

Up to this point we have only used the training data to validate our models, which is pretty much meaningless.  What we need to do is to test our models on a separate set of data.  Luckily we have a training set of data, which are the classifications over the next 20 days (4 days UP_BIG, 9 days UNCH and 7 days DOWN_BIG).  So let’s see how our models do against actual data.

X.Test <- as.matrix(read.csv("test.csv"))

# Select only the stocks we found significant earlier
X.Test.I <- X.Test[,maxT[ which(maxT[,4] < 0.05), ]$index]
Y.Test <- read.csv("test_changes.csv")[,]
zz <- as.vector(Y.Test)
zz[zz=="UP_SML"] <- "UNCH"
zz[zz=="DOWN_SML"] <- "UNCH"
Z.Test <- as.factor(zz)

 

KNN

X.Test.KNN <- knn( X.T.I, X.Test.I, Z, k=10)
sum(X.Test.Knn == Z.Test) / length(Z.Test)
[1] 0.4


GLM

X.Test.GLM.P <- predict(X.GLM.M, type="class", newx=X.Test.I)
sum(X.Test.GLM.P == Z.Test) / length(Z.Test)
[1] 0.45

 

SVM

> X.Test.F <- data.frame(Z.Test, X.Test.I)
> X.Test.SVM.P <- predict(X.SVM.M, X.Test.F)
> sum(X.Test.SVM.P == Z.Test) / length(Z.Test)
[1] 0.45

 

CT

> X.Test.CT.P <- predict(X.CT.Pruned, type="class", newdata=X.Test.F)
> sum(as.vector(X.Test.CT.P)==Z.Test) / length(Z.Test)
[1] 0.3

 

Conclusion

So none of our techniques really inspire much confidence, with none of the strategies being correct more than 45% of the time, which is not much better than guessing.

What if we take the strategy that we only buy AAPL stock when our model tells us that the next day is going to be “UP_BIG”?  Well our KNN model suggested this would happen 4 times, with 2 of those times resulting in big losses (AAPL was down by more than 1%), and the other two being “UNCH”.  Classification Trees did a bit better with its 4 predictions, getting 1 correct and 3 “UNCH”.

The best advice came from SVM and GLM, as they predicted “UNCH” for each of the days, effectively telling us not to bother.

Interestingly, both GLM and SVM predicted “UNCH” for all of the test dataset, providing perhaps the best advice – not to invest using this strategy at all!

If you feel I have made any errors in this analysis, please let me know - there has been a lot of copying and pasting between R and Word!

Linking C libraries with go-lang and "-std=c99"

I have been playing around with Google's new Go language ("golang" or "go-lang" if you are trying to Google for it!), and have been most intrigued about its ability to call C libraries.

Unfortunately, I found it very difficult to get up and running with this.  The c code I was attempting to integrate with uses the "-std=c99" flag.  Therefore, my Go file looked like:

package main
import "fmt"

// #cgo CFLAGS: -std=std99 -msse4.1 -I../c/core/
// #cgo LDFLAGS: -L. -lgotest
// #include "go-test.h"
import "C"
func main() { 
   cs := int(C.quick_test())
   fmt.Println("Hello, ", cs )
   return
}

Which when compiled would give the following output:

 

 go build tester.go
# command-line-arguments
tester.go:7:1: error: return type defaults to 'int' [-Werror]
tester.go: In function 'typeof':
tester.go:7:20: error: expected declaration specifiers before '*' token
tester.go:8:1: error: parameter '__cgodebug_data' is initialized
tester.go:11:1: error: excess elements in scalar initializer [-Werror]
tester.go:11:1: error: (near initialization for '__cgodebug_data') [-Werror]
tester.go:7:1: error: type of 'quick_test' defaults to 'int' [-Werror]
tester.go:8:11: error: declaration for parameter '__cgodebug_data' but no such parameter
tester.go:11:1: error: expected '{' at end of input
tester.go:11:1: error: control reaches end of non-void function [-Werror=return-type]
cc1: all warnings being treated as errors

 

Which is to say, nothing useful at all (although hopefully Google will find this, and someone searching for the errors will find this and not have to waste an afternoon debugging)!

Removing the "-std=std99" or replacing it with "-std=gnu99" will make everything work properly again.

 

 

 

 

Why there are not 86,000 men for 1.3 million women in Australia & why journalists can't be trusted with statistics

 

So I read the following paragraph in the Murdoch press today:

A church official told the Herald Sun newspaper there has been a massive decline in the number of available men, with statistics claiming there are just 86,000 Mr Rights for 1.3 million women aged between 25 and 34.

http://www.news.com.au/national/grab-a-man-or-miss-out-church-warns-girls/story-e6frfkvr-1226348223822#ixzz1u9VJ3rW1

http://www.heraldsun.com.au/news/more-news/catholic-church-says-single-women-are-being-too-fussy/story-fn7x8me2-1226348238486

So that means that there a woman has only a 5% chance of landing a good man!  Oh my, what a problem!

The article then breaks it down for us, such that:

Demographer Bernard Salt has calculated that of the 1.343 million men in the same age bracket, only 86,000 single, heterosexual, well-off, young men were available after excluding those who were already married (485,000), in a de facto relationship (185,000), gay (7000), a single parent (12,000) or earning less than $60,000 a year.

So let’s think about that for a second.  Who are the 485,000 men married to?  Who are the 185,000 men in de-facto relationships with?   Well since they are not in the 7,000 gays, they are probably in relationships with women. That leaves only 630,000 women who are similarly available.

The 12,000 men who are single parents are also likely to have created that child with a woman and unless its one woman with 12,000 children to different fathers, one may assume that there are a similar, if not more women who are single parents.  So now we are down to 618,000 "eligible" women.

But wait, the article states that there are only 86,000 eligible men.  This seems to be because 568,000 men who are excluded from being "eligible" because they earn less than $60,000.

Aside from the fact that money shouldn't rule out love, this seems to be a rather high standard to be excluding men, especially when it excludes more than half a million otherwise eligible bachelors.  What are these men to do, are they doomed never to find love? 

How many "eligible" women would there be if we also applied the $60,000 cut off?

This number is hard to find, so we'll have to make some guesses.  According to the ABS, the average weekly earnings for a man 25-29 is $1,082 and 30-35 is $1,271 (so let’s split the difference and call that $61,100 per year).  For a woman, the average weekly earnings for a 25-29 year old is $904, and 30-35 year old is $923 which using the same technique works out at $47,502.

(http://www.eowa.gov.au/Information_Centres/Resource_Centre/Statistics/EPD%201...)

It is therefore fairly clear that fewer women are going to meet this wage criteria than men.

Therefore, rather than their being 15 woman for each "eligible" man, it seems fairly clear that there are in fact many more eligible men (using the ridiculous $60,000 criteria for what eligibility) than there are women.

 

 

 

 

 

 

 

 

 

 

Building data structures that are smaller than an Array AND faster (in C)

This is being discussed over at Hacker News, check it out.

 In developing software for large data sets (billions of records, terabytes in size) the way you store your data in memory is critical – and you want your data in memory if you want to be able to analyse it quickly (e.g. minutes not days).

Any data structure that relies on pointers for each data element quickly becomes unworkable due to the overhead of pointers.  On a 64 bit system, with one pointer for each data element across a billion records you have just blown near 8GB of memory just in pointers.

Thus there is a need for compact data structures that still have fast access characteristics.       

The Challenge

The challenge is to come up with the fastest data structure that meets the following requirements:

  • Use less memory than an array in all circumstances
  • Fast Seek is more important than Fast Access
  • Seek and Access must be better than O(N).


Where Seek and Access are defined as:

Access (int index): Return me the value at the specified index ( like array[idx] ). 

Seek (int value): Return me all the Indexes that match value.

(The actual return type of Seek is a little different, but logically the same.  What we need to return is a bitmap where a bit set to 1 at position X means that value was found at index X.  This allows us to combine results using logical ANDs rather than intersections as detailed here)

 

Possible solutions

Array (array) 

The first solution which we compare all others too is the humble integer array. 

Memory: 4 bytes for each integer, making space complexity O(N).

Access: Nice and fast, O(1).

Seek: Awful! We need to check every item in the list and is O(N).  Not good enough!

 

Wavelet trees (wtree)

This is a type of “succinct data structure”, which is pretty incredible.  It actually blew my mind when I first read about it here

The structure computes a list of all the available keys (distinct values, ordered of length K), then stores a bit array that is N bits long (the number of values).  It then needs Log2(K) such bit arrays, which I will refer to as rows.  Essentially the bitmap is a representation of a binary search tree.  

The data structure takes a bit of work to get your head around, so I will explain it with an example.  Say we have the data:

3 2 4 6 2 6 (Keys: 2, 3, 4, 6)

If we store the keys, we can re-write our data as:

            1 0 2 3 0 3

Simply by storing the indexes to our keys array (our first element in our data is 3, which can be found at keys[1]). Easy yeah?

Now we build out first bitmap row.  Looking at our new data set, our minimum is 0, our maximum is 3, so if we do (max – min) / 2 we get 1, which we will use as our middle value.  So to write out first row, we will write a 0 if the value in our index is less than or equal to our middle value (in this case 1) and 1 if it’s greater.

This means our first row is:

            0 0 1 1 0 1

For our next row we group all our ones and zeroes together, such that we have a) and b), where a) is made up of all the zeroes and b) is made up of all the ones.

Group a) is now made up of the following values 1 0 1 which we use to repeat the same process as the first row.  We now use the “middle” value from the last row as our “maximum”, and keep the same minimum.  We then calculate our new “middle” (max – min) / 2 = (0 – 1) / 2 = 0.  This means for Group a) we need to write a zero for each 0 and a one for each 1 (that’s worked out nicely hasn’t it!).  Therefore the bitmap for Group a) is:

            1 0 1

Group b) is made up of the following values: 2 2 3, so we can repeat the same process as above.  Here our max is 3, min is 2 and our middle value is 2.  This means our final bitmap for Group b) is:

            0 0 1

Meaning our representation of the 6 values is now:

            0 0 1 1 0 1 (first row)

            1 0 1 0 0 1 (Group a + Group b)

Awesome, our 6 values are now represented by 4 integers and 12 bits.  You can see how for each additional value we add (provided we don’t add more keys) only adds a pair of bits to our storage size, making this structure very efficient.

Access(index) 

Accessing these structures is a little more complex than accessing an array. Essentially we walk down the “rows” checking whether the bit at the corresponding index is set.  The key here is that at each level our index needs to be modified to compensate for the fact we have “grouped” values at each level. 

So if we are looking for the value at index 3 (which we know to be 2), we first inspect the bit at index 3 of row 0.

            0 0 1 1 0 1

Since this is a 1, our index needs to change to be the first index of group B.  This is done by counting the number of Zeroes in the first  complete row, which gives us the point where our group starts.  By then also tracking how many “1” bits are set before my index, I can adjust the index to check for the next row.

The bad news is that this means counting a lot of bits.  The good news is than Nehalem processors have an SSE4.2 instruction that enables us to count 0’s and 1’s VERY quickly (popcnt).  The bad news is that the indexes are not necessarily aligned to words, so additional work is required in aligning bitmaps to word boundaries (which I have not done).

Seek(value) 

Seek is much more complex, but works by traversing our keys list as if it was a binary search tree.  Here we first take the middle value of our keys, and compare it to our value.  If our value is larger than the middle value, our bitmask tells us which values are also larger. For example you can think of the first row as telling us which numbers are bigger than 1 (where 1 is set), and which are smaller (where 0 is set).

The goal of the seek to return a bitmap the same length as our rows, with a 1 set wherever the value matches the corresponding index.

So if the value we are searching for is larger than the middle value, our temporary mask becomes the first row.  If it was smaller, then we flip the bits on the first row.

We are then directed to either group a) or group b) depending on the result of the first comparison (group a if its smaller, or group b if its larger.   At this point we find out what the “middle” value is for this group and compare it to the value we are seeking.  If our seek value is larger than this value, we can just use bits as is.  If its smaller or equal then we need to invert it.  We then use this as a “group mask” which we use to refine our temporary mask.

The refine operation simply looks at all the 1 bits in our temporary mask, and works out what number bit it is in the mask if there were no 0’s present. (thus in “0 0 1 0 1 0 ”, the index of the first 1 would become 0, and second 1 would be 1).  We then use the “group mask”, which tells us which of those set bits need to be set to one.  Take the following example:

0 0 1 1 0 1      (temp mask)

0 1 0               (group mask – note it is 3 bits long the same as the number of 1’s in temp mask)

0 0 0 1 0 0      (result, the first and third ones are set to 0).

I have implemented the code to do this, but it is far from efficient (if you like making C code fast, this may be a good project for you!).  In my implementation I also store the “rank indexes” (the start of each group) so as to prevent the entire bitmap from needing to be scanned all the time, which makes it faster but use more memory.

Hybrid (tmap)

Given the complexity of Wavelet tree’s and its disappointing performance, I developed a hybrid, which stores values in the same way as a wavelet tree, but does not group values.  This makes accesses much faster as the index doesn’t change.  It also makes seeks much faster than an array, as we just do a logical AND down the rows.  Since our bitmaps are implemented in 64bit integers, and we have only Log2(K) rows, we can test (64 / (Log2(K) -1)) values per operation. 

Performance

So how do they perform?

 

Size

Values

array

wtree

tmap

      1,000

          4,016

      9,016

          3,956

     10,000

         40,016

     23,180

         16,680

    100,000

        400,016

    135,560

        129,160

  1,000,000

      4,000,016

  1,260,700

      1,254,200

 10,000,000

     40,000,016

 12,510,700

     12,504,200

 

Access Performance

Values

array

wtree

tmap

      1,000

              1

      3,956

              2

     10,000

              1

     16,680

              2

    100,000

              1

    129,160

             10

  1,000,000

              1

  1,254,200

            100

 10,000,000

             30

 12,504,200

          1,160

 

Seek Performance

Values

array

wtree

tmap

      1,000

              2

         20

              1

     10,000

              2

        260

              1

    100,000

            100

      2,630

             40

  1,000,000

            940

     26,430

            460

 10,000,000

         10,520

    278,500

          8,420

 

 Conclusion

There is no doubt that on paper, the Wavelet tree structure should perform the best out of all the structures, but as we can see here, its performance is AWFUL.  I have no doubt that there is plenty of room for optimization (particularly in the “refine” and “access” methods, where utilizing the popcnt SSE4.2 instruction would no doubt have a massive impact on performance.

So the challenge remains open.  Can you make a faster, smaller array than those described here?

You can view the source code for this project at https://github.com/siganakis/

 

Why are column oriented databases so much faster than row oriented databases?

I have been playing around with Hybrid Word Aligned Bitmaps for a few weeks now, and they turn out to be a rather remarkable data structure.  I believe that they are utilized extensively in modern column oriented databases such as Vertica and MonetDB.

Essentially HWABs are a data structure that allows you to represent a sparse bitmap (series of 0's and 1's) really efficiently in memory.  The key trick here is the use of run length encoding to compress the bitmap into fewer bits while still allowing for lightening fast operations.  

They key operation from my perspective is the "AND" operation.  "Why is that useful?" I hear you ask.  Well, imagine we have the following query:

SELECT Animal, Colour, COUNT(*) 
FROM AnimalDetails 
WHERE City='Melbourne'
GROUP BY Animal, Colour

On a table that looks like:

Animal, Colour, City
Dog, White, Melbourne
Cat, Black, Sydney
Cat, White, Melbourne
Dog, Black, Melbourne

In order for me to process the query, I need first to calculate the WHERE condition, which is to say find all the rows that satisfy the condition "City='Melbourne'".  

Then, given those rows (a list of row ids) I need to group the results to Animal and then by Colour.

Imagine for a moment that our data for "Animal", "Colour" and "City" are all stored as simple arrays, with the array index 0 for all arrays creating row 0, so Animal[0] = Dog, Colour[0] = White, City[0] = Melbourne, forming the first row.

That would work well, but its a bit clunky since I have to visit every row in the table to satisfy the query.  What I really want to do is to index the data, so I change the structure of the data, such that each Array is now a map using the distinct values as keys and a list of the row_ids that the value is found in.  Then my "Animal" column can be represented as:

Animal["Dog"] = {1, 4}
Animal["Cat"] = {2, 3}

Indicating that for the column "Animal" the value "Dog" can be found at rows 1, 4.  This is repeated for all the other columns.

This means that satisfying my WHERE condition means I only need to seek to the "Melbourne" value in my City structure, and read off the list of row_ids {1, 3, 4}.  Awesome!

It then also means that for my group by conditions I can do the same thing, but a little differently.  Since I have a list of all the distinct values, I can just walk through the list of distinct values and collect the list of row ids, then Intersect it with the result of the WHERE condition.

So to find the Distinct values of "Animal" that satisfy the condition I end up with:

Seek for "Dog" in Animals, return {1, 4} then INTERSECT {1, 3, 4} = {1, 4}
Seek for "Cat" in Animals, return {2, 3} INSERSECT {1, 3, 4} = {3}

I can then take those results and intersect them with the results of "Colour":

Dogs
"White" {1, 2} INSERSECT {1, 4} = {1} (White Dog)
"Black" {3, 4} INSERSECT {1, 4} = {4} (Black Dog)

Cats
"White" {1, 3} INTERSECT {3} = {3} (White Cat)
"Black" {2, 4} INSERSECT {3} = {} (No Black Cats)

The problem here is all these INTERSECT operations are really expensive, especially when the conditions end up producing a lot of rows.  

On my test data set (a million rows), a similar query to this took about 4000ms, with the vast majority of time being taken up in calculating the intersections.

Now, if you structure the data a little bit differently you can represent the values as bitmaps.

Animal["Dog"] = 1001
Animal["Cat"] = 0110

Here, "Dog" can be found at row id 1 and 4.  Therefore we set the value of the bit at rows 1 and 4 to 1, and 0 elsewhere.

Now my first index seek to satisfy my WHERE Condition is just a matter of looking up the Bitmap for "City = Melbourne", which I see is 1011.

I can then find my bitmaps for "Animal" and do the AND operation on the bitmaps rather than the intersection, giving me:

"Dog" 1001 AND 1011 = 1001 (rows {1,4})
"Cat" 0110 AND 1011 = 0010 (rows {3})

In my example, running the query using Bitmaps now takes in the order of 40ms, or about 100 times faster.

My next post will probably go into a bit more detail about how Hybrid Word Aligned Bitmaps actually work.  But if you are curious have a look at: 

https://sdm.lbl.gov/fastbit/

Or a pretty good .Net implementation:

http://softwarebotanysun.codeplex.com/

 

Cheers!

 

 

 

Compiling .Net (C#) on Red Hat Enterprise Linux without Mono installed

Our super computing cluster at the VLSCI is currently running Red Hat Enterprise Linux.  While we can simply send the helpdesk a request to get pretty much any free applications / libraries installed on the cluster, asking them to install Mono seemed embarrassing.  So I set about making a simple .Net assembly work on the cluster.

Mono includes a tool called "mkbundle" which allows for the bundling of .Net assemblies into single executables with no dependencies on mono.

I run Linux Mint for my normal Linux development, which doesn't seem to be fully compatible with RHEL, due to differences in glibc versions and countless other things.  Therefore, running the usual mkbundle run on my executable "mono-mem.exe" (which was compiled by VS2010 on windows):

mkbundle mono-mem.exe  -o mono-mem.broken  --deps --static -z

(where "mono-mem.broken" is our output and "-z" tells it to zip the libraries)

Will not result in an executable that will run on RHEL, instead you'll end up with output like:

 

[terences@merri mono-mem]$ ./mono-mem.broken 
./mono-mem.broken: /lib64/libc.so.6: version `GLIBC_2.9' not found (required by ./mono-mem.broken)
./mono-mem.broken: /lib64/libc.so.6: version `GLIBC_2.11' not found (required by ./mono-mem.broken)
./mono-mem.broken: /lib64/libc.so.6: version `GLIBC_2.8' not found (required by ./mono-mem.broken)

Which is not helpful at all.

Rather, what we need to do is to use mkbundle to generate a library with all the required mono stuff in it, and a c file containing our entry point to the app such that the c file can be compiled on our server (and thus get its dynamic libraries all sorted out).

Turns out this is a bit tricky, but doable.

Firstly, we use mkbundle to generate the c file and the bundle of libraries with this command:

/mkbundle -c -o mono-mem.c -oo bundle.o --deps mono-mem.exe reference_library.dll
(Where "-c" tells it to build the c file but not the final assembly "-oo" builds the bundle library and "reference_library.dll" is a dll that my app needs).

From this, we copy mono-mem.c to our cluster along with bundle.o.

But, we also need the header files and library files from mono for the RHEL / CentOS platform.  To get access to these, I needed to create a CentOS install with mono and then copied the contents of "lib64" and "include" from the mono directory.  For your convenience I have extracted these and created an archive that you can download here:

With those files extracted to your working directory, compiling is as simple as:

cc mono-mem.c bundles.o /PATH_TO_LIBS/mono-2.10/lib64/libMonoPosixHelper.so /PATH_TO_LIBS/mono-2.10/lib64/libmono-2.0.a -I/PATH_TO_LIBS/mono-2.10/include -L/PATH_TO_LIBS/mono-2.10/lib64 -lpthread -lc -lrt -lm -ldl -lz -omono-mem.o
(Where PATH_TO_LIBS is the path to where you extracted the libraries).

Then it is just a matter of updating your LD_LIBRARY_PATH to include your mono libs:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/vlsci/VR0002/terences/build/mono-2.10/lib64/

And the app should run perfectly. 

 

 

Assessing database engines for use in GQL (Genome Query Language)

In an ideal world, it would simply be a case of using SamTools to load a BAM file into a database that we could query directly using SQL.  Unfortunately, I have yet to be able to find an appropriate database that meets my following requirements:

Open Source
In order to run software on a super computing cluster, either the software must be open source, of I have to be able to convince the super computer administrators to buy a license.  Given the arduous process and expense associated with purchasing licenses for a 8,000+ cores, this pretty much rules out anything that is not open source.

Distributed
The individual nodes as a part of the VLSCI cluster are not especially powerful, its the running many nodes at a time that should enable the fast processing of data.

In-memory
The shared disk subsystem that the nodes use is extremely slow when compared to accessing memory directly.  In order to process these data volumes quickly to enable near real-time querying and modifications to queries, data must be stored in memory.

Column Oriented
The genomic data from BAM files suits a column oriented database better than a row oriented one due to improved compression, meaning that less memory is required.  Less memory means fewer nodes are required and thus it will be easier to actually run the software (smaller jobs move up the queue faster than larger jobs).

With this in mind, I have looked at the following database systems:

SQLite & Python
My first cut used SQLite, with a python layer on top to enable distributed processing of queries (with inter node communication happening over HTTP).  This has worked rather well on the cluster, as I can just qsub the head-node script, then qsub as many worker nodes as I want.  

A python script reads the BAM file and writes 100 files, with each file containing 1/100th of the total reads.  When worker nodes are created, the head-node instructs each worker to load a file in turn until all the 100 files are evenly distributed over the worker nodes.

The actual query needs to be broken down into two sub queries, with one being run on each of the worker nodes.  The output from each of the worker nodes is then sent to the head node.  The head-node then executes the second query against the combined output, providing the final output.

The primary issue here is the memory utilization.  Against the 300GB of normal/tumour paired data, about 110GB of memory would be required (e.g. 11 nodes with 10GB of memory each), and that is without any indexing at all.  

Due to its simplicity and the reliance on Python (which I am in love with almost as much as I am in love with C#), this should be very easy to implement.

C
By building a memory efficient data structure (before applying any compression at all), it is possible to reduce the memory requirements by a half, meaning the entire 300GB data set could fit into about 50GB, with the data being separated out into 10MB "pages".   By using MPI, it should then be possible to distribute pages amongst many different nodes.  During querying, each "page" could be queried separately in parallel, with the output from each page then being collected to form the final output.

The key disadvantage of this method is that you lose the SQLite query engine, meaning I have to build my own SQL parser and query engine.  This will be rather time consuming, especially since subqueries and joins are vital parts of this.  

I had hoped that I could replace SQLites internal storage engine ("b-tree") with my own column oriented one, but it appears that this is a lot of work (although the people at Oracle managed to do it with Berkeley DB, so there may be some hope).

C-Store
Built as a proof of concept for column oriented stores, this looked like it would fit the bill perfectly.  However, it is no longer under active open source development, and has since transformed into the commercial product Vertica.  While the source is available, it requires old versions of Berkley DB and just seemed like a pain to get up and running.

MonetDB
An open source column oriented database system, but unfortunately it does not appear to do a very good job of compressing data, with a 610GB workload resulting in a 650GB database (without indexes).  That pretty much rules this one out for this workload.  Oh, and it has no distributed querying capacity.   

LucidDB
On the same test as MonetDB, the 610GB workload was compressed to 120GB (without indexes), which is very good.  Unfortunately though its query performance seems quite poor according to this article.  It also does not have any distributed querying capacity. 

HyperTable - Nope (see below)
HyperTable looks really promising, and is built up from the ground to be distributed.  Its HQL syntax is very similar to SQL  The only problem is that it runs on top of Hadoop, which makes it somewhat tricky to setup on a shared cluster like we have at the VLSCI.  I may revisit this later down the road.  

Turns out HQL doesn't support aggregate queries, which are central to this project, so it is not suitable. 

 

Anyone know of any other database's that may be worth looking at?

 

Genome Query Language

The problem

Whole genome data sets are rather large, with alignment files (.BAM) often topping 150GB or more. In cancer research this is even more difficult, as for each patient we often have a normal sample, a tumor sample and sometimes post-treatment samples.  Multiply this by the number of patients within your sample and all of a sudden we are talking about petabytes of data, all of which needs to be analysed.

This makes analysis really difficult, as many analysis tools will take days if not weeks to run on these types of data sets.  Even more challenging is that many of these tools are fragile, or have many settings that need to be tweaked to produce meaningful results.  

This leads to the situation where you run a tool (wait 5 days), realize you have messed up a setting, run it again (wait 5 days) realize your data has another peculiarity and needs to be further filtered, run it again (wait 5 days), and so on.  Before you know it, a month has passed and you still have meaningless data.

Even more annoying is that my specific project has access to super-computing resources at the VLSCI - but none of the tools are really built to make use of the high powered cluster that we have. 

The solution

A simple, SQL like language that enables you to interrogate genomic information in real time by leveraging high performance computing clusters.  By utilizing a query language, it will be possible change how data is aggregated and filtered in a simple, consistent manner. 

An example of a query

 Say you wanted to compare copy number differences between a Tumor sample and a Normal sample.  You decide the best way to do this is to find all the 2000bp regions where there are more than twice as many reads as there are in the normal sample.  A query may look like:

 

SELECT 

 NormalBuckets.BucketPos,

 TumorBuckets.Reads AS TumorReads,

 NormalBuckets.Reads AS NormalReads

 FROM

(

SELECT

 -- Number of reads in the bucket
COUNT(*) AS Reads,

-- The position, seperated into 2000 nucleotide buckets
Bucket(T.AlignedPosition, 2000) AS BucketPos,

FROM Tumour T

-- Filtering
WHERE 

-- No PCR Duplicates
T.IsDuplicate = 0

-- Must have atleast 49 matches (out of our 50BP reads)
AND T.Matches >= 49

-- Must not have any hard clips
AND T.HardClips = 0

-- The Probability of a wrong alignment must be less than 0.05
AND T.MapQ < 0.05

-- The average probability that a base call is wrong must be less than 0.05
AND T.AvgQuality < 0.05

GROUP BY 

Bucket(T.AlignedPosition, 2000)

) TumorBuckets

INNER JOIN 

(

SELECT

COUNT(*) AS Reads,

Bucket(N.AlignedPosition, 2000) AS BucketPos,

FROM Normal N

-- Filtering
WHERE 

N.IsDuplicate = 0

AND N.Matches >= 49

AND N.HardClips = 0

AND N.MapQ < 0.05

AND N.AvgQuality < 0.05

GROUP BY 

Bucket(T.AlignedPosition, 2000)

) NormalBuckets
ON TumorBuckets.BucketPos = NormalBuckets.BucketPos

-- We want more than twice as many reads in the tumor bucket than the normal one
AND TumorBuckets.Reads > NormalBuckets.Reads * 2

 

This way we have a simple definition of what we want to achieve.  There is no need to spend time researching the samtools interface, no hunting for different tools, modifying / recompiling tools.  Just a simple definition of what you want to find.

And, the goal is that such a query should return (on 300GB of data) in under a minute on a 10 node cluster.  That means you can experiment with filtering changes in near real time.

I plan on making this open-source when its ready, and it will be hosted on github.com

Cheers!

A little bit about ABI SOLiD Sequencing

The following is an except from my research project into ovarian cancer at the Peter MacCallum Cancer Centre 

Solid sequencing

SOLiD sequencing uses “color-space”, which binds fluorescent primers to the DNA two nucleotides at a time.  For example, a “blue” on the first di-nucleotide pair (bases 2 & 3) will correspond to a double of any nucleotide (e.g. AA, TT, GG, CC).  The di-nucleotide to be read (bases 3 & 4) may then be “green” (AT, CG, GC or TT).  If we know that the nucleotide 1 is a “C”, then the “blue” call can only correspond to nucleotide 2 being a C.  Since we know that nucleotide 2 is a C, then “green” can only correspond to nucleotide 3 being a G.  Thus it is important that the true base of the first nucleotide is known, as any mistake will mean that the entire read is incorrect.  The first base is generally well known, as it is the last base of the adapter sequence (Applied Biosystems, 2008). 

When a reference genome is also available, the “colour space” measurements enable the detection of errors.  Without additional error correction a single error colour base call will affect the calling of all downstream nucleotides.  However, where a reference is known, single incorrect calls can be detected when compared to the reference as they will appear as a read that matches perfectly (in color space) aside from one call, enabling the misread colour to be detected.  Additionally, since a SNP will require two colour changes there is a clear distinction between a SNP call and a single erroneous call. 


(Applied Biosystems, 2008)

The BioScope alignment software supplied by Applied Biosystems uses all this information to produce the alignment that we will be utilizing. 

Long mate pair & library preparation

Long mate pairs enable the sequencing of two ends of a single strand of DNA that is of a known length.  The distance between the two reads is known as the insert size, which is known prior to the alignment.  By comparing the known insert size to the insert size following alignment; it is possible to detect structural variations such as tandem duplications. 

In order to sequence DNA in this way, the DNA is randomly sheered, creating a distribution of different length DNA fragments.  Fragments are then selected for based on their size (in our case 1500 bp), and are capped with adaptors by ligation.  The capped fragments are then circularized, biding the two ends of the fragment.  A nick translation reaction enables the circularized fragment to be cut from both sides of where the two ends are joined.  By controlling the time and temperature of the reaction, the position of the cuts away from the join can be determined to control the length of DNA at each end (e.g. 50bp).    The now bare ends of the fragment are then ligated with adaptors and are suitable for sequencing.  This process is shown in the following diagram:

(Applied Biosystems, 2010)


 

Errors

Despite the error correcting technologies employed by this sequencing technology, the data is far from perfect.  The detection of sequencing errors by relying on the reference sequence is not infallible as it may not be possible to know if a read that fails to map is because of a single colour space read error, or because read actually covers a structural variation.  Obviously, the statistics far favour the likelihood of an error in reading (which is claimed to be 0.1%) (Applied Biosystems 2008) versus the chance of the read being correct and their being a structural variation, but when all these reads are corrected in this manner, it may be difficult to find “true” structural variations.

The library preparation process is difficult and needs to be exacting.  The selection of fragments by size requires meticulous wet-lab work, with the difference between selecting for 600bp fragments and 6kb fragments being the difference between 1% Agarose solution and 0.8% Aragose solution (Applied Biosystems 2010).  This process will always lead to significant variability in the length of the fragments.  Given that it is the insert size that is the primary signal we are looking for, this is something that needs to be corrected for.  However, the orientation of the pairs should not suffer from this variability to the same degree due to the simpler chemistry of how the different pair adaptors are joined to the fragment.

Structural variation detection: Second generation sequencing

The following is an except from my research project into ovarian cancer at the Peter MacCallum Cancer Centre 

There are four primary methods of detecting structural variation (SV) within second-generation sequencing.  These include: read pair discordance, read depth analysis, split read analysis and sequence assembly.  Sequence assembly is not yet feasible with short-read whole genome human data at this point, so will not be discussed.

Read pair discordance

As mentioned previously, both SOLiD and Illumina sequencing technologies provide the opportunity to read two ends of the same strand of DNA.  The sequenced strands are of a known length, so there is an expectation that the two reads should be within a known distance of each other on the genome.

Paired end discordance methods take the reads which have been aligned to the reference genome and look at how far away the two pairs align.  If the first read is found the expected distance from the second read (and in the correct orientation) then the pair is said to be concordant.  If not, the read is discordant and may provide evidence that the sample genome differs structurally from the reference genome, providing evidence of a structural variation.

There are a large number of tools that utilize this basic methodology for the detection of structural variation.  A key metric for the use of these tools is the amount of citations that the tool has received, which are summarized below.

 

Tool

Author

Journal

Year

Citations (Oct-2011)

BreakDancer

Chen

Nature Methods

2009

68

VariationHunter

Hormozdiari

Genome Research

2009

52

PEMer

Korbel

Genome Biology

2009

38

MoDIL

Lee

Nature Methods

2009

34

GASV

Sindi

Bioinformatics

2009

18

NovelSeq

Hajirasouliha

Bioinformatics

2010

14

SVDetect

Zeitouni

Bioinformatics

2010

7

SLOPE

Abel

Bioinformatics

2010

6

 

 

Due to the number of tools available and the limitations of the word limit of this piece, I will only discuss BreakDancer in detail.

BreakDancer

BreakDancer uses the aligned genome by way of SAM or BAM files to look for areas within the sample genome that contain more discordant pairs than would be expected through random chance.  These regions are then classified into six types: normal, deletion, insertion, inversion, intra-chromosomal translocation, inter-chromosomal translocation.  Categorization is done depending on the size of insert size discordance and the orientation of the reads.  Regions with two or more discordant reads are considered for further analysis using a Poisson model that considers the number of supporting reads, the size of the anchoring region and the coverage of the genome.  The type of the structural variation call is then decided by type with the most anchored reads.

A key difficulty with using BreakDancer is the number of false positives that the tool creates, as seen by the following table:

 

 

Normal 

Tumour 

Inversions 

8,187 

1,304 

Deletions 

129 

8 

Intra-chromosomal translocations 

1,709 

4 

Total 

10,025

1,315

 

 

It should be noted that the configuration run, BreakDancer was not set to look for inter-chromosomal translocations.

These results imply that the normal sample has more structural variation than the tumour sample which is known not to be the case.  Interpreting these results will be a key aspect of this research project. 



Read depth techniques

A key advantage of read depth techniques is there ability to detect SV within highly repetitive regions of the genome, as “paired-end mapping frequently cannot unambiguously assign end sequences in duplicated regions, making it impossible to distinguish allelic and paralogous variation.” (Alkan, Kidd et al. 2009).

MrFAST

MrFAST (micro-read fast alignment search tool) is a tool designed to detect CNV using second generation sequence data.  It is also the most popular SV tool described in this review, having been cited by 103 papers as of October 2011.  There is also an additional version called “DrFast” which is designed for SOLiD color-space data.

Operation
MrFAST attempts to align the raw reads to a reference genome, much like aligners such as BioScope or MAQ.   However MrFAST differs in two key details.  Firstly, it does not attempt to map the full reads, instead breaking a read up into k-mers (with a default length of 12), which are then aligned.  Secondly, most aligners when faced with a read that matches multiple loci within the genome will select a loci at random.  MrFAST on the other hand will map the read to all matching loci in order to reduce variability.  Additionally MrFAST also tracks the “edit distance” for each read at each loci to both reduce the impact of sequencing error and also to enable the calling of SNPs, which is important in determining whether a called copy is actually functional or not.(Alkan, Kidd et al. 2009)

Validation
The results called by MrFAST were validated using Array-CGH and FISH Analysis. It is difficult to quantify the success rates of the Array-CGH validation as they only sought to validate those called duplication intervals that were not shared across all three of their samples, in which case they found a validation rate of 68%.  They also used FISH analysis to validate 11 duplicated loci that were different between two of their samples, finding the FISH results to be “highly consistent with the absolute copy number predicted by MrFAST” (Alkan, Kidd et al. 2009).

 

 

CNV-Seq

CNV-Seq uses a different model to MrFAST, in that rather than seeking to detect the absolute number of copies, CNV-Seq uses a comparative technique enabling the detection of differences between two samples.  This is of particular interest for cancer data where we have two sample, tumour and normal and the primary interest is in the differences.

Operation
Both samples are aligned to a reference genome.  Using a sliding window, the read depth of each window is calculated for each sample.  The read depth distributions are compared using a Poisson model (using a normal approximation) that enables the calculation of a probability that the difference is the result of random chance (Xie and Tammi 2009).

Validation
Calls made by CNV-Seq were validated against the low-coverage genomes of Dr Craig Venter and Dr J Watson (at 7.5x and 7.4x coverage), with the results compared to known regions achieving a 50% overlap of known regions.  Results were also compared to a-CGH micro-array experiments with the majority of calls not being validated, however this was seen as evidence of the superiority of the sensitivity of the CNV-Seq technique.

 


 

Split read techniques
Split read techniques attempt to locate the exact junction of a break point by looking for reads which have “hard clips”, which is to say that a large portion of the read maps to the reference genome, but the remaining section does not.  One possible explanation for a hard clip is that their is a structural variation in the region that the read spans. Thus part of the read may map to one loci, and the remaining hard clipped region may map to another - providing evidence of a structural variation.

This technique requires long reads (the CREST algorithm requires reads of > 75bp in length (Wang, Mullighan et al. 2011)) to be effective.  With shorter reads, the hard-clipped region is typically very short, meaning that it may map to a very large number of regions within the genome making it impossible to detect the secondary loci.