Simple recipes for predictive analysis of big data in R

For those of you not familiar with R, it is a scripting language designed for statisticians, written by statisticians.  It’s important to remember the fact that it’s primarily written by statisticians because it is not a beautiful language in the way that Python, Haskell or C may be.  R is a language of convenience, a haphazard set of commands that take a while to get your head around as there is no real consistency between libraries (as you will see!). 

It is R’s lack of consistency and the complexity of its functions means that it is easier to be productive in R when you have a few basic recipes for simple tasks, which is the reason I put together this document.  It is easy to get confused as to what functions require a data frame, versus a matrix (and should it be transposed?), a vector versus a factor and so on.

Of course, had this analysis found a sure fire way to make money, the last thing I would have done is to publish this analysis!  So if you are looking for an article on predicting stock price movements, this is only going to frustrate you.  However if you care about learning some basic statistical analysis in R, hopefully you find this interesting.

Predictive Analysis

This post will discuss a few statistical methods for predictive analysis using supervised learning techniques.  The techniques are supervised as they require a “training” data set and a test data set which we use to validate how good our models are.  Further, we are attempting to “classify” data into groups rather than trying to predict individual values.


These techniques are techniques used by Bioinformaticians to predict whether certain gene expressions levels are able to predict what disease state a patient may have.

This gets a bit complicated, so instead I will use a more accessible example, where we try to predict movements in Apple’s share price.  This is probably the worst way these techniques could be used, but it makes for a bit of fun.

Here we want to predict if Apple’s share price is going to go “down a lot” (down by more then -1%), be “unchanged” (up or down 1%), or up a lot (up more than 1%).  By defining these states, we can then use our classifier to predict what the stock price is going to do (initially I had 5 states, including “up a bit” and “down a bit”, but the results were even worse).

Our hypothesis is that there is some combination of stock changes on a Monday that will tell us something about how Apples price will move on Tuesday.  We hope to use statistics and machine learning to find that combination.

So where is the “big data”?  Well in order to build our classifier we are going to use the daily percentage change of every stock on the NASDAQ, which is about 2,000 different stocks.  We are going to do this over 200 days of trading data for our training set, then 20 days for our test set.  This is only 400,000 data points which is pretty small, but I wanted an example where I could distribute the data easily.  Bioinformaticians will use similar techniques where rather than 2,000 stocks we may have 1,000,000 different loci on the genome and rather than 200 days, we may have 20,000 samples (20,000,000,000 data points which I think classifies as “big data”).

You can download the data used in this analysis here [4mb].


Loading new libraries

The strength of R is in its libraries which we will make extensive use of.  To get them all installed, you will need to enter the following commands.


Reading in the data

Firstly download the files here, and extract them to a directory.  We then need to point R to the directory you extracted the files to (“File” > “Change dir”).  Enter the following commands:

X <- as.matrix(read.csv("training.csv"))
X.T <- t(X)
Y <- read.csv("training_changes.csv")[,]
zz <- as.vector(Y)
# Remove the "DOWN_SML" and "UP_SML" categories 
zz[zz=="DOWN_SML"] <- "UNCH"
zz[zz=="UP_SML"] <- "UNCH"
Z <- as.factor(zz)


X now contains our training data, which has a column for each stock and a row for each day.  Each cell is the percentage change in price for that day.  X.T contains the same data, but the rows and columns switched.  Z contains our classifications (“DOWN_BIG”, “UNCH”, “UP_BIG”) for Apple on the next day.  Y contains a higher level of accuracy, discriminating between small moves up and small moves down (less than 1% in each direction), which we will leave out as they ruined the downstream analysis!

Narrowing down the data

We have 2,580 stocks here that might have an impact on our model, but most of them probably won’t.  So what we want to do is to exclude all those that have no significance.  The way we do this is to calculate permutation adjusted p-values for step-down multiple testing procedures.  Essentially we doing an ANOVA regression for each of the different stocks to determine if the stock is significant in determining what Apple’s stock will do the next day. 

Let us assume that a stock is significant in predicting Apple’s stock if it has a p-value of 0.05 or smaller (the probability that the variance between classifications is the same as the variance within classifications).  Because we have 2,580 stocks, we would expect to find 129 stocks that are significant just due to random chance (0.05 * 2,580).  We therefore need to do some kind of “multiple testing” adjustment, which is exactly what the following command does.

maxT <- mt.maxT(X.T, Y, B=1000, test="f")
length( which(maxT[,4] < 0.05 ))
rownames(X.T)[maxT[ which(maxT[,4] < 0.05), ]$index]

This gives us 10 stocks that are significant at the 0.05 level when corrected for multiple testing ("PCO"  "CCBG" "ABAX" "ELRC" "FISI" "INTL" "HFWA" "WINA" "CDZI" "BMTC").  We will limit our downstream analysis to just these stocks, using the following code.

X.T.I <- X[,maxT[ which(maxT[,4] < 0.05), ]$index]
X.I <- t(X.T.I)

K-Nearest Neighbours Clustering

This is a simple classification system that classifies an unknown point by giving the point the same classification as the k nearest “known” points.  We can build this classifier and test it on our training data (self-validation) as follows:

X.KNN <- knn( X.T.I, X.T.I, Z, k=10)
sum(X.KNN == Z) / length(Z)

This gives us a success rate of 0.59, which is optimistic as we are using the same set to train as to validate our model.  We can use cross-validation (where we leave out a chunk of the training set to train, then test it on the remaining set of data) to get a more accurate estimate of success: <-,  Z, k=10)

Here we see that using cross validation, our success rate drops to 0.5, which is certainly not at a level you would feel comfortable investing your hard earned cash on.

Of course, the choice of K=10 is completely arbitrary, so we can attempt to find a better value of K by trying a large number.  We get our best value at K=7, but it’s still only 0.31.  However each time you run a cross-validation you will get a different number due to the random selection of which chunk it left out for validation.

Penalized Generalized Linear Models

Linear models are rather simple, fitting a straight line to a set of data based by minimizing the error (distance from the model).  Generalized Linear Models are a generalization of this technique that makes different [thanks LikeLeeHood of /r/statistics] assumptions about the data, enabling it to model non-normal distributions (any distribution that is a member of the exponential family of distributions, which includes binomial and Poisson).  The GLMNET package uses penalized linear models, rather than true linear models.

X.GLM <- cv.glmnet(X.T.I, Z, family="multinomial", type.measure="class")
X.GLM.M <- glmnet(X.T.I, Z, family="multinomial", lambda=X.GLM$lambda.min)
X.GLM.P <- predict(X.GLM.M, type="class", newx=X.T.I)
sum(X.GLM.P == Z) / length(Z)

Here we use cross validation to select the minimum lambda for our model (the model with the least error), then predict using the model.  When we run the model against our training data, we end up with a classification rate of 0.54, only slightly better than KNN.  One must keep in mind that these predictions are made with the same data that generated the model, which makes this poor performance even sadder.

Support Vector Machines

If you think of a linear model as trying to draw a line through a cloud of points, support vector machines are all about drawing lines to separate different clouds of points that belong to our different classifications.  In the linear case, these lines are straight, but we can step it up a notch by transforming these otherwise straight lines so that they form curves.  For where we have more than 2 dimensions, we create hyper-planes to separate our clouds out.  There is some pretty heavy math behind all this, but luckily there is an R library for that!

X.F <- data.frame(Z, X.T.I)
X.SVM.B <- best.svm(as.factor(Z)~., data=X.F, type="C-classification", kernel="radial")
X.SVM.M <- svm(as.factor(Z)~., data=X.F, type="C-classification", kernel="radial", cost=X.SVM.B$cost, gamma=X.SVM.B$gamma)
X.SVM.P <- predict(X.SVM.M, X.F)
sum(X.SVM.P == Z) / length(Z)

This yields a correct classification rate of 0.63, which is our best rate so far.

Classification Trees

Classification trees attempt to build a tree like structure where a threshold is created for the value of each stock whereby if it’s less than that value, we go down the left branch of the tree, and if its greater we go down the right branch.  After traversing down many levels of the tree, we end up at a node which should tell us the classification.  We will draw out the tree with the following commands:

X.CT.M <- rpart(Z~., data=X.F, method="class")
plot(X.CT.M, mar=c(0.1, 0.1, 0.1, 0.1))
text(X.CT.M, use.n=TRUE)

It looks pretty ugly, and is likely to have overfit the data, so we want to prune the tree to remove excess decision points to make our model more general.  This is done with the following command:

X.CT.Pruned <- prune(X.CT.M, cp=0.021739)

Where 0.021739 is the CP value that has the lowest cross validation error (in this case 1.1087).  We then run our classifier against our training set to get a feel for how good it is.

X.CT.P <- predict(X.CT.Pruned, type="class", newdata=X.F)
sum(as.vector(X.CT.P)==Z) / length(Z)

This yields a correct classification rate of 0.69, putting classification trees in the lead.


Up to this point we have only used the training data to validate our models, which is pretty much meaningless.  What we need to do is to test our models on a separate set of data.  Luckily we have a training set of data, which are the classifications over the next 20 days (4 days UP_BIG, 9 days UNCH and 7 days DOWN_BIG).  So let’s see how our models do against actual data.

X.Test <- as.matrix(read.csv("test.csv"))

# Select only the stocks we found significant earlier
X.Test.I <- X.Test[,maxT[ which(maxT[,4] < 0.05), ]$index]
Y.Test <- read.csv("test_changes.csv")[,]
zz <- as.vector(Y.Test)
zz[zz=="UP_SML"] <- "UNCH"
zz[zz=="DOWN_SML"] <- "UNCH"
Z.Test <- as.factor(zz)



X.Test.KNN <- knn( X.T.I, X.Test.I, Z, k=10)
sum(X.Test.Knn == Z.Test) / length(Z.Test)
[1] 0.4


X.Test.GLM.P <- predict(X.GLM.M, type="class", newx=X.Test.I)
sum(X.Test.GLM.P == Z.Test) / length(Z.Test)
[1] 0.45



> X.Test.F <- data.frame(Z.Test, X.Test.I)
> X.Test.SVM.P <- predict(X.SVM.M, X.Test.F)
> sum(X.Test.SVM.P == Z.Test) / length(Z.Test)
[1] 0.45



> X.Test.CT.P <- predict(X.CT.Pruned, type="class", newdata=X.Test.F)
> sum(as.vector(X.Test.CT.P)==Z.Test) / length(Z.Test)
[1] 0.3



So none of our techniques really inspire much confidence, with none of the strategies being correct more than 45% of the time, which is not much better than guessing.

What if we take the strategy that we only buy AAPL stock when our model tells us that the next day is going to be “UP_BIG”?  Well our KNN model suggested this would happen 4 times, with 2 of those times resulting in big losses (AAPL was down by more than 1%), and the other two being “UNCH”.  Classification Trees did a bit better with its 4 predictions, getting 1 correct and 3 “UNCH”.

The best advice came from SVM and GLM, as they predicted “UNCH” for each of the days, effectively telling us not to bother.

Interestingly, both GLM and SVM predicted “UNCH” for all of the test dataset, providing perhaps the best advice – not to invest using this strategy at all!

If you feel I have made any errors in this analysis, please let me know - there has been a lot of copying and pasting between R and Word!