Building data structures that are smaller than an Array AND faster (in C)

This is being discussed over at Hacker News, check it out.

 In developing software for large data sets (billions of records, terabytes in size) the way you store your data in memory is critical – and you want your data in memory if you want to be able to analyse it quickly (e.g. minutes not days).

Any data structure that relies on pointers for each data element quickly becomes unworkable due to the overhead of pointers.  On a 64 bit system, with one pointer for each data element across a billion records you have just blown near 8GB of memory just in pointers.

Thus there is a need for compact data structures that still have fast access characteristics.       

The Challenge

The challenge is to come up with the fastest data structure that meets the following requirements:

  • Use less memory than an array in all circumstances
  • Fast Seek is more important than Fast Access
  • Seek and Access must be better than O(N).

Where Seek and Access are defined as:

Access (int index): Return me the value at the specified index ( like array[idx] ). 

Seek (int value): Return me all the Indexes that match value.

(The actual return type of Seek is a little different, but logically the same.  What we need to return is a bitmap where a bit set to 1 at position X means that value was found at index X.  This allows us to combine results using logical ANDs rather than intersections as detailed here)


Possible solutions

Array (array) 

The first solution which we compare all others too is the humble integer array. 

Memory: 4 bytes for each integer, making space complexity O(N).

Access: Nice and fast, O(1).

Seek: Awful! We need to check every item in the list and is O(N).  Not good enough!


Wavelet trees (wtree)

This is a type of “succinct data structure”, which is pretty incredible.  It actually blew my mind when I first read about it here

The structure computes a list of all the available keys (distinct values, ordered of length K), then stores a bit array that is N bits long (the number of values).  It then needs Log2(K) such bit arrays, which I will refer to as rows.  Essentially the bitmap is a representation of a binary search tree.  

The data structure takes a bit of work to get your head around, so I will explain it with an example.  Say we have the data:

3 2 4 6 2 6 (Keys: 2, 3, 4, 6)

If we store the keys, we can re-write our data as:

            1 0 2 3 0 3

Simply by storing the indexes to our keys array (our first element in our data is 3, which can be found at keys[1]). Easy yeah?

Now we build out first bitmap row.  Looking at our new data set, our minimum is 0, our maximum is 3, so if we do (max – min) / 2 we get 1, which we will use as our middle value.  So to write out first row, we will write a 0 if the value in our index is less than or equal to our middle value (in this case 1) and 1 if it’s greater.

This means our first row is:

            0 0 1 1 0 1

For our next row we group all our ones and zeroes together, such that we have a) and b), where a) is made up of all the zeroes and b) is made up of all the ones.

Group a) is now made up of the following values 1 0 1 which we use to repeat the same process as the first row.  We now use the “middle” value from the last row as our “maximum”, and keep the same minimum.  We then calculate our new “middle” (max – min) / 2 = (0 – 1) / 2 = 0.  This means for Group a) we need to write a zero for each 0 and a one for each 1 (that’s worked out nicely hasn’t it!).  Therefore the bitmap for Group a) is:

            1 0 1

Group b) is made up of the following values: 2 2 3, so we can repeat the same process as above.  Here our max is 3, min is 2 and our middle value is 2.  This means our final bitmap for Group b) is:

            0 0 1

Meaning our representation of the 6 values is now:

            0 0 1 1 0 1 (first row)

            1 0 1 0 0 1 (Group a + Group b)

Awesome, our 6 values are now represented by 4 integers and 12 bits.  You can see how for each additional value we add (provided we don’t add more keys) only adds a pair of bits to our storage size, making this structure very efficient.


Accessing these structures is a little more complex than accessing an array. Essentially we walk down the “rows” checking whether the bit at the corresponding index is set.  The key here is that at each level our index needs to be modified to compensate for the fact we have “grouped” values at each level. 

So if we are looking for the value at index 3 (which we know to be 2), we first inspect the bit at index 3 of row 0.

            0 0 1 1 0 1

Since this is a 1, our index needs to change to be the first index of group B.  This is done by counting the number of Zeroes in the first  complete row, which gives us the point where our group starts.  By then also tracking how many “1” bits are set before my index, I can adjust the index to check for the next row.

The bad news is that this means counting a lot of bits.  The good news is than Nehalem processors have an SSE4.2 instruction that enables us to count 0’s and 1’s VERY quickly (popcnt).  The bad news is that the indexes are not necessarily aligned to words, so additional work is required in aligning bitmaps to word boundaries (which I have not done).


Seek is much more complex, but works by traversing our keys list as if it was a binary search tree.  Here we first take the middle value of our keys, and compare it to our value.  If our value is larger than the middle value, our bitmask tells us which values are also larger. For example you can think of the first row as telling us which numbers are bigger than 1 (where 1 is set), and which are smaller (where 0 is set).

The goal of the seek to return a bitmap the same length as our rows, with a 1 set wherever the value matches the corresponding index.

So if the value we are searching for is larger than the middle value, our temporary mask becomes the first row.  If it was smaller, then we flip the bits on the first row.

We are then directed to either group a) or group b) depending on the result of the first comparison (group a if its smaller, or group b if its larger.   At this point we find out what the “middle” value is for this group and compare it to the value we are seeking.  If our seek value is larger than this value, we can just use bits as is.  If its smaller or equal then we need to invert it.  We then use this as a “group mask” which we use to refine our temporary mask.

The refine operation simply looks at all the 1 bits in our temporary mask, and works out what number bit it is in the mask if there were no 0’s present. (thus in “0 0 1 0 1 0 ”, the index of the first 1 would become 0, and second 1 would be 1).  We then use the “group mask”, which tells us which of those set bits need to be set to one.  Take the following example:

0 0 1 1 0 1      (temp mask)

0 1 0               (group mask – note it is 3 bits long the same as the number of 1’s in temp mask)

0 0 0 1 0 0      (result, the first and third ones are set to 0).

I have implemented the code to do this, but it is far from efficient (if you like making C code fast, this may be a good project for you!).  In my implementation I also store the “rank indexes” (the start of each group) so as to prevent the entire bitmap from needing to be scanned all the time, which makes it faster but use more memory.

Hybrid (tmap)

Given the complexity of Wavelet tree’s and its disappointing performance, I developed a hybrid, which stores values in the same way as a wavelet tree, but does not group values.  This makes accesses much faster as the index doesn’t change.  It also makes seeks much faster than an array, as we just do a logical AND down the rows.  Since our bitmaps are implemented in 64bit integers, and we have only Log2(K) rows, we can test (64 / (Log2(K) -1)) values per operation. 


So how do they perform?




























Access Performance


























Seek Performance



























There is no doubt that on paper, the Wavelet tree structure should perform the best out of all the structures, but as we can see here, its performance is AWFUL.  I have no doubt that there is plenty of room for optimization (particularly in the “refine” and “access” methods, where utilizing the popcnt SSE4.2 instruction would no doubt have a massive impact on performance.

So the challenge remains open.  Can you make a faster, smaller array than those described here?

You can view the source code for this project at


29 responses
If we store the keys, we can re-write our data as:

1 0 2 0 2 3

That bit seems wrong?

Yeah, shouldn't it be 1 0 2 3 0 3? Or am I so dumb that I can't even get the first part right? ;)
Yes, I messed that up (it is now corrected).

Everything down stream from that seems ok, but to make a mistake at step 1 probably explains why it took me so long to implement it in code!


Some comments:

- as the wavelet tree depends on the number of distinct keys it would be important to report those values in your graphs. I can only see the array size.

- based on your code you used sequential accesses to test the access() speed. this will likely result in cache effects that you wouldn't get with random accesses. Also the data you tested is quite small and a larger experiment would have been nice?

- I did some small test comparing your wavelet tree access method to one I wrote and yours is 52x slower. there might to be something wrong with your code.

- your seek() operation could be replaced by calling the select() operation nocc times. not sure which one is faster.

Nice pic! Well, if you are looking for the right name for that you can choose from these: Irish names, French names and muslim baby names. Be sure to be unique.
Hello Mathias,

To address your points:

- My size comparison does include the size of the keys (everything that goes through malloc and free get counted).

- Yes, this sequential access may be "cheating" for the array version especially. The others hit a lot more data, so they are probably being hit unfairly. I will resole this in my next update.

- Yeah, the Wavelet tree code performs awfully. I was just so damn excited to finally get the code to produce correct answers that I felt I had to share it with the world. I will now be re-structuring the code to use rank/select dictionaries rather than my bit vectors

So yeah, stay tuned! Hopefully I will have something awesome working soon!

Terence: what I meant was that you have to run different experiments for different values of K (the number of distinct keys) as this affects the height of the wavelet tree and therefore the performance.
Looks like there should be a couple down stream corrections, group a) is shown as 1 0 1, but should be 1 0 0, and group b is listed as 2 2 3 but should now be 2 3 3
Thanks for the comment Joe, yes I was not quite a careful with my proof reading as I should have been. It's little wonder it took me so long to implement this.
Thanks for the correction, I'll update the post tomorrow.


Ok, you've tried three structures.
You talk about memory and memory access times, then you should then look at how the data is put into memory, not necessarily what data is put into memory.

One thing remained the same all the time, that is malloc().
I see malloc() as the weak point in this showdown.
I've seen different malloc implementations, zmalloc, tmalloc, xmalloc() etc.
One that convinced me was one written from somebody who made his own kernel and malloc implementation. That one was written in assembly and carefully thought out.

One thought, I think that the fastest array isn't one that get's linearly chunked by the MMU->TLB, but one that you can algorithmically cut to access any data.
I know of ractal-random arrays that could be implemented to achive that. Of course it'll slow down memory allocation, but seeks/access will be immensely faster. Now to cope with that limitation by design, you can converge memory allocation into using ractal-random arrays when the data requires it or a flag has been set, or using normal memory allocations when there is no need for that type/size of data or a flag has been set.

my 2 cents. I think I'll try that one sooner or later :)

This takes me way back,

First of all, congrats on diving into C/C++. You're going to rock it as a software engineer after this experience.

Second, try pointers. Solving for speed and memory is difficult without directly managing memory and structures at the lowest level. Its easy to optimize one or the other in a pure hypothetical scenario.

The problem with optimizing for both is that the solution really depends on how the data is accessed. The advantage to pointers is that you can build a custom structure that closely matches how the data is used.

Mainly pointers let you build several structures around the same buffer so theoretically you are allocating structures once and building several maps. The advantage is one map can be optimized for writes and the other for reads, bu the data stays put.

Google does this and even asks a few questions along those lines at the interview. Nice post, got me thinking about how much I used to enjoy these types of problems on software projects. C/C++ is rare in that its one of the few languages that lets an engineer do this.

Interesting post, I hadn't heard of Wavelet Trees before and will have to investigate further, thanks.
You might try an array with HeapSort.
> (max – min) / 2

Shouldn't this be (max - min)/2 + min?

> Group a) is now made up of the following values 1 0 1 ...

Shouldn't this be values 1 0 0? (How can it be "1 0 1" when there is only one 1 in the original array?).

> Group b) is made up of the following values: 2 2 3 ..
Shouldn't this be values 2 3 3?. Also only one 2 in the original array.

> (max – min) / 2 = (0 – 1) / 2 = 0
Not (1 - 0) / 2?

A sorted array would get seek down to O(log n).
You should maybe try to get a low count of malloc rather than a low memory consumption. I read than doubling your memory size when your need more memory is ways faster than every other memory management techniques.

Your datastructure for huge datasets will likely we used on system with large memory ram. So find big or huge blocks won't be a problem

I am confused by the article. I am thinking that Joe Betz is correct. Looks different from Alex Bowe's illustrations as well.
> You can see how for each additional value we add ... only adds a pair of bits to our storage size, making this structure very efficient.

Packed array will use also 2 additional bits per new value. From you example I don't see any advantage over packed array.

Great work! Out of curiosity, have you tried compared your implementation against one like
9 visitors upvoted this post.