
What is a succinct rank data structure? How does it work?


I've heard of a class of data structures called succinct rank data structures. What do these data structures do? What does "succinct" mean here? And how do they work?


Solution

  • The binary ranking problem is the following. You're given an array of bits, which we'll denote B. The bit array B has n bits in it. The goal is to preprocess B so that you can efficiently answer queries of the following form:

    Given an index i, what is the sum of the first i bits of the array?

    This is called a rank query, and we'll denote it as rank(i).

    For example, suppose you were given this bit array:

    11011100101110111100
    

    With our notation from above, rank(5) would be the sum of the first five bits of the array. Those bits are 11011, so we'd have that rank(5) = 4. Similarly, you can check that rank(10) = 6. As an edge case, we have rank(0) = 0, since if you add up no bits you get back 0.

    Intuitively, this seems like a pretty easy problem to solve. We can preprocess the array by simply writing down all the prefix sums in a separate array. That might look like this:

    The bit array 11011100101110111100 annotated with the prefix sums at each point. Those prefix sums are 0, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 7, 8, 9, 9, 10, 11, 12, 13, 13, 13.

    With things set up this way, we can, in time O(1), compute rank(i) by simply looking up index i in this auxiliary array.

    So... we're done, right? Unfortunately, no. Let's take a minute to think about how much total space this array takes up. It's an array of n+1 integers (we have one array element for each position in the array, plus one more right after the last element of the array), and so it sure seems like this would use O(n) total space. While in a sense this is true, this figure is misleading.

    Imagine, for example, that we're on a 64-bit machine, where each integer is represented as a group of 64 bits. A naive implementation of the above approach might use 64 bits for each of the integers in our array, meaning that we'll need to use (roughly) 64n total bits for this array. Compare this against the amount of space required to write out our original array of bits B. The array B is n bits long, so it uses only n bits of memory. This means that our auxiliary table, which stores all the prefix sums, uses 64 times as much space as the original bit array itself!

    We might shrug this off as not that big a problem. Sure, it's 64× bigger than the original array - but is that a problem? And unfortunately, it is indeed an issue. Many of the applications where you would want a bit array like this, such as storing massive text strings or gigantic trees, work with data sets that just barely fit into memory to begin with. Figure, for example, that you're working with an array of bits that's 1GB long. In that case, storing this auxiliary array would take up 64 × 1GB = 64GB of memory to write out - an enormous amount of memory! At least in 2022 when I'm writing this, that's way more than your average desktop machine has. (If you're reading this in the 2030s, just change "GB" to "TB" and you'll have the same visceral reaction that us 2020ers feel. ^_^)

    Our goal in this discussion will be to build a data structure that allows us to quickly answer rank queries, but which uses as few extra bits of memory as possible. This will take us into some unusual territory. We'll be quantifying the space usage of our data structures by counting up the total number of bits we use, which means that we cannot say that storing a single integer uses O(1) memory. And we'll need to bust out some clever techniques that, while well known to the community of data structures researchers, aren't that common in general computer science or software engineering.

    Step 1: Use Space-Efficient Numbers

    Our first step in this journey will be to take the above idea - make an array of integers that stores all the possible prefix sums - and to whittle down the 64× memory blowup to something more reasonable.

    To see how to do this, let's imagine that we're working with an array of bits that's exactly 1023 bits long. Why 1023 bits? It's because that's exactly 2^10 - 1 bits. Now, suppose we were to write down prefix sums for each of the bits in that array. Because our array of bits only has 2^10 - 1 bits in it, each prefix sum will be a number between 0 and 2^10 - 1, inclusive. (Why? Because each bit is either 0 or 1, and in the absolute worst case when all bits are 1, the total sum will be 2^10 - 1.) If you think back to how binary numbers are represented, this means that each of our prefix sums can be represented as a 10-bit number. It would be wasteful to use a full 64-bit integer to write out each of these prefix sums; we'd only be using 10 of those 64 bits, and the other 54 bits would always be zeros!

    If we were to look at how this would be represented in memory, it would look something like this:

    A schematic of a 1,023-bit array with a separate array running down the side made of prefix sums of those bits. Each prefix sum is written out as a 10-bit number.

    Most programming languages (and chips, for that matter) don't support working with an array of 10-bit integers. But thanks to the power of bitwise operations, we can easily simulate something like this. We'll form an array whose elements are 64-bit integers. We'll then completely ignore the fact that they're 64-bit integers and instead just treat our array as a long stream of bits. To get a particular 10-bit integer value out of that array, we just need to do some bitwise operations to locate which 64-bit integer(s) hold our 10-bit value, then extract the relevant bits and reassemble them. It's not the most straightforward calculation to do, but it's also not that terrible. Here's some C code for this:

    #include <assert.h>
    #include <stdint.h>

    /* Given an array of elements that are each bit_width bits long,
     * represented as an array of 64-bit integers holding the relevant
     * bits, extract the element at position index. It's assumed that
     * bit_width <= 64.
     *
     * "Beware of bugs in [this] code; I have only proved it correct, not tried it."
     */
    uint64_t extract_bits_from(uint64_t* bit_array,
                               unsigned bit_width,
                               unsigned index) {
        assert(0 < bit_width && bit_width <= 64);
    
        /* Special-case the scenario where we want a 64-bit value,
         * which just means we do an array read.
         */
        if (bit_width == 64) return bit_array[index];
    
        /* Otherwise, we have 63 or fewer bits. */
    
        /* Logical index of the first bit that we want to extract. */
        unsigned first_bit = bit_width * index;
    
        /* Physical index into the array of 64-bit integers where we
         * need to look.
         */
        unsigned int_index = first_bit / 64;
        uint64_t first_int = bit_array[int_index];
    
        /* Determine how many bits we're going to pull from this
         * number. To do this, we'll look at the offset of the bit
         * position we start at and see where that falls in our 64-
         * bit value.
         */
        unsigned bit_start = first_bit % 64;
    
        /* There are two options. The first is that all the bits we
         * need are in this integer. In that case, extract them and
         * go home.
         */
        if (bit_start + bit_width <= 64) {
            /* Shift everything down by the appropriate amount,
             * then mask off the higher bits.
             */
            return (first_int >> bit_start) & ((1ULL << bit_width) - 1);
        }
    
        /* Otherwise, pull the lower bits from this integer and the
         * higher bits from the next integer. First, we have to see
         * how many bits to read.
         */
        unsigned low_bit_count = 64 - bit_start;
        uint64_t low_bits = (first_int >> bit_start) & ((1ULL << low_bit_count) - 1);
    
        unsigned high_bit_count = bit_width - low_bit_count;
        uint64_t next_int = bit_array[int_index + 1];
        uint64_t high_bits = next_int & ((1ULL << high_bit_count) - 1);
    
        return low_bits | (high_bits << low_bit_count);
    }
    

    In the specific case where we're working with an array of 2^10 - 1 bits, this approach will require an auxiliary array of 10n bits. That's much lower than the 64n bits we started with, though it's still a huge blowup over the original array size.

    Before moving on, let's take a minute to generalize this idea. If our array has 2^10 - 1 bits in it, we need 10 bits per number. Similar reasoning tells us that if our array has 2^15 - 1 bits, we need 15 bits per number. Running this process backwards, you can work out that with an array of n bits, we need ⌈log2(n+1)⌉ bits to write out each prefix sum. This means that, in the general case, this approach will allow us to answer rank queries in time O(1), and will use O(n log n) total bits.

    We can then ask: is there a way to knock down our space usage? Fortunately, the answer is yes. But to get there, we're going to need to make a series of clever insights.

    Step 2: Write Down Fewer Numbers

    Why is our space usage O(n log n) right now? That's because

    • we are writing down an array of O(n) numbers, and
    • each number requires O(log n) bits.

    To reduce our space usage, we either need to write down fewer total numbers, or write down fewer bits per number, or, ideally, both. Right now it might not be clear how to do either of these, but it turns out that each of these ideas becomes pretty natural once we look at things the right way.

    Our first step will be to use this first idea and to write down fewer numbers. Here's a nice way of thinking about how to do this. Right now, we're writing down a prefix sum right before each of the numbers, as shown here:

    A bit array, alongside a secondary array that stores the prefix sums at each number

    That makes it easy to answer rank queries: we can just look in our array of prefix sums and read off the answer.

    However, here's a cute little observation. Suppose that instead of storing the prefix sums just before each bit, we store the prefix sum before every other bit. That would look like this:

    The same diagram as above, except that every other entry of the prefix array has been removed.

    This might seem like a strange idea - we've just thrown away half the prefix sums! - and yet it still lets us query prefix sums efficiently (in time O(1)). Here's how. If you want to compute a prefix sum at an even-numbered position, just read off the precomputed value within that array slot. For example, to compute rank(6), we'd look up the array entry at index 6 / 2 = 3, zero-indexed. That's 5, which is indeed the correct answer.

    If you want to compute a prefix sum at an odd-numbered position, on the other hand, we won't be able to read off a precomputed value right from the get-go. However, our odd-numbered entry is right next to an even-numbered entry. We can compute the prefix sum by reading the even-numbered entry right before us, then adding in the value of the bit that appears right after that position. For example, to compute rank(5), we'd first compute rank(4) = 3 by looking at our precomputed array. Then, we'd look at the bit at index 4 (zero-indexed) in the bit array. It's a 1, so the answer to our rank query is 3 + 1 = 4. Tada!

    Overall, this approach halves the amount of extra memory we're using (we've thrown away half of our numbers), and our queries are just about as fast as before. Whereas previously we had just had to look up a single array entry, now we have to look up an array entry, then look at one bit from the original array of bits.

    We've just cut our memory usage down by half - a huge improvement from before! Could we take this idea further? The answer is yes. We'll begin by picking some integer b that we'll call our block size. We'll then group our array of bits into blocks of b bits each. Then, we'll only write down prefix sums at the start of each block, effectively only storing a 1/b fraction of the original number of prefix sums we started with. For example, here's what this might look like on a sample bitvector with b=8:

    A bitvector of 48 bits. It's been subdivided into 6 blocks of 8 bits each. At the start of each block, we've written the sum of all the bits before that block

    To see how to compute rank queries here, let's work through rank(30), the sum of the first 30 bits. The idea here is a generalization of what we did earlier: we'll find the last array entry where we have a prefix sum computed, then add in the missing bits from the array itself. Here's how we do this.

    1. First, we compute ⌊30/8⌋ = 3, the index of the block where the query ends. We then look at array entry 3 (zero-indexed) to get the sum of all the bits up to the start of that block, which is 14.
    2. The number we have is the sum of the first 24 bits, and we want the sum of the first 30. To do this, we need to read 30 % 8 = 6 more bits from the block. So read the first six bits of block 3 (zero-indexed) and add them up. Doing this, we find that the sum of those bits is 4.
    3. Add these numbers together to get 14 + 4 = 18, our final answer.

    How fast is this approach? Well, step (1) always takes time O(1). It's just an array lookup. Step (2), however, depends on how big we make our blocks. If b is small, we won't have to scan over too many bits. If b is large, we'll have to scan a large number of bits. Overall, the total work we do ends up being O(b).

    But what of our space usage? Well, if we were to store the full array of prefix sums, we'd use O(n log n) space: there would be (roughly) n numbers, each of which uses (roughly) log2 n bits. But now that we're only storing the prefix sums at the start of each block, we're storing only roughly 1/b of these prefix sums. That means we now are storing approximately n / b prefix sums, each of which still needs approximately log2 n bits. That makes our space usage O((n log n) / b).

    So we're now left with an interesting situation:

    • Our space usage is O((n log n) / b) bits. This means we want to make b large so that we don't use too much memory.
    • Our query time is O(b). This means that we want to keep b small so that our rank queries can be done quickly.

    There is no optimal choice of b to make here to balance out these forces. If you absolutely must get rank queries running quickly, you'd have to pick a small choice of b and pay some extra memory cost. On the other hand, if you're okay with slower queries, you could crank b up pretty high to drop the space usage down to a manageable amount.

    This leads to a question: is there a way to get the best of both worlds? That is, can we get lower space usage while also getting fast queries? The answer, surprisingly, is yes! Let's see how we get there.

    Step 3: Write Down Smaller Numbers

    Earlier on, we mentioned that there were two different ways we could try to reduce our O(n log n) storage space. The first was to store fewer numbers. We've done that by only writing down prefix sums every so often. The second was to write down fewer bits per number. That might seem impossible, but when you look at it the right way it's actually not too bad. Let's explore that second option.

    As a refresher of where we are right now, we've broken our array apart into blocks of b bits each, for some parameter b we can choose however we'd like. We've then written down the prefix sums at the start of each block. Things look like this:

    A schematic of breaking bits apart into blocks of b bits each, then writing the prefix sum at the start of each block

    Currently, the cost of performing a rank query is O(b). That's because as part of a rank query, we have to scan across the bits of a block, adding them up, and the way we're doing that is by running a linear scan across those bits. Is there a way to speed things up?

    Indeed there is, and this is where we get to an important insight. If you think about it, scanning across the bits of a block and adding them up is essentially the same as performing a rank query on a smaller array of bits. That is, we've started off by trying to answer the question "what is the sum of the first i bits of the overall array?," and we've turned it into "what is the sum of the first i % b bits of the block we end up in?" In other words, we're left with a smaller version of the same problem we started off with!

    What can we do with this information? Our very first strategy for performing rank queries was to write down prefix sums for each bit. That would allow us to very quickly compute prefix sums without having to read over lots of bits from the original number. Let's try repeating this idea here. What happens if, within each block, we write down the prefix sums before each bit? That might look like this:

    The two-level structure. At the top level, we subdivide our bitvector into blocks of size b, and write down the prefix sums at the start of each block. Within each block, we write down prefix sums for the bits in that block. For example, for the block 11000100, we'd write down the numbers 0, 1, 2, 2, 2, 2, 3, 3

    Here, I've shown the prefix sums within just one of these blocks, but we'd have similar prefix sums running across all the blocks. I just couldn't find a way to fit that all into one picture. :-)

    Here's what a query now looks like. Suppose we want to compute rank(20), the sum of the first 20 bits. Here's how we do it.

    • As before, we compute ⌊20 / 8⌋ = 2 to get the index of the block where the query ends. We then read index 2 (zero-indexed) from the top-level array of prefix sums to get the sum of the bits up to the start of this block, which is 11.
    • Now, we look within our block. We currently know the sum of the first 16 bits, and so we need to get the sum of the first 20 % 8 = 4 bits of this block. To do that, we look at the table for this block and read entry 4 (zero-indexed). Doing so tells us that the sum of the first four bits of the block is 2.
    • We add these values together, which gives us our total of 11 + 2 = 13, which is the correct answer.

    Notice that this entire process is driven by table lookups - there are no linear scans required! And in fact, regardless of what choice we make for our block size b, we end up doing O(1) work per query: the cost of doing the necessary divisions, mods, and table reads. Amazing!

    How much space does this approach need? There are two components to this. The first place where we're using auxiliary storage space is for that top-level array of prefix sums. As we saw before, if we have blocks of b bits each, this uses O((n log n) / b) bits.

    But now we have to factor in the space required to write down prefix sums within each of the blocks. How much space will we need for this? The major insight we need here is that prefix sums within a block use fewer bits than prefix sums across the whole array. In an array of n bits, each prefix sum requires O(log n) bits, since the prefix sum can be anything from 0 to n, inclusive. But in a block of b bits, where b is probably much smaller than n, we only need to use O(log b) bits for a prefix sum, since within a block the sum can range from 0 to b, inclusive. This will turn out to be a major idea for us in designing these data structures: if you have a smaller array, you need fewer bits for your prefix sums!

    To work out how much total space is needed for all those prefix sums across the whole data structure, we can use the following calculation. In each block of b bits, we will write down b prefix sums, each of which is O(log b) bits long, so we need O(b log b) total bits. And since there are roughly n / b blocks (n bits are grouped into blocks of b bits each), the total space usage for the relative indices within each block is O(n log b).

    Overall, this means that the total amount of memory we're using for this two-level approach is O((n log n) / b + n log b) bits. That's a weird expression, so let's take a minute to unpack it.

    • If we pick b to be tiny, then our blocks have few bits each. This means that the O(n log b) term is small. However, this also means that we have a large number of blocks, so our O((n log n) / b) term will be too big.
    • If we pick b to be huge, then we won't have that many blocks. Specifically, we'll have n / b blocks, and for huge values of b this quantity is small. That means that the O((n log n) / b) term will be small. However, with a large value of b, our blocks get huge, and the number of bits required to write down an offset within a block will get too big (i.e. the O(n log b) term will be too big).

    The fact that making b too small or too big isn't good for our overall space usage suggests that there's an optimal choice of b to make that would balance these terms out. And indeed there is! If we pick b = O(log n) (that is, pick b to be some multiple of log n), then our overall space usage is minimized. In particular, our space usage comes back as follows:

    O((n log n) / b + n log b)

    = O((n log n) / log n + n log log n)

    = O(n + n log log n)

    = O(n log log n)

    Et voilà! We've gotten our space usage down to O(n log log n) bits. Keep in mind that for any reasonable value of n, the quantity log log n is tiny. For example, suppose you have an array of n = 2^64 bits. Then log n = 64 (assuming we're using base-2 logarithms, which we are) and log log n = 6. This is a great improvement on our original O(n log n) bits of storage space!

    And yet, while log log n is a small number, because our solution uses O(n log log n) bits of memory, it still requires more memory than the original array of bits we started with. Is there a way to drop the space usage even further?

    Of course, the answer is yes. And doing so will involve some surprising mathematics.

    (This is Part One of a two-part answer. Part Two is available here.)