c algorithm search compression inverted-index

Compressing a list of integers for search access

I have an prefix-to-rows mapping that looks something like this:

{
    "x":   [9],
    "far": [1,2,3,4,5],
    "car": [1,4,5]
}

The key is the indexed search term and the array is a sorted list of the rows that have a match. Simple enough, a basic inverted index. And for this question, let's suppose a-z0-9 characters with a maximum length of three characters (upper bound or 36+(36^2)+(36^3)=47,988 combinations, though probably much less in practice, let's say around 10k combinations).

However, the tricky part is that I may have ~ 10M rows, and low-cardinality items could have a (meaningless) list of all 10M rows. In my calculations an 10M-row array itself comes out to 88.9MB uncompressed.

What would be a suggested way to compress these often-repeated arrays? It seems that this must be a very common occurrence in search, and I'd like to learn a bit more about the best method of handling large and repeating prefix maps, such as with the above.

Solution

I would recommend using something like Roaring Bitmaps (described in this paper). There are implementations in Python, Java, and C, and they automatically switch between 3 different formats for optimal storage density. If you want to implement something similar yourself, it essentially combines:

Arrays (the default)
Bitsets (good for more dense collections)
Run-length encoding (store the start of every "run" of continuous numbers and the length of that run)

The concept works for 32-bit unsigned integers, which comfortably could contain information on 10M rows without any additional work necessary for customizing your own solution.