Tags: data-structures, trie, space-complexity, estimation

How to estimate the size of a trie?


I have a trie that contains base-62 alphanumeric keys, each 100 bytes long. I have 5 × 10^11 keys. How can I estimate how much RAM / disk space would be required to store this trie?


Solution

  • This is going to depend on how you represent the trie and what space-saving optimizations you perform. It's also going to depend on the particular strings that you're storing.

    For starters, note that if you have a trie with w total words in it, the minimum number of nodes you'll need to encode that trie is w, one for each word you're storing. We're going to use this as a lower bound on the space needed to encode the trie.

    Let's begin with a simple trie representation strategy. Imagine that each node has the following form:

    struct Node {
        bool isWord;         // true if a key ends at this node
        Node* children[62];  // one slot per possible next character
    };
    

    On a 64-bit system, this is going to require at least 8 × 62 ≈ 500 bytes per node. Assuming you need to store 5 × 10^11 nodes, the space required is then

    5 × 10^11 × 500 bytes = 250 TB

    This is completely impractical.
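
    If it helps to sanity-check that arithmetic, here's a minimal sketch (the node count of 5 × 10^11 is the assumption from above, not a measurement):

    #include <cstdio>

    struct Node {
        bool isWord;
        Node* children[62];
    };

    int main() {
        // On a typical 64-bit system the bool pads out to 8 bytes, so
        // sizeof(Node) is 8 + 62 * 8 = 504 bytes -- roughly the 500 used above.
        const double numNodes = 5e11;  // assumed: one node per key
        const double totalBytes = numNodes * sizeof(Node);
        std::printf("sizeof(Node) = %zu bytes\n", sizeof(Node));
        std::printf("total = %.0f TB\n", totalBytes / 1e12);
        return 0;
    }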

    Since you have a static trie, another option would be to have each node store a fixed-size array of character/pointer pairs, like this:

    #include <cstdint>

    struct Node;

    struct Child {
        char letter;  // character labeling the edge to this child
        Node* child;
    };

    struct Node {
        uint8_t numChildren;
        Child children[]; // Flexible array member, sized per node at build time
    };
    

    Since you're assuming that all your strings have the same length, you can identify which nodes are words because they'll have no children.
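
    To make the representation a bit more concrete, here's a rough sketch of how such a node might be built and searched, building on the Child/Node structs above (the makeNode and step helper names are mine, purely for illustration):

    #include <cstdint>
    #include <cstdlib>

    // Allocate a node with room for exactly n children. This works because the
    // trie is static, so each node's child count is known when it's built.
    Node* makeNode(uint8_t n) {
        Node* node = static_cast<Node*>(std::malloc(sizeof(Node) + n * sizeof(Child)));
        node->numChildren = n;
        return node;
    }

    // Follow the edge labeled c out of node, or return nullptr if there is none.
    Node* step(const Node* node, char c) {
        for (uint8_t i = 0; i < node->numChildren; i++) {
            if (node->children[i].letter == c) {
                return node->children[i].child;
            }
        }
        return nullptr;
    }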

    Now, how much space are we going to need? On a 64-bit system, factoring in padding, each Node will take up 8 bytes plus an extra 8 bytes for each of its children. If we have n total nodes in the trie, summing the child counts across all the trie nodes gives back n - 1, since each node except the root is a child of exactly one other node. That gives a space usage of 16n bytes to hold the trie. A conservative upper bound on the space needed to store 5 × 10^11 strings that are 100 characters long would be 5 × 10^13 nodes of 16 bytes each, for a net total of 800 TB. That's still impractical.

    Of course, the assumption that each of the strings will result in 100 new nodes being added in is not a reasonable one, and there will be a lot of sharing involved. But it still shows that we're going to need more compression to make this feasible.

    An alternative option here would be to use a Patricia trie. In case you haven't seen Patricia tries before, the basic idea is the following:

    1. Add in a new character that represents "end of string," then tack that character onto the end of each string. The convention is to use $ for that character.

    2. For each node in the trie that only has one child, merge that node into its parent. This means that trie edges now can have multiple characters on them.

    As an example, here's a Patricia trie for the strings "ant," "ante," "anteater," "antelope," and "antique":

    [Figure: a sample Patricia trie for these strings]

    There's a useful theorem about Patricia tries: A Patricia trie holding w words has at most 2w total nodes. (The basic idea behind the proof is that each node in the Patricia trie that isn't a leaf node has multiple children, and so the number of internal nodes can't exceed the number of leaves).

    We can represent a Patricia trie compactly in the following way. First, write out all the strings being stored in the trie, one after the other, as a giant string. In the above example, that might look like this:

    ant$anteater$ante$antique$antelope$

    Then, represent each edge as a pair of integers denoting the start and end positions of the characters making up that edge. For example, the edge labeled "ant" in the above Patricia trie could be encoded as the pair [0, 2]. The overall encoding might then look like this:

    #include <cstddef>
    #include <cstdint>

    struct Node;

    struct Child {
        size_t start;  // index of the first character of this edge's label
        size_t end;    // index of the last character of this edge's label
        Node* child;
    };

    struct Node {
        uint8_t numChildren;
        Child children[];
    };
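
    As a rough sketch of how those [start, end] pairs would be used during a lookup (the followEdge helper and the text parameter are illustrative names of mine, building on the structs above), descending a single edge might look like this:

    #include <cstddef>
    #include <string>

    // Try to match the next part of key (starting at position pos) against the
    // edge label text[edge.start .. edge.end], where text is the one big
    // concatenated string described above. Returns the child node on a full
    // match and advances pos, or returns nullptr on a mismatch.
    Node* followEdge(const std::string& text, const Child& edge,
                     const std::string& key, size_t& pos) {
        for (size_t i = edge.start; i <= edge.end; i++, pos++) {
            if (pos >= key.size() || key[pos] != text[i]) {
                return nullptr;
            }
        }
        return edge.child;
    }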
    

    Each edge requires 24 bytes of storage. Each node requires 8 bytes of storage. And we also have to write down all the characters in all the strings. This means that the space usage is

    (space for all characters of the strings) + (space for all nodes) + (space for all edges)

    = (100w bytes) + (16w bytes) + (24w bytes)

    = 140w bytes

    = 140 × 5 × 10^11 bytes

    = 70 TB.

    This is getting better, but it's still not great. Let's see if we can reduce the space further.

    For starters, the above calculation assumes that we store all 2w nodes in the trie. But w of those nodes are leaves, and we can encode those as null pointers in the Child struct. That reduces the number of nodes to encode by half, giving this space usage:

    (space for all characters of the strings) + (space for all nodes) + (space for all edges)

    = (100w bytes) + (8w bytes) + (24w bytes)

    = 132w bytes

    = 132 × 5 × 10^11 bytes

    = 66 TB.

    By far the biggest space hog here is the space required to write out all the characters. Can we pare this down a bit?

    You mention that your strings are all 100 characters long and have characters drawn from an alphabet of size 62. Adding in the end-of-string marker gives 63 possible characters, so six bits per character is enough rather than a full byte. At six bits per character, each 100-character string takes 600 bits, or 75 bytes. That reduces our space usage like this:

    (space for all characters of the strings) + (space for all nodes) + (space for all edges)

    = (75w bytes) + (8w bytes) + (24w bytes)

    = 107w bytes

    = 107 × 5 × 10^11 bytes

    = 53.5 TB.
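
    If it helps to see what six-bit packing could look like, here's a small sketch (the pack6bit helper and its particular character-to-code mapping are my own assumptions, not part of the answer):

    #include <cstdint>
    #include <string>
    #include <vector>

    // Map each of the 63 symbols (62 alphanumerics plus the '$' terminator)
    // to a 6-bit code and pack the codes tightly into bytes.
    std::vector<uint8_t> pack6bit(const std::string& text) {
        auto code = [](char c) -> uint32_t {
            if (c >= '0' && c <= '9') return c - '0';         // codes 0..9
            if (c >= 'A' && c <= 'Z') return 10 + (c - 'A');  // codes 10..35
            if (c >= 'a' && c <= 'z') return 36 + (c - 'a');  // codes 36..61
            return 62;                                        // '$' terminator
        };

        std::vector<uint8_t> out;
        uint32_t buffer = 0;  // bit buffer; only its low `bits` bits are pending
        int bits = 0;
        for (char c : text) {
            buffer = (buffer << 6) | code(c);
            bits += 6;
            while (bits >= 8) {
                out.push_back(static_cast<uint8_t>(buffer >> (bits - 8)));
                bits -= 8;
            }
        }
        if (bits > 0) {
            out.push_back(static_cast<uint8_t>(buffer << (8 - bits)));  // flush
        }
        return out;
    }

    Packed this way, a 100-character string takes 600 bits, or 75 bytes, which is where the 75w term above comes from.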

    Now, this assumes we store all of the strings as-is, but that's not necessarily the best way to do this. You only need to write out enough of the characters to ensure that each edge has some subrange of characters to point at. This would likely be a massive savings, but I don't know how to precisely quantify this because it'll depend on the particular strings being used. I'm guessing (?) you'd knock half the space off, leaving about 30 TB of space needed.

    So overall, if you use a Patricia trie for your strings, you could probably get away with 30 TB of space. Some more optimizations to look at:

    1. Could the indices be reduced from 64 bits down to, say, 48 bits? (The concatenated text here is at most about 5 × 10^13 characters, which fits comfortably in 48 bits.) That could save some space per edge, bringing things down further; see the sketch after this list.

    2. Could you store the text compressed, then decompress the text as needed? That could potentially push things down even further.
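
    On the first point, one possible tighter layout is sketched below. This start-plus-length variant is my own illustration, not something from the answer above; it relies on the facts that roughly 5 × 10^13 characters of text fit in a 48-bit offset and that no edge label here can exceed 101 characters.

    #include <cstdint>

    struct Node;

    // Sketch: a smaller Child that stores a 48-bit offset into the big text
    // plus a one-byte label length, instead of two 64-bit indices.
    struct PackedChild {
        uint8_t startBytes[6];  // 48-bit start offset, e.g. little-endian
        uint8_t length;         // length of the edge label (at most 101 here)
        Node*   child;
    };
    // sizeof(PackedChild) is 16 on a typical 64-bit system, down from 24.

    On the edge count used in the estimates above (roughly w edges at 24 bytes each), that would save about 8w bytes, or around 4 TB at w = 5 × 10^11.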

    Hope this helps!