java string algorithm data-structures trie

Space differences in implementations of a trie

The more I read about tries the more confused I get for some reason.
What confuses me now is the following:
I have read about 2 types of implementation.

Using arrays to represent the characters (not storing the characters themselves) and, in each node, also storing the index of the actual word (if we have reached a word).
Using a collection of nodes that store characters, with a boolean in each node to indicate whether we have reached a word by going down this path.

In the first case it is not mentioned, but it seems that we must actually keep all the dictionary words (since we reference them indirectly). So the space would be array_size * numberOfNodes * lengthOfWord plus the size of the dictionary processed.

In the latter case we don't need the dictionary, since the chars are stored directly in the tree. So it seems to me that the second implementation is more space efficient, but I am not sure by how much.
Is my understanding of the implementations correct, and are there specific reasons to choose one over the other? Also, how could we calculate the space requirements for the second case?

Solution

Tries do not store the original words anywhere; they store them implicitly, as paths from the root. The basic structure of a trie is the following: each node in the trie stores

A single bit determining whether or not the path that arrives at the node forms a word, and
A collection of pointers to child nodes labeled by characters.

To determine whether a word is in the trie, you start at the root, then follow the appropriately labeled pointers one character at a time. If, after consuming the last character, you arrive at a node marked as a word, then the word exists in the trie. If you arrive at a node that isn't marked, or you fall off the trie because a needed child pointer is missing, the word is not present.
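The lookup described above can be sketched in Java roughly as follows. This is only a minimal illustration, not a reference implementation: the names SimpleTrie, Node, insert and contains are made up for this example, and a HashMap is assumed as the child collection purely for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal trie sketch; all class and method names here are illustrative. */
class SimpleTrie {
    private static final class Node {
        // The per-node marker: does the path ending here spell a word?
        boolean isWord;
        // Child pointers labeled by character (a HashMap is an assumption).
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    /** Walks down from the root, creating missing nodes, then marks the end. */
    void insert(String word) {
        Node cur = root;
        for (int i = 0; i < word.length(); i++) {
            cur = cur.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        cur.isWord = true;
    }

    /** Follows one labeled pointer per character; no word is stored anywhere. */
    boolean contains(String word) {
        Node cur = root;
        for (int i = 0; i < word.length(); i++) {
            cur = cur.children.get(word.charAt(i));
            if (cur == null) {
                return false; // fell off the trie
            }
        }
        return cur.isWord; // reached a node; it must also be marked
    }
}
```

After inserting "car" and "cart", contains("ca") is false because that node exists but is unmarked, and contains("cat") is false because the walk falls off the trie at 't'.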

The difference between the two structures you have listed above is how the child pointers are stored. In the first version, the child pointers are stored as an array of one pointer per symbol in the alphabet, which makes following child pointers extremely fast but can be extremely space-inefficient. In the second version, you explicitly store some type of collection holding just the labeled pointers you need. This is slower, but is more space efficient for sparse tries.
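To make the contrast concrete, here is a rough Java sketch of the two node layouts. Everything in it is an assumption for illustration: the class names ArrayNode and SparseNode are invented, the alphabet is assumed to be lowercase 'a' to 'z', and the sparse version uses parallel sorted arrays as one possible "collection".

```java
import java.util.Arrays;

/** Version 1: one pointer slot per alphabet symbol (assumes 'a'..'z' only). */
class ArrayNode {
    boolean isWord;
    // 26 slots are allocated even if the node has a single child.
    final ArrayNode[] children = new ArrayNode[26];

    ArrayNode child(char c) {
        return children[c - 'a']; // direct indexing, no search
    }

    ArrayNode addChild(char c) {
        if (children[c - 'a'] == null) {
            children[c - 'a'] = new ArrayNode();
        }
        return children[c - 'a'];
    }
}

/** Version 2: only the children that exist, kept sorted by label. */
class SparseNode {
    boolean isWord;
    char[] labels = new char[0];               // sorted child labels
    SparseNode[] children = new SparseNode[0]; // children[i] matches labels[i]

    SparseNode child(char c) {
        int i = Arrays.binarySearch(labels, c); // O(log d) for d children
        return i >= 0 ? children[i] : null;
    }

    SparseNode addChild(char c) {
        int i = Arrays.binarySearch(labels, c);
        if (i >= 0) {
            return children[i];
        }
        int at = -(i + 1); // insertion point that keeps labels sorted
        char[] newLabels = new char[labels.length + 1];
        SparseNode[] newChildren = new SparseNode[children.length + 1];
        System.arraycopy(labels, 0, newLabels, 0, at);
        System.arraycopy(children, 0, newChildren, 0, at);
        newLabels[at] = c;
        newChildren[at] = new SparseNode();
        System.arraycopy(labels, at, newLabels, at + 1, labels.length - at);
        System.arraycopy(children, at, newChildren, at + 1, children.length - at);
        labels = newLabels;
        children = newChildren;
        return newChildren[at];
    }
}
```

The array version pays for every slot up front but finds a child with a single index operation; the sparse version stores one label and one pointer per existing child, at the cost of a binary search on lookup and an array copy on insertion.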

The space usage of a trie depends on the number of nodes (call it n), the size of the alphabet (call it k), and the way in which child pointers are represented. If you store a fixed-size array of pointers, then the space usage is about kn pointers (n nodes with k pointers each), plus n bits for the markers at each node. If you have, say, a dynamic array of pointers stored in sorted order, the overhead will be about n child pointers in total (every node except the root is pointed to by exactly one parent), plus n bits, plus n times the amount of space necessary to store a single collection.
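As a back-of-the-envelope check, the two estimates above can be written as small Java functions. This is a sketch under stated assumptions: the class and parameter names are invented, the pointer, label and per-collection sizes are inputs you would have to supply for your own platform, and real JVM overhead (object headers, alignment) is ignored. The sparse estimate also charges one label per child pointer, which the description above folds into the collection cost.

```java
/** Rough space estimates in bits; all names and parameters are illustrative. */
class TrieSpaceEstimate {

    /** Fixed array per node: k pointer slots per node plus one marker bit. */
    static long fixedArrayBits(long n, int k, int pointerBits) {
        return n * k * pointerBits + n;
    }

    /**
     * Sparse collection per node: n - 1 child pointers in total (each with
     * a label), one marker bit per node, and a fixed overhead per collection.
     */
    static long sparseBits(long n, int pointerBits, int labelBits,
                           int perCollectionBits) {
        return (n - 1) * (pointerBits + labelBits) + n + n * perCollectionBits;
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // assumed node count
        System.out.println(fixedArrayBits(n, 26, 64));
        System.out.println(sparseBits(n, 64, 16, 128));
    }
}
```

With these assumed numbers (a million nodes, 26 symbols, 64-bit pointers, 16-bit labels, 128 bits of overhead per collection), the fixed-array layout comes to 1,665,000,000 bits and the sparse one to 208,999,920 bits. The gap narrows as the trie gets denser, since the fixed cost of k slots per node is then actually used.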

The advantage of the first approach is speed and simplicity, with very good performance on dense tries. The second is slower but more memory efficient for sparse tries.

These are not the only space optimizations possible. Patricia tries collapse chains of nodes that have just one child into a single node and are very space-efficient. DAWGs (directed acyclic word graphs) try to merge as many equivalent nodes as possible, but do not support efficient insertions.

Hope this helps!