The more I read about tries the more confused I get for some reason.
What confuses me now is the following:
I have read about 2 types of implementation.
Collection
of nodes that store characters and at the end
of each node use a boolean to determine if we reached a word going
down this path In the first case it is not mentioned but it seems that we must actually keep all the dictionary words (since we indirectly reference them). So we have the array_size*numberOfNodes*lengthOfword + size of dictionary processed
In the latter case we don't need the dictionary since the chars are store directly in the tree. So it seems to me that the second implementation is more space efficient. But I am not sure by how much.
Is my understanding correct on the implementations and is there specific reasons to choose one over the other? Also how could we calculate the space requirements for the second case?
Tries do no store the original words anywhere and instead store them implicitly. The basic structure of a trie is the following: each node in the trie stores
To determine whether a word is in the trie, you start at the root, then follow the appropritately-labeled pointers one at a time. If you arrive at a node marked as a word, then the word exists in the trie. If you arrive at a node that isn't marked or you fall off the trie, the word is not present.
The difference between the two structures you have listed above is how the child pointers are stored. In the first version, the child pointers are stored as an array of one pointer per symbol in the alphabet, which makes following child pointers extremely fast but can be extremely space-inefficient. In the second version, you explicitly store some type of collection holding just the labeled pointers you need. This is slower, but is more space efficient for sparse tries.
The space usage of a trie depends on the number of nodes (call it n), size of the alphabet (call it k), and the way in which child pointers are represented. If you store a fixed-sized array of pointers, then the space usage is about kn pointers (n nodes with k pointers each), plus n bits for the markers at each node. If you have, say, a dynamic array of pointers stored in sorted order, the overhead will be n total child pointers, plus n bits, plus n times the amount of space necessary to store a single collection.
The advantage of the first approach is speed and simplicity, with very good performance on dense tries. The second is slower but more memory efficient for sparse tries.
These are not the only space optimizations possible. Patricia tries compress nodes with just one child together and are very space-efficient. DAWGs try to merge as many nodes as possible together, but do not support efficient insertions.
Hope this helps!