Search code examples
c++data-structureshashtabletrietreap

Which data structure is most suitable to implement a Dictionary?


I have to write a Dictionary program as a semester project for an undergraduate course on Data Structures and Algorithms, and I am expected to find the most suitable solution (Data Structure) to the problem.

I considered using either a hash table or a trie. I have been suggested to use treaps by someone but have not been able to look into them yet.

My database has about 100k distinct words and their meanings. The basic functionalities the program is expected to provide are insert, update, remove and search a word/definition. If I manage to squeeze in auto-completion and spell correction, that'd be an added bonus.

So, my question is, keeping in mind my requirements, which data structure would be best suited for my purposes. When I say 'best', I am asking for the data structure which has best runtime complexity and low cost (memory requirements).

Also, I wanted to be able to have an algorithm which returned all words starting with the given prefix. For example, say I make a function call dictionary.getWordsStartingWith("fic") it should return a list of all words that start with fic such as fiction, fictitious,fickle etc. I know I can do this if I implement my dictionary as a trie, I could do this, but is this possible to do it with a hash table?


Solution

  • You almost certainly want a trie if you want to do auto completion/prefix matching. Hash tables don't really make this possible; in fact good hash functions are designed such that even very similar keys (e.g. same prefix) map to completely different parts of the array. For hashing purposes this is considered a feature.

    Treaps are basically binary search trees that use stochasticity + heap property to do their balancing. In general the interface is the standard BST tree interface; so it's really just an implementation detail that only leads to moderately different properties than a red black tree or an AVL tree.

    BST's aren't nearly as suited to the problems that you seem to be looking to solve as a trie. BST's tend to be all about following inequalities downwards, whereas trie's are about following equalities downward. When you're dealing with numeric data, inequality comparisons are everything because equality is very rare (since the space of possibilities is huge). With strings, each character has very few possibilities so it makes more sense to exploit equalities, leading to optimizations like not actually storing keys at most nodes.

    In summary, I'd recommending proceeding with tries. They're very heavily used for exactly this sort of thing, and you can find a ton of resources on optimizing them (especially for space) since they're particularly used for text input on mobile where space/cycles are at a premium. It's also a very interesting data structure to learn IMHO, compared to BST's which you a) probably learned about heavily in freshman data structures, and b) Isn't really that interesting of a data structure; everything other than the balancing scheme is trivial and the balancing schemes are more tedious than anything else (RB trees have something like 7 truly distinct cases for balancing or something like that, pretty hard to code a RB tree and get them all exactly right).

    The wikipedia page has some good info: https://en.wikipedia.org/wiki/Trie. Bitwise tries look especially interesting.