Tags: c#, algorithm, unique, words, suffix-tree

Efficient approach for fast lookup of unique words in C#


I have the following problem: I have to store a list of unique words in multiple languages in memory, and whenever I add a new word I have to check whether it already exists.

Of course this needs to be blazingly fast, primarily because of the huge number of words.

I was thinking about implementing a Suffix Tree, but I wondered whether there is an easier approach with some already implemented internal structures.

P.S. Number of words ≈ 10^7.


Solution

  • First, note that a suffix tree is probably overkill here: it supports fast search for any suffix of any word, which is more than you are asking for. A trie is a very similar data structure that also supports fast lookup of a whole word, but because it does not support fast search for arbitrary suffixes, it is simpler to build, both in programming effort and in construction cost.

    Another, simpler alternative is a plain hash table, which C# provides as HashSet<T>. In theory a HashSet is slower in the worst case (a lookup can degrade to linear time when many keys collide), but the average lookup takes constant time, which may well be enough for your application.
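A minimal sketch of the HashSet approach follows. The helper name `AddAll` and the choice of `StringComparer.Ordinal` (exact, culture-independent comparison) are assumptions made for illustration; the question does not say how words in different languages should be compared.

```csharp
using System;
using System.Collections.Generic;

public static class UniqueWordsDemo
{
    // Hypothetical helper: adds every incoming word to the set and
    // returns how many of them were already present (duplicates).
    public static int AddAll(HashSet<string> words, IEnumerable<string> incoming)
    {
        int duplicates = 0;
        foreach (string word in incoming)
        {
            // HashSet<T>.Add returns false when the item already exists,
            // so the existence check and the insert are a single call.
            if (!words.Add(word))
            {
                duplicates++;
            }
        }
        return duplicates;
    }
}
```

With `var words = new HashSet<string>(StringComparer.Ordinal);`, the strings "hello" and "Hello" count as two different words. If case-insensitive matching is wanted, a different comparer such as `StringComparer.OrdinalIgnoreCase` could be passed instead.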

    My suggestion is:

    1. First try a HashSet, which takes far less effort to implement. Benchmark it and check whether it is fast enough.
    2. Make sure your data structure is swappable, so you can replace it with very little effort if you later decide to. This is usually done by introducing an interface that is responsible for adding and looking up words; if you need to change the structure, just provide a different implementation of that interface.
    3. If you do decide to use a suffix tree or a trie, use community resources. There is no need to reinvent the wheel: most of these data structures have already been implemented and are available online.
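The interface idea from step 2 can be sketched as follows. All names here (`IWordStore`, `HashSetWordStore`, `TrieWordStore`) are hypothetical, and the trie is a deliberately naive illustration rather than a tuned implementation; for roughly 10^7 words, a dictionary per node is likely to cost considerably more memory than the hash set.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction: calling code depends only on this interface,
// so the backing data structure can be swapped later.
public interface IWordStore
{
    bool Add(string word);       // true if the word was not stored before
    bool Contains(string word);
    int Count { get; }
}

// First candidate: a thin wrapper around HashSet<string>.
public sealed class HashSetWordStore : IWordStore
{
    private readonly HashSet<string> _words =
        new HashSet<string>(StringComparer.Ordinal);

    public bool Add(string word) => _words.Add(word);
    public bool Contains(string word) => _words.Contains(word);
    public int Count => _words.Count;
}

// Alternative: a simple trie with one node per character.
public sealed class TrieWordStore : IWordStore
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children =
            new Dictionary<char, Node>();
        public bool IsWord; // true if a stored word ends at this node
    }

    private readonly Node _root = new Node();

    public int Count { get; private set; }

    public bool Add(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            // Walk down the trie, creating missing nodes on the way.
            if (!node.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        if (node.IsWord)
        {
            return false; // already stored
        }
        node.IsWord = true;
        Count++;
        return true;
    }

    public bool Contains(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                return false; // path breaks off: word is not stored
            }
        }
        // A prefix of a stored word is not itself a stored word.
        return node.IsWord;
    }
}
```

Code written against `IWordStore` can start with `new HashSetWordStore()` and switch to `new TrieWordStore()` (or a community trie wrapped in the same interface) by changing a single line, which also makes it easy to benchmark both on the same word list.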