Search code examples
pythontriedawg

How to create a trie in Python


I'm interested in tries and DAWGs (direct acyclic word graph) and I've been reading a lot about them but I don't understand what should the output trie or DAWG file look like.

  • Should a trie be an object of nested dictionaries? Where each letter is divided in to letters and so on?
  • Would a lookup performed on such a dictionary be fast if there are 100k or 500k entries?
  • How to implement word-blocks consisting of more than one word separated with - or space?
  • How to link prefix or suffix of a word to another part in the structure? (for DAWG)

I want to understand the best output structure in order to figure out how to create and use one.

I would also appreciate what should be the output of a DAWG along with trie.

I do not want to see graphical representations with bubbles linked to each other, I want to know the output object once a set of words are turned into tries or DAWGs.


Solution

  • Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:

    >>> _end = '_end_'
    >>> 
    >>> def make_trie(*words):
    ...     root = dict()
    ...     for word in words:
    ...         current_dict = root
    ...         for letter in word:
    ...             current_dict = current_dict.setdefault(letter, {})
    ...         current_dict[_end] = _end
    ...     return root
    ... 
    >>> make_trie('foo', 'bar', 'baz', 'barz')
    {'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}}, 
                 'z': {'_end_': '_end_'}}}, 
     'f': {'o': {'o': {'_end_': '_end_'}}}}
    

    If you're not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (It's like a version of get that also updates the dictionary.)

    Next, a function to test whether the word is in the trie:

    >>> def in_trie(trie, word):
    ...     current_dict = trie
    ...     for letter in word:
    ...         if letter not in current_dict:
    ...             return False
    ...         current_dict = current_dict[letter]
    ...     return _end in current_dict
    ... 
    >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
    True
    >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
    True
    >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
    False
    >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
    False
    >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
    False
    

    I'll leave insertion and removal to you as an exercise.

    Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end. Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.

    Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.