
Implementing a Patricia Trie for use as a dictionary


I'm attempting to implement a Patricia Trie with the methods addWord(), isWord(), and isPrefix() as a means to store a large dictionary of words for quick retrieval (including prefix search). I've read up on the concepts, but they just aren't turning into an implementation. I want to know (in Java or Python code) how to implement the Trie, particularly the nodes (or should I implement it recursively?). I saw one person who implemented it with an array of 26 child nodes set to null/None. Is there a better strategy (such as treating the letters as bits), and how would you implement it?


Solution

  • Someone else asked a question about Patricia tries a while ago and I thought about making a Python implementation then, but this time I decided to actually give it a shot (yes, this is way overboard, but it seemed like a nice little project). What I have made is perhaps not a pure Patricia trie implementation, but I like my way better. Other Patricia tries (in other languages) use just a list for the children and check each child to see if there is a match, but I thought this was rather inefficient, so I use dictionaries. Here is basically how I've set it up:

    I'll start at the root node. The root is just a dictionary. Its keys are all single characters (the first letters of words), and each key leads to a branch. The value for each key is a two-item list: the first item is a string giving the rest of the text that this branch matches, and the second item is a dictionary leading to further branches from this node. That dictionary also has single-character keys, each the first letter of the remainder of a word, and the process continues down the trie.
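    To illustrate the difference between the two ways of finding a child, here is a minimal sketch. The function names find_child_in_list and find_child_in_dict are made up for this comparison and are not part of the implementation further down; the nodes use the [rest_of_branch, child_dictionary] layout just described.

```python
# Hypothetical helper names, used only for this comparison.

def find_child_in_list(children, ch):
    """Children kept as a list of (key, node) pairs: scan until a key matches."""
    for key, node in children:
        if key == ch:
            return node
    return None


def find_child_in_dict(children, ch):
    """Children kept in a dict keyed by first character: one hash lookup."""
    return children.get(ch)


# The same two branches in both representations.
as_list = [('a', ['bc', {}]), ('d', ['ef', {}])]
as_dict = {'a': ['bc', {}], 'd': ['ef', {}]}

print(find_child_in_list(as_list, 'd'))  # ['ef', {}]
print(find_child_in_dict(as_dict, 'd'))  # ['ef', {}]
print(find_child_in_dict(as_dict, 'z'))  # None
```

    The list version does work proportional to the number of children at each node, while the dictionary version does a single hash lookup on average.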

    Another thing I should mention is that if a given node has branches, but also is a word in the trie itself, then that is denoted by having a '' key in the dictionary that leads to a node with the list ['',{}].

    Here's a small example that shows how words are stored (the root node is the attribute _data):

    >>> x = patricia()
    >>> x.addWord('abcabc')
    >>> x._data
    {'a': ['bcabc', {}]}
    >>> x.addWord('abcdef')
    >>> x._data
    {'a': ['bc', {'a': ['bc', {}], 'd': ['ef', {}]}]}
    >>> x.addWord('abc')
    >>> x._data
    {'a': ['bc', {'a': ['bc', {}], 'd': ['ef', {}], '': ['', {}]}]}
    

    Notice that in the last case, a '' key was added to the dictionary to denote that 'abc' is a word in addition to 'abcdef' and 'abcabc'.
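    Since the question also asks about a recursive approach, here is a rough sketch of a recursive lookup over this layout. It is only an illustration: the function name is_word and the variable root are made up here, and the dictionary is written out by hand to match the three words in the example above.

```python
def is_word(data, word):
    """Return True if `word` is stored beneath the node dictionary `data`."""
    node = data.get(word[:1])
    if node is None:
        return False  # no branch starts with this character
    rest, children = node
    tail = word[1:]
    if not tail.startswith(rest):
        return False  # the word diverges from this branch
    remainder = tail[len(rest):]
    if remainder == '':
        # The word ends at this node: either it is a leaf, or the node
        # carries the '' marker described above.
        return not children or '' in children
    return is_word(children, remainder)


# The structure holding 'abcabc', 'abcdef' and 'abc', written out by hand.
root = {'a': ['bc', {'a': ['bc', {}], 'd': ['ef', {}], '': ['', {}]}]}

print(is_word(root, 'abc'))     # True
print(is_word(root, 'abcdef'))  # True
print(is_word(root, 'ab'))      # False
```

    Each call consumes one branch of the trie, so the recursion depth is bounded by the number of branches along the word's path rather than by the number of characters.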

    Source Code

    class patricia():
        def __init__(self):
            self._data = {}
    
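        def addWord(self, word):
            data = self._data
            i = 0
            while 1:
                try:
                    node = data[word[i:i+1]]
                except KeyError:
                    # No branch starts with this character, so add a new one.
                    if data:
                        data[word[i:i+1]] = [word[i+1:],{}]
                    else:
                        if word[i:i+1] == '':
                            # Empty word on an empty trie: nothing to store.
                            return
                        else:
                            if i != 0:
                                # We are extending a former leaf, so mark the
                                # word that already ended there.
                                data[''] = ['',{}]
                            data[word[i:i+1]] = [word[i+1:],{}]
                    return
    
                i += 1
                if word.startswith(node[0],i):
                    if len(word[i:]) == len(node[0]):
                        # The word ends exactly at this node. A leaf already
                        # counts as a word; an internal node needs a '' marker.
                        if node[1] and '' not in node[1]:
                            node[1][''] = ['',{}]
                        return
                    else:
                        # The whole branch matches: descend to its children.
                        i += len(node[0])
                        data = node[1]
                else:
                    # The word diverges partway along this branch. Find the
                    # length j of the shared part, then split the branch there.
                    ii = i
                    j = 0
                    while ii != len(word) and j != len(node[0]) and \
                          word[ii:ii+1] == node[0][j:j+1]:
                        ii += 1
                        j += 1
                    tmpdata = {}
                    # The rest of the old branch keeps its existing children.
                    tmpdata[node[0][j:j+1]] = [node[0][j+1:],node[1]]
                    # The rest of the new word (key '' if the word ends here).
                    tmpdata[word[ii:ii+1]] = [word[ii+1:],{}]
                    data[word[i-1:i]] = [node[0][:j],tmpdata]
                    return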
    
        def isWord(self,word):
            data = self._data
            i = 0
            while 1:
                try:
                    node = data[word[i:i+1]]
                except KeyError:
                    return False
                i += 1
                if word.startswith(node[0],i):
                    if len(word[i:]) == len(node[0]):
                        if node[1]:
                            try:
                                node[1]['']
                            except KeyError:
                                return False
                        return True
                    else:
                        i += len(node[0])
                        data = node[1]
                else:
                    return False
    
        def isPrefix(self,word):
            data = self._data
            i = 0
            wordlen = len(word)
            while 1:
                try:
                    node = data[word[i:i+1]]
                except KeyError:
                    return False
                i += 1
                if word.startswith(node[0][:wordlen-i],i):
                    if wordlen - i > len(node[0]):
                        i += len(node[0])
                        data = node[1]
                    else:
                        return True
                else:
                    return False
    
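        def removeWord(self,word):
            data = self._data
            parent = None  # node whose child dictionary is `data` (None at the root)
            i = 0
            while 1:
                try:
                    node = data[word[i:i+1]]
                except KeyError:
                    print("Word is not in trie.")
                    return
                i += 1
                if word.startswith(node[0],i):
                    if len(word[i:]) == len(node[0]):
                        if node[1]:
                            # The word ends at an internal node: drop its '' marker.
                            if '' not in node[1]:
                                print("Word is not in trie.")
                                return
                            node[1].pop('')
                            self._merge(node)
                            return
                        # The word ends at a leaf: drop the leaf itself.
                        data.pop(word[i-1:i])
                        if parent is not None:
                            self._merge(parent)
                        return
                    else:
                        i += len(node[0])
                        parent = node
                        data = node[1]
                else:
                    print("Word is not in trie.")
                    return
    
        def _merge(self,node):
            # Fold a node's only remaining child into the node itself. Without
            # this, removing the last child would leave an empty dictionary,
            # which isWord treats as "a word ends here".
            if len(node[1]) == 1:
                key, child = node[1].popitem()
                node[0] += key + child[0]
                node[1] = child[1]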
    
    
        __getitem__ = isWord
    

    You may have noticed that at the end I set __getitem__ to the isWord method. This means that

    x['abc']
    

    will return whether 'abc' is in the trie or not.

    I think that maybe I should make a module out of this and submit it to PyPI, but it needs more testing first. If you find any bugs, let me know, but it seems to be working pretty well. Also, if you see any big improvements in efficiency, I would like to hear about them. I've considered doing something about the empty dictionaries at the bottom of each branch, but I'm leaving them for now. For instance, these empty dictionaries could be replaced with data linked to the word, which would expand the uses of the implementation.
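    For the testing mentioned above, it can help to list everything a trie holds and compare that against a plain set of the words that were added. The following is a sketch under the assumption that the structure follows the layout described earlier; the generator name iter_words and the hand-written root dictionary are made up for this illustration and are not part of the class.

```python
def iter_words(data, prefix=''):
    """Yield every word stored beneath the node dictionary `data`."""
    for key, (rest, children) in data.items():
        word = prefix + key + rest
        if children:
            # Internal node: any word ending here shows up as its '' child.
            yield from iter_words(children, word)
        else:
            # Leaf (including a '' marker): the path so far is a word.
            yield word


# The structure holding 'abcabc', 'abcdef' and 'abc', written out by hand.
root = {'a': ['bc', {'a': ['bc', {}], 'd': ['ef', {}], '': ['', {}]}]}

print(sorted(iter_words(root)))  # ['abc', 'abcabc', 'abcdef']
```

    With the class itself, the same check would be sorted(iter_words(x._data)) against the sorted list of words that should be present.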

    Anyway, if you don't like the way I implemented it, at least maybe this will give you some ideas about how you would like to implement your own version.