Search code examples
pythonconcurrencysuffix-tree

Concurrent insertions in a Suffix Tree


Some time ago I posted a question about saving/retrieving a Suffix Tree from disk. That's finally working fine, but now the construction is extremely slow, and I don't want to mess with the Ukkonen's algorithm (linear construction) right now.

So, I wanted to make concurrent insertions to accelerate the process without having to make the tree thread-safe.

The Suffix Tree stores words by its initial character (look the image posted on my previous question), thus, the word Banana is in the 'B' child of the root node, and Apple would be in the 'A' child and so forth. So, the insertion of a word starting with 'B' will never interfere with a insertion starting with 'A'. My idea is to have a thread for each initial character of the set of words being inserted: a thread inserting 'A's, another thread inserting 'B's, etc.

So I was thinking about a Executer class, that just adds words to queue of words of each Process (first create it if it doesn't exists).

class Executer:
    #...
    def concurrent_insertion(word):
        k = word[0]
        processes.get(k, Process()).add(word)
    # ...

And the class Process is the one that makes the insertions. Each Process instance is a independent thread, with a Queue containing the words that still has to insert.

In this Process class is where I'm having troubles, I'm guessing it should inherit from threading.Thread, because each instance should be a thread, but how do I keep it alive until all the text processing is done? I mean, it should be inserting words from its Queue of words, but when the Queue is empty the thread should not die, just keep waiting till more words fill the Queue, "wake up" and continue inserting.


Solution

  • Threads won't die until they exit, so you can keep them alive with a while True loop.

    The usual pattern looks like this:

    q = Queue.Queue()             # word insertion queue
    terminate = object()          # sentinel value to tell a thread to terminate
    
    def worker(q):
        while True:
             word = q.get()       # block until next word is available
             if word is terminate:
                 break
             insert_word(word)
    

    After launching the workers and sending words to the queue, the main thread needs to wait for all the work to be completed and then it should shut down the workers.

    for word in wordlist:
        q.put(word)
    for i in range(numthreads):
        q.put(terminate)          # terminate all the worker threads
    for t in threadlist:
        t.join()                  # wait for them all to finish
    

    An alternative way of waiting for all the work to be done is to use q.task_done and q.join. An example for how to use them is shown at the bottom of the page in the docs for the Queue module.