Some time ago I posted a question about saving/retrieving a Suffix Tree from disk. That's finally working fine, but now the construction is extremely slow, and I don't want to mess with Ukkonen's algorithm (linear-time construction) right now.
So I wanted to make concurrent insertions to accelerate the process without having to make the tree thread-safe.
The Suffix Tree stores words by their initial character (see the image posted in my previous question): the word Banana is in the 'B' child of the root node, Apple would be in the 'A' child, and so forth. So the insertion of a word starting with 'B' will never interfere with an insertion of a word starting with 'A'. My idea is to have a thread for each initial character of the set of words being inserted: one thread inserting the 'A' words, another thread inserting the 'B' words, etc.
So I was thinking about an Executer class that just adds each word to the queue of words of the matching Process (first creating it if it doesn't exist).
class Executer:
    # ...
    def concurrent_insertion(self, word):
        k = word[0]
        # create the Process for this initial character the
        # first time it is seen, then hand the word to it
        if k not in self.processes:
            self.processes[k] = Process()
        self.processes[k].add(word)
    # ...
And the Process class is the one that makes the insertions. Each Process instance is an independent thread, with a Queue containing the words that still have to be inserted.
This Process class is where I'm having trouble. I'm guessing it should inherit from threading.Thread, because each instance should be a thread, but how do I keep it alive until all the text processing is done? I mean, it should be inserting words from its Queue of words, but when the Queue is empty the thread should not die; it should just keep waiting until more words fill the Queue, then "wake up" and continue inserting.
Threads won't die until they exit, so you can keep them alive with a while True loop.
The usual pattern looks like this:
import Queue

q = Queue.Queue()     # word insertion queue
terminate = object()  # sentinel value to tell a thread to terminate

def worker(q):
    while True:
        word = q.get()  # block until the next word is available
        if word is terminate:
            break
        insert_word(word)
After launching the workers and sending words to the queue, the main thread needs to wait for all the work to be completed and then it should shut down the workers.
for word in wordlist:
    q.put(word)
for i in range(numthreads):
    q.put(terminate)  # terminate all the worker threads
for t in threadlist:
    t.join()  # wait for them all to finish
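Applied to your Process class, the same pattern can live inside the thread's run method. Here is a rough sketch under your naming from the question; the stop and insert_word methods are mine, and the try/except import is just so it runs on both Python 2 and 3:

```python
import threading
try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2

terminate = object()  # sentinel telling a worker thread to stop

class Process(threading.Thread):
    """One worker thread per initial character."""
    def __init__(self):
        threading.Thread.__init__(self)
        self.queue = queue.Queue()
        self.start()  # the thread starts waiting for words immediately

    def add(self, word):
        self.queue.put(word)

    def stop(self):
        # enqueue the sentinel; run() exits when it is dequeued
        self.queue.put(terminate)

    def run(self):
        while True:
            word = self.queue.get()  # blocks while the queue is empty
            if word is terminate:
                break
            self.insert_word(word)

    def insert_word(self, word):
        # placeholder: the actual suffix-tree insertion goes here
        pass
```

Because the sentinel is placed in the queue like any other item, every word added before stop() is still processed before the thread exits.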
An alternative way of waiting for all the work to be done is to use q.task_done() and q.join(). An example of how to use them is shown at the bottom of the docs for the Queue module.