Problem definition: Given a text of length n characters and a list of t terms (which may be regular expressions), find and count all occurrences of the terms in the text.
Here is a naive implementation for that:
import re

class WordFrequency(TextAnalysis):
    """Determines the frequency of words from a vocabulary in a given text"""

    def __init__(self, vocabulary, text):
        """
        :param vocabulary: contains the words (e.g. a list)
        :param text: the analysed text
        """
        self.text = text
        self.vocabulary = vocabulary
        self.matches = {}

    def run(self):
        """
        :return: self for method chaining
        """
        ltext = self.text.lower()
        self.freq = {}  # word -> absolute frequency
        for word in self.vocabulary:
            matches = re.findall(r'\b' + word + r'\b', ltext)
            self.matches[word] = matches
            self.freq[word] = len(matches)
        return self
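For reference, intended usage looks roughly like this (a small illustrative example, not from the original code base):

vocabulary = ['cat', 'dog', 'dodge']
text = 'The dog chased the cat; the cat dodged.'

wf = WordFrequency(vocabulary, text).run()   # run() returns self for chaining
print(wf.freq)                               # {'cat': 2, 'dog': 1, 'dodge': 0}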
Now this takes about 6 seconds for a text of ca. 35,000 characters and a list of ca. 5,000 terms, which is too slow. It seems that the time complexity is O(t * n), because for each of the t terms the text has to be scanned once. Is there an obvious performance bug here? What are possible optimizations and/or better algorithms?
This can be made to work in O(t * log(n)). I currently have two implementations of this running in production.
Implementation #1:
Done in pure Python. I constructed a search tree from the (smaller) pattern file, where each node of the tree is a letter that links to a hash of possible next letters. For example, with the three patterns cat, dog and dodge, the following tree is constructed automatically in O(n):
{
    'c': {'a': {'t': 'cat'}},
    'd': {'o': {'d': {'g': {'e': 'dodge'}},
                'g': 'dog'}}
}
You can now scan the text and look up every word (or every character) in this lookup tree in O(log(n)).
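A minimal sketch of that idea in pure Python, assuming plain (non-regex) terms and whole-word matching; the helper names build_trie and count_with_trie are illustrative, and a '$' key is used as a terminal marker instead of the leaf-as-string layout above, so that one term can be a prefix of another:

import re
from collections import Counter

def build_trie(vocabulary):
    """Build a nested-dict trie; the '$' key marks the end of a stored term."""
    root = {}
    for term in vocabulary:
        node = root
        for ch in term.lower():
            node = node.setdefault(ch, {})
        node['$'] = term          # terminal marker storing the original term
    return root

def count_with_trie(trie, text):
    """Tokenize the text into words and look each one up in the trie."""
    freq = Counter()
    for word in re.findall(r'\w+', text.lower()):
        node = trie
        for ch in word:
            node = node.get(ch)
            if node is None:
                break
        else:
            if '$' in node:       # the whole word is a vocabulary term
                freq[node['$']] += 1
    return freq

# trie = build_trie(['cat', 'dog', 'dodge'])
# count_with_trie(trie, 'The dog chased the cat; the cat dodged.')
# -> Counter({'cat': 2, 'dog': 1})

A lookup never walks more nodes than the word has letters, which is why the text only has to be scanned once instead of once per term.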
I did not support regex for this solution, although it would be possible. The downside is that Python is not particularly fast at this, and the hash tree consumes a lot of memory. I contemplated using PyPy, rewriting it in Perl or C, and using multiprocessing.
Implementation #2:
A well-known tool called grep already does all of the above. It supports regular expressions and can accept a file of patterns. However, it does not handle large pattern files well: its performance degrades dramatically as the pattern file grows, possibly because I use regexes heavily. I ended up splitting the pattern file into multiple slices and feeding them to grep in parallel processes. For my applications grep works 10X faster. Note: set the environment variable $LANG to '', as grep is hampered by severe localization slowness.
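A rough sketch of that splitting/parallel approach, assuming GNU grep is on the PATH; the helper names, chunk size, worker count and flag choice (-o -i -w) are illustrative assumptions rather than the exact production setup, and it counts occurrences per matched string rather than per pattern. The locale is forced with LC_ALL=C, a common equivalent of the $LANG trick above:

import os
import subprocess
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def grep_chunk(patterns, text_file):
    """Run grep with one slice of the pattern list and return its matches."""
    with tempfile.NamedTemporaryFile('w', suffix='.pat', delete=False) as tmp:
        tmp.write('\n'.join(patterns) + '\n')
        pattern_file = tmp.name
    env = dict(os.environ, LC_ALL='C')   # sidestep the localization slowness
    try:
        result = subprocess.run(
            ['grep', '-o', '-i', '-w', '-f', pattern_file, text_file],
            capture_output=True, text=True, env=env)
        # grep exits with status 1 when a chunk has no matches, so no check=True
        return result.stdout.splitlines()
    finally:
        os.unlink(pattern_file)

def count_with_grep(vocabulary, text_file, chunk_size=500, workers=8):
    """Split the vocabulary into slices and feed them to grep in parallel."""
    chunks = [vocabulary[i:i + chunk_size]
              for i in range(0, len(vocabulary), chunk_size)]
    freq = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for matches in pool.map(lambda c: grep_chunk(c, text_file), chunks):
            freq.update(match.lower() for match in matches)
    return freq

Threads are enough here because the actual matching happens in the grep child processes; Python only waits for them and merges the results.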
Conclusion:
Building a targeted engine in C would be ideal, but using a working and widely available GPL tool can save you a few months of your life.