Search code examples
pythonalgorithmtreecontains

Check if a string contains an elemewnt in a large list quickly using a tree


I have a large list of short strings (words) and I want to check if any of them appear inside another string (sentence). Note, I don't care about actual words/spaces/punctuation/etc.

This is the typical solution in python:

def contains_one_of(sentence, words):
    for word in words:
        if word in sentence:
            return word
    return None

I've seen some one python one-liners to do the same thing, but algorithmically everything I can find seems to be basically calling a contains function for all elements. And I assume the contains function uses a sort of sliding window approach.

Complexity by my reckoning is roughly O(nmo)

Where n = length of list, m = length of sentence, o = avg length of word in list

To me I think this can be improved with a tree but I can't find any reference to such an algorithm. I basically envision the array of words becoming a tree, where a node is a letter and all of its children are the next letter of the word. So long as the words are short and have reasonable overlap I think this would be more efficient.

I've implemented a version of this in python but I'd much rather use a package that leverages C for comparing all of those characters. If you know the name of this algorithm or a package that does this I would love to know.

Here is my version, I'm sure lots can by optimized but I'd like to know if I'm on to something here or not.

sentence = "hello there cat, welcome home"
words = ["cat", "car", "cam", "arm", "ace", "arc"]

# build a dict tree per letter
def build_tree(patterns):
    root = dict()
    for p in patterns:
        r = root
        for i, c in enumerate(p):
            if c not in r:
                if i >= len(p) - 1: # last element
                    r[c] = p
                else: # any other element
                    r[c] = dict()
            r = r[c]
    return root
            
# Check if the substring starts with a path through the tree
def starts_with_tree(sub, tree):
    level = tree
    for c in sub:
        if c not in level: # nowhere left to go
            return None
        elif isinstance(level[c], str): # if we found a string we are at the end
            return level[c]
        else:
            level = level[c] # go deeper
            

# Check if s contains any path through the tree
def contains_which(s, root):
    for i in range(len(s)):
        sub = s[i:] # A substring missing the first i characters
        result = starts_with_tree(sub, root) 
        if result:
            return result
    return None
        

# build the tree
tree_root = build_tree(words)
print(tree_root)
# search within tree
found = contains_which(sentence, tree_root)
print("Found:", found)

Solution

  • You can use aho-corasick algorithm.

    It use trie structure(a kind of tree) and time complexity is just O(m + o*n)(with your definition)(Linear time complexity with length sum of all strings)

    If you are not familiar with this algorithm, implementation of this is quite complex. So you can use python library for aho-corasick pyahocorasick

    More detail

    Wikipedia

    python aho-corasick library