I have a large list of short strings (words) and I want to check if any of them appear inside another string (sentence). Note, I don't care about actual words/spaces/punctuation/etc.
This is the typical solution in python:
def contains_one_of(sentence, words):
for word in words:
if word in sentence:
return word
return None
I've seen some one python one-liners to do the same thing, but algorithmically everything I can find seems to be basically calling a contains function for all elements. And I assume the contains function uses a sort of sliding window approach.
Complexity by my reckoning is roughly O(nmo)
Where n = length of list, m = length of sentence, o = avg length of word in list
To me I think this can be improved with a tree but I can't find any reference to such an algorithm. I basically envision the array of words becoming a tree, where a node is a letter and all of its children are the next letter of the word. So long as the words are short and have reasonable overlap I think this would be more efficient.
I've implemented a version of this in python but I'd much rather use a package that leverages C for comparing all of those characters. If you know the name of this algorithm or a package that does this I would love to know.
Here is my version, I'm sure lots can by optimized but I'd like to know if I'm on to something here or not.
sentence = "hello there cat, welcome home"
words = ["cat", "car", "cam", "arm", "ace", "arc"]
# build a dict tree per letter
def build_tree(patterns):
root = dict()
for p in patterns:
r = root
for i, c in enumerate(p):
if c not in r:
if i >= len(p) - 1: # last element
r[c] = p
else: # any other element
r[c] = dict()
r = r[c]
return root
# Check if the substring starts with a path through the tree
def starts_with_tree(sub, tree):
level = tree
for c in sub:
if c not in level: # nowhere left to go
return None
elif isinstance(level[c], str): # if we found a string we are at the end
return level[c]
else:
level = level[c] # go deeper
# Check if s contains any path through the tree
def contains_which(s, root):
for i in range(len(s)):
sub = s[i:] # A substring missing the first i characters
result = starts_with_tree(sub, root)
if result:
return result
return None
# build the tree
tree_root = build_tree(words)
print(tree_root)
# search within tree
found = contains_which(sentence, tree_root)
print("Found:", found)
It use trie structure(a kind of tree) and time complexity is just O(m + o*n)(with your definition)(Linear time complexity with length sum of all strings)
If you are not familiar with this algorithm, implementation of this is quite complex. So you can use python library for aho-corasick pyahocorasick
More detail