I have already seen this answer to a similar question: https://stackoverflow.com/a/44311921/5881884
Where the ahocorasick algorithm is used to show if each word in a list exists in a string or not with O(n). But I want to get the frequency of each word in a list in a string.
For example if
my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]
I would want the result:
[2, 3, 1, 0]
I did not find an exact example for this in the documentation, any idea how to accomplish this?
Other O(n) solutions than using ahocorasick would also be appreciated.
Implementation:
Here's an Aho-Corasick frequency counter:
import ahocorasick
def ac_frequency(needles, haystack):
frequencies = [0] * len(needles)
# Make a searcher
searcher = ahocorasick.Automaton()
for i, needle in enumerate(needles):
searcher.add_word(needle, i)
searcher.make_automaton()
# Add up all frequencies
for _, i in searcher.iter(haystack):
frequencies[i] += 1
return frequencies
(For your example, you'd call ac_frequency(my_list, my_string)
to get the list of counts)
For medium-to-large inputs this will be substantially faster than other methods.
Notes:
For real data, this method will potentially yield different results than the other solutions posted, because Aho-Corasick looks for all occurrences of the target words, including substrings.
If you want to find full-words only, you can call searcher.add_word
with space/punctuation-padded versions of the original string:
...
padding_start = [" ", "\n", "\t"]
padding_end = [" ", ".", ";", ",", "-", "–", "—", "?", "!", "\n"]
for i, needle in enumerate(needles):
for s, e in [(s,e) for s in padding_start for e in padding_end]:
searcher.add_word(s + needle + e, i)
searcher.make_automaton()
# Add up all frequencies
for _, i in searcher.iter(" " + haystack + " "):
...