Tags: python, string, full-text-search, python-re

Python - Fast count of words in a text from a list of strings, including words that start with a given prefix


I know that similar questions have been asked several times, but my problem is a bit different and I am looking for a time-efficient solution, in Python.

I have a set of words; some of them end with "*" and others don't:

words = set(["apple", "cat*", "dog"])

I have to count their total occurrences in a text, where anything can follow an asterisk ("cat*" means all the words that start with "cat"). The search has to be case-insensitive. Consider this example:

text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

I would like to get a final score of 4 (= cat* x 2 + dog + apple). Please note that "cat*" has been counted twice, because the plural also matches, whereas "apple" has been counted just once, because its plural is not considered (it has no asterisk at the end).

I have to repeat this operation on a large set of documents, so I need a fast solution. I don't know whether regex or flashtext would be fast enough. Could you help me?

EDIT

I forgot to mention that some of my words contain punctuation, for example:

words = set(["apple", "cat*", "dog", ":)", "I've"])

This seems to create additional problems when compiling the regex. Can the code you already provided be extended to handle these two additional words?


Solution

  • You can do this with a regex built from the set of words: put word boundaries (\b) around each word, but leave the trailing word boundary off words that end with *. Compiling the regex once should help performance:

    import re
    
    words = set(["apple", "cat*", "dog"])
    text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
    
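    # Prefix words ("cat*") get only a leading \b; exact words get \b on both sides.
    regex = re.compile('|'.join([r'\b' + w[:-1] if w.endswith('*') else r'\b' + w + r'\b' for w in words]), re.I)
    # findall returns one entry per match, so its length is the total count.
    matches = regex.findall(text)
    print(len(matches))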
    

    Output:

    4
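
For the words with punctuation from the EDIT (":)" and "I've"), the pattern above runs into trouble: ")" is a regex metacharacter, and \b cannot match next to ":" unless a word character sits on its other side. One possible adaptation, offered as a sketch rather than a definitive fix, is to escape each word with re.escape and swap \b for the lookarounds (?<!\w) and (?!\w). The helper names build_pattern and count_matches are made up for this illustration, and so is the second sample document. The pattern is compiled once and reused across all documents.

```python
import re

def build_pattern(words):
    """Hypothetical helper: compile one case-insensitive pattern for all words.

    A trailing "*" marks a prefix word; every other word must match exactly.
    """
    parts = []
    for w in sorted(words):  # sorted only to make the pattern deterministic
        is_prefix = w.endswith("*")
        # re.escape neutralises metacharacters such as ")" in ":)".
        core = re.escape(w[:-1] if is_prefix else w)
        # (?<!\w) and (?!\w) mean "no word character on this side".
        # Unlike \b, they also work when the word itself starts or
        # ends with punctuation.
        parts.append(r"(?<!\w)" + core + ("" if is_prefix else r"(?!\w)"))
    return re.compile("|".join(parts), re.IGNORECASE)

def count_matches(pattern, text):
    """Hypothetical helper: count non-overlapping matches in one text."""
    return sum(1 for _ in pattern.finditer(text))

words = {"apple", "cat*", "dog", ":)", "I've"}
pattern = build_pattern(words)  # compile once, reuse for every document

documents = [
    "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS",
    "I've seen a dog :)",  # invented sample to exercise the punctuation words
]
print([count_matches(pattern, doc) for doc in documents])  # [4, 3]
```

One consequence of this choice: a smiley glued directly to a word, as in "hello:)", is not counted, because the lookbehind requires that no word character precedes the match. Whether that is the desired behaviour depends on your data. An empty word set would also need separate handling, since an empty pattern matches everywhere.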