Tags: python, parsing, full-text-search, negate

Python text search library


I am looking for a library that would let me do something like the following:

matches(
    user_input="hello world how are you what are you doing",
    keywords='+world -tigers "how are" -"bye bye"'
)

Basically I want it to match strings based on presence of words, absence of words and sequences of words. I don't need a search engine a la Solr, because strings will not be known in advance and will only be searched once. Does such a library already exist, and if so, where would I find it? Or am I doomed to creating a regex generator?


Solution

  • The third-party regex module supports named lists (\L<name>):

    import regex
    
    def match_words(words, string):
        return regex.search(r"\b\L<words>\b", string, words=words)
    
    def match(string, include_words, exclude_words):
        # Truthy when at least one include word is present and no exclude word is.
        return (match_words(include_words, string) and
                not match_words(exclude_words, string))
    

    Example:

    if match("hello world how are you what are you doing",
             include_words=["world", "how are"],
             exclude_words=["tigers", "bye bye"]):
        print('matches')
    

    You could emulate named lists with the standard re module, e.g.:

    import re
    
    def match_words(words, string):
        re_words = '|'.join(map(re.escape, sorted(words, key=len, reverse=True)))
        return re.search(r"\b(?:{words})\b".format(words=re_words), string)
    

    Follow-up question: how do I build the lists of included and excluded words from the +, - and "" grammar?

    You could use shlex.split():

    import shlex
    
    include_words, exclude_words = [], []
    for word in shlex.split('+world -tigers "how are" -"bye bye"'):
        (exclude_words if word.startswith('-') else include_words).append(word.lstrip('-+'))
    
    print(include_words, exclude_words)
    # -> ['world', 'how are'] ['tigers', 'bye bye']
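
Putting the pieces together, here is a minimal sketch of the matches() call from the question, using only the standard library. The helper names parse_keywords and contains are hypothetical. The sketch also assumes that every included term must be present, whereas match() above succeeds as soon as any one include word is found. Matching is case-sensitive, and terms are assumed to start and end with word characters so that \b behaves as expected.

```python
import re
import shlex


def parse_keywords(keywords):
    """Split a query such as '+world -tigers "how are"' into two lists."""
    required, excluded = [], []
    for token in shlex.split(keywords):
        # shlex removes the quotes, so -"bye bye" arrives as '-bye bye'.
        target = excluded if token.startswith('-') else required
        term = token.lstrip('-+')
        if term:  # ignore a bare '+' or '-'
            target.append(term)
    return required, excluded


def contains(term, text):
    """Return True if term occurs in text as whole words (case-sensitive)."""
    return re.search(r"\b{}\b".format(re.escape(term)), text) is not None


def matches(user_input, keywords):
    """Assumed semantics: all included terms present, no excluded term present."""
    required, excluded = parse_keywords(keywords)
    return (all(contains(t, user_input) for t in required) and
            not any(contains(t, user_input) for t in excluded))


text = "hello world how are you what are you doing"
print(matches(text, '+world -tigers "how are" -"bye bye"'))  # -> True
print(matches(text, '+world +tigers'))                       # -> False
print(matches(text, '+world -"are you"'))                    # -> False
```

Checking each term separately keeps the "all required" rule explicit and avoids building one large alternation. For many terms and long strings, a single precompiled pattern per term list may be worth measuring instead.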