Building a regular expression to find text near each other


I'm having trouble getting this search to work:

import re

word1 = 'this'
word2 = 'that'
sentence = 'this and that'

print(re.search('(?:\b(word1)\b(?: +[^ \n]*){0,5} *\b(word2)\b)|(?:\b(word2)\b(?: +[^ \n]*){0,5} *\b(word1)\b)',sentence))

I need to build a regex search that checks whether a string contains up to 5 different substrings, in any order, within a certain number of other words (so two strings could be 3 words apart, three strings a total of 6 words apart, etc.).

I've found a number of similar questions, such as "Regular expression gets 3 words near each other. How to get their context?" and "How to check if two words are next to each other in Python?", but none of them quite do this.

So if the search words were 'this', 'that', 'these', and 'those' and they appeared within 9 words of each other in any order, then the script would output True.

It seems like writing an if/else block with all sorts of different regex statements to accommodate the different permutations would be rather cumbersome, so I'm hoping there is a more efficient way to code this in Python.


Solution

  • ANSWER CHANGED: this can be done with just a regular expression. The approach is to start with lookaheads that require all target words to be present in the next N words, then match a sequence of target words (in any order), each separated from the next by zero or more other words (up to the allowed maximum number of intermediate words).

    The word span (N) is the largest number of words a match can cover: all the target words, with the maximum allowed number of other words between each consecutive pair.

    For example, with 3 target words and a maximum of 4 other words between them, the maximum word span is 11: the 3 target words plus 2 intermediate runs of at most 4 other words each, i.e. 3+4+4=11.
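
    The arithmetic above can be written as a one-line helper. This is only an illustrative sketch; the function name word_span is hypothetical and does not appear in the answer's code:

```python
def word_span(num_targets, max_inter):
    # Every target word, plus the largest allowed gap between each consecutive pair.
    return num_targets + max_inter * (num_targets - 1)

print(word_span(3, 4))  # 11, i.e. 3 + 4 + 4
```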

    The search pattern is formed by assembling parts that depend on the words and the maximum number of intermediate words allowed.

    Pattern : \bALL(\b(ANY)\b(\W+\w+){0,INTER}\W*){COUNT,COUNT}

    Breakdown:

    • \b starts the match on a word boundary.
    • ALL is replaced by a series of lookaheads that ensure every target word is found in the next N words.
    • Each lookahead has the form (?=(\w+\W+){0,SPAN}WORD\b), where WORD is a target word and SPAN is N-1, the largest number of words that can precede WORD inside a span of N words. There is one such lookahead per target word, so the sequence of N words must contain all of the target words.
    • (\b(ANY)\b(\W+\w+){0,INTER}\W*) matches any target word followed by zero to INTER intermediate words and the separator before the next target word. ANY is replaced by a pattern that matches any of the target words (i.e. the words separated by pipes), and INTER is replaced by the allowed number of intermediate words (maxInter in the code). The \b on both sides of ANY keeps a target word from matching inside a longer word.
    • {COUNT,COUNT} ensures that there are as many repetitions of the above as there are target words. This corresponds to the pattern: targetWord+intermediates+targetWord+intermediates...+targetWord
    • With the lookaheads placed before the repeating pattern, a match must contain all the target words, in a sequence of exactly COUNT target words with no more intermediate words between them than allowed. One caveat: the repetition does not require the COUNT target words to be distinct, so a text that repeats one target word can still match even though another target word sits more than INTER words away.
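
    How a single lookahead limits the window can be checked on its own. This is a minimal sketch for the target word "other" with SPAN set to 8. It writes the lookahead with \W+ between words and a non-capturing group, which are choices of this sketch rather than part of the answer:

```python
import re

# Lookahead for one target word: at most 8 words may come before "other".
lookahead = r"\b(?=(?:\w+\W+){0,8}other\b)"

in_window  = "this and that or other"    # "other" is the 5th word
out_window = "a b c d e f g h i other"   # "other" is the 10th word

print(bool(re.match(lookahead, in_window)))   # True
print(bool(re.match(lookahead, out_window)))  # False
```

    re.match only tries the start of the string, so the result reflects the window that begins at the first word.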

    ...

    import re
    
    words    = ["this", "that", "other"]  # a list keeps the printed pattern order stable
    maxInter = 3 # maximum intermediate words between the target words
    
    # widest window: every target word plus maxInter words in each gap
    wordSpan = len(words)+maxInter*(len(words)-1)
    
    anyWord  = "|".join(re.escape(w) for w in words)
    # one lookahead per target word: it must occur within the next wordSpan words
    allWords = "".join(r"(?=(\w+\W+){0,SPAN}WORD\b)".replace("WORD",re.escape(w))
                        for w in words)
    allWords = allWords.replace("SPAN",str(wordSpan-1))
    
    # a target word, up to maxInter other words, then the separator before the next one
    pattern = r"\bALL(\b(ANY)\b(\W+\w+){0,INTER}\W*){COUNT,COUNT}"
    pattern = pattern.replace("COUNT",str(len(words)))
    pattern = pattern.replace("INTER",str(maxInter))
    pattern = pattern.replace("ALL",allWords)
    pattern = pattern.replace("ANY",anyWord)
    
    
    textList = [
       "looking for this and that and some other thing", # YES
       "that rod is longer than this other one",         # NO: 4 words apart
       "other than this, I have nothing",                # NO: missing "that"
       "ignore multiple words not before this and that or other", # YES
       "this and that or other, followed by a bunch of words",    # YES
               ] 
    

    output:

    print(pattern)
    
    \b(?=(\w+\W+){0,8}this\b)(?=(\w+\W+){0,8}that\b)(?=(\w+\W+){0,8}other\b)(\b(this|that|other)\b(\W+\w+){0,3}\W*){3,3}
    
    for text in textList:
        found = bool(re.search(pattern,text))
        print(found,"\t:",text)
    
    True    : looking for this and that and some other thing
    False   : that rod is longer than this other one
    False   : other than this, I have nothing
    True    : ignore multiple words not before this and that or other
    True    : this and that or other, followed by a bunch of words
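
    For reuse, the same construction can be wrapped in a function. The sketch below is one possible packaging, not the answer's own code: the names near_pattern and words_near are hypothetical, target words are passed through re.escape, and the groups are non-capturing. It is applied here to the four words from the question:

```python
import re

def near_pattern(words, max_inter):
    """Build a proximity regex for `words` (hypothetical helper)."""
    words = list(words)
    # Widest window, in words: all targets plus max_inter words in each gap.
    span = len(words) + max_inter * (len(words) - 1)
    any_word = "|".join(re.escape(w) for w in words)
    # One lookahead per target word: it must occur within the window.
    all_words = "".join(
        rf"(?=(?:\w+\W+){{0,{span - 1}}}{re.escape(w)}\b)" for w in words
    )
    # A target word, then up to max_inter other words, once per target word.
    body = rf"(?:\b(?:{any_word})\b(?:\W+\w+){{0,{max_inter}}}\W*){{{len(words)}}}"
    return re.compile(r"\b" + all_words + body)

def words_near(text, words, max_inter):
    """Return True if the proximity pattern matches anywhere in `text`."""
    return bool(near_pattern(words, max_inter).search(text))

targets = ["this", "that", "these", "those"]
print(words_near("these are not those, but this is that", targets, 3))  # True
print(words_near("this one and that one", targets, 3))                  # False
```

    Compiling the pattern once in near_pattern is a design choice that helps when the same word list is checked against many strings.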