python, regex, lookbehind

Python Regex: negative lookbehind not directly before target word


I am building an NLP baseline script in Jupyter Notebook that should extract all mentions of 'embolism' from reports. However, when the word 'no' or 'not' occurs in the same line/sentence, I do not want the mention included. This is easy with regex when you know where the negation will occur, but there can be many words between it and 'embolism'.

  • Example: The scan has shown an embolism present; should be included
  • Example: No embolism has been found; should be excluded (this is easy with Regex)
  • Problem example: Currently no developing, interesting, nice, beautiful embolism has been found; should be excluded, but I have no idea how.

This is the regex I use to exclude 'no embolism' when the two words are directly adjacent in the sentence:

result = re.findall(r'(?<!no )(embolism\w?)', text)

When I extend the lookbehind to cover multiple words, the standard re module raises: "error: look-behind requires fixed-width pattern"
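
A minimal sketch that reproduces this error with the standard re module. The pattern `(?<!no .*)embolism` is an assumed example of a multi-word lookbehind, not the exact pattern from the original script:

```python
import re

# Assumed example: a lookbehind containing ".*" has no fixed width,
# which the standard re module rejects at compile time.
variable_width = r"(?<!no .*)embolism"

try:
    re.compile(variable_width)
    message = None
except re.error as exc:
    message = str(exc)

print(message)

# A fixed-width lookbehind, such as the one for an adjacent "no ", compiles fine.
fixed_width = re.compile(r"(?<!no )embolism")
```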

I have searched for a way to solve it, but did not find a solution applicable to this problem. I did find that installing the third-party regex module with pip removes the aforementioned error. However, I'm still wondering whether there is another solution to this problem.


Solution

  • You can exclude the last two examples by matching them without capturing, and capture the first example, the one you want to keep, in a group.

    ^(?:.*\bnot?\b.*\bembolism\b.*|.*\bembolism\b.*\bnot?\b.*)|(.*\bembolism\b.*)$
    

    Explanation

    • ^ Start of string
    • (?: Non capture group
      • .*\bnot?\b.*\bembolism\b.* Match a line where no or not comes first, followed by embolism
      • | Or
      • .*\bembolism\b.*\bnot?\b.* Match it the other way around
    • ) Close non capture group
    • | Or
    • (.*\bembolism\b.*) Capture group 1 (what you want to keep) containing embolism
    • $ End of string
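
The pattern above could be applied in Python roughly as follows. This is a sketch under some assumptions: the flags `re.IGNORECASE` (so that a capitalised "No" is also treated as a negation) and `re.MULTILINE` (so that `^` and `$` apply per line), the helper name `keep_positive_lines`, and the one-sentence-per-line layout of the sample report are not part of the original answer.

```python
import re

# Pattern from the answer: lines that contain "no"/"not" together with
# "embolism" are matched without capturing; any other line containing
# "embolism" is captured in group 1.
PATTERN = re.compile(
    r"^(?:.*\bnot?\b.*\bembolism\b.*|.*\bembolism\b.*\bnot?\b.*)|(.*\bembolism\b.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def keep_positive_lines(text):
    """Return the lines that mention 'embolism' without 'no' or 'not'."""
    # Group 1 is None for negated lines, so those matches are skipped.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]


report = "\n".join([
    "The scan has shown an embolism present",
    "No embolism has been found",
    "Currently no developing, interesting, nice, beautiful embolism has been found",
])

kept = keep_positive_lines(report)
print(kept)
```

Only matches with a non-empty group 1 are kept, so the negated lines are consumed by the first alternative and dropped. Note that `\bembolism\b` matches the singular form only; a plural such as "embolisms" would need the pattern adjusted.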
