Search code examples
regexregex-lookarounds

How to use negative lookahead to ignore multiple word phrases with regular expressions (regex)?


I am doing a task to identify and count occurrences of specific words and phrases in medical Radiology text reports using regular expressions implemented in Python3.

I have regexes working for most of the words/phrases of interest but not for the word 'mass'. This is because the word 'mass' appears in lots of different contexts, some of which I wish to ignore and others I wish to count. I.e. I want to know when 'mass' described an abnormal structure present in the heart, but I don't want to know when 'mass' is being used to tell me the weight of the heart.

For example: "There is a 2mm mass in the RA" is valid "large irregular mass in the left ventricle" is valid "LV mass: 135g" is invalid "Normal heart mass" is invalid

For this I'm using negative lookahead to specify examples of words to ignore before 'mass'.

\b((?:(?!\bLV\s\b|\band\b|\bindexed\b\s)[\S]){2,}\smass)\b

but there are some instances where I need to look at more than just the word immediately preceding 'mass'

e.g. "cardiac mass" is valid (I need to know that there is an abnormal structure in the heart!) but "no cardiac mass" is invalid (I don't want to know that there is no abnormality!)

I have tried things like:

\b((?:(?!\bLV\s\b|\band\b|\bnormal\b|\b[Nn]o\s\b[Cc]ardiac\b|\bindexed\b\s)[\S]){2,}\smass)\b

But it won't appreciate the fact that "cardiac mass" is valid but "no cardiac mass" is invalid.

I'm basically looking to ignore more than one word before the word of interest.

The text is not stereotyped so I'm looking for a solution that I can build up by adding more and more instances to ignore as I come across words/phases preceding "mass" that I wish to exclude.

Grateful for your help


Solution

  • You can use

    \b(?!(?:LV|and|normal|(?<=\bno\s+)cardiac|indexed|heart)\b)\w+\s+mass\b
    

    See the regex demo. Details:

    • \b - word boundary
    • (?!(?:LV|and|normal|(?<=\bno\s+)cardiac|indexed|heart)\b) - a negative lookahead that matches a location that is not immediately followed with
      • (?:LV|and|normal|(?<=\bno\s+)cardiac|indexed|heart) - LV, and, normal, cardiac that is immediately preceded with whole word no and one or more whitespaces, indexed, heart`
      • \b - word boundary
    • \w+ - one or more word chars
    • \s+ - one or more whitespaces
    • mass - a mass string
    • \b - word boundary

    Note: to use this in Python, you need to open the terminal/console, run pip install regex (or pip3 install regex) and then in your code, use import regex as re.