r regex nlp

Regular Expression Question - Two Negative Look behinds in the same expression


I have been working on the following problem for a few hours. I am trying to build the following RegEx:

I want to be able to extract the word reduced from sentences but not if the word is preceded by a negative expression.

For example

Sentence                                     | Output
1. lv function is reduced                    | reduced
2. lv function is not reduced                | -
3. reduced lv function                       | reduced
4. no evidence of reduced lv function        | -

Right now, I have a working RegEx for cases 3 and 4, where the adjective precedes the noun of interest, using a negative lookbehind.

However, for cases 1 and 2, the negative lookbehind does not work.

Here are sentences and the current RegEx to test :

((?<!((no|not|none)(?:\D*?)))(reduced|depressed|normal)(?:\D*?))?(?:lv function|lv|systolic function|left ventricular ejection fraction)(((?:.*\bnot\b)(\D*))(reduced|depressed|normal))?

Sentences :
lv function is reduced   
lv function is not reduced 
reduced lv function
no evidence of reduced lv function   

Alternatively here is a link : regexr.com/4tc61

Also, I am ultimately going to be working in R.

Thank you all.


Solution

  • The regex solution is quite complex, so use it only if you understand it well. I will try to explain it as clearly as I can.

    Q: How do I match something that is not preceded with a string of unknown length if my lookbehinds do not support such patterns?
    A: Match what you do not need, skip the matched texts, and go on matching from the position where the match failed.

    You can do this with a PCRE regex, which supports the (*SKIP)(*FAIL) construct (or its shorter form, (*SKIP)(*F)).
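
    To see the idea in isolation, here is a minimal sketch on a toy pattern. The sample strings and the names s, pat and hits are illustrative assumptions, not part of the question. The left alternative consumes the unwanted "not reduced" and discards it with (*SKIP)(*F); the right alternative then matches any "reduced" that is left:

```r
# Toy input: we want "reduced" only where it is not preceded by "not".
s <- c("is reduced", "is not reduced", "not reduced, later reduced")

# Left alternative: match the text we do NOT want, then skip it and fail.
# Right alternative: match what we actually want.
pat <- "\\bnot\\s+reduced\\b(*SKIP)(*F)|\\breduced\\b"

# perl = TRUE is required: the verbs are PCRE-only.
hits <- regmatches(s, gregexpr(pat, s, perl = TRUE))

# Number of surviving matches per string
lengths(hits)
# [1] 1 0 1
```

    In the third string the first "reduced" is skipped together with its "not", while the later, non-negated one is still found.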

    Now, look at the pattern:

    (?:\b(?:no|not|none)\b\D*?\b(?:reduced|depressed|normal)\b\D*?\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b|\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b\D*?\bnot\b\D*?\b(?:reduced|depressed|normal)\b)(*SKIP)(*F)|(?:\b(reduced|depressed|normal)\b\D*?)?\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b(?:\D*?\b(reduced|depressed|normal)\b)?
    

    It looks unwieldy, but let's go through the constituents:

    • (?: - start of a non-capturing group that serves as a container; (*SKIP)(*F) will be applied to all alternatives inside it:
      • \b(?:no|not|none)\b\D*?\b(?:reduced|depressed|normal)\b\D*?\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b:
        • \b(?:no|not|none)\b - any of the words inside the non-capturing group as whole words
        • \D*? - 0+ non-digit chars, as few as possible
        • \b(?:reduced|depressed|normal)\b - any of the words inside the non-capturing group as whole words
        • \D*? - 0+ non-digit chars, as few as possible
        • \b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b - any of the texts inside the non-capturing group as whole words
      • | - or
      • \b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b\D*?\bnot\b\D*?\b(?:reduced|depressed|normal)\b:
        • \b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b - any of the texts inside the non-capturing group as whole words
        • \D*?\bnot\b\D*? - 0+ non-digits as few as possible, whole word not, 0+ non-digits as few as possible
        • \b(?:reduced|depressed|normal)\b - any of the texts inside the non-capturing group as whole words
    • )(*SKIP)(*F) - end of the container group, followed by the PCRE verbs that fail the current match and discard the text matched so far, so the regex engine resumes searching at the position where the failure was triggered
    • | - or (that is, now, really match what we need with the next alternative)
    • (?:\b(reduced|depressed|normal)\b\D*?)?\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b(?:\D*?\b(reduced|depressed|normal)\b)?:
      • (?:\b(reduced|depressed|normal)\b\D*?)? - an optional non-capturing group matching reduced, depressed or normal as a whole word, captured into Group 1 (this is the word we need to extract!), and then any 0+ non-digit chars, as few as possible
      • \b(?:lv function|lv|systolic function|left ventricular ejection fraction)\b - any of the texts inside the non-capturing group as whole words
      • (?:\D*?\b(reduced|depressed|normal)\b)? - an optional non-capturing group matching any 0+ non-digit chars, as few as possible, and then capturing reduced, depressed or normal as a whole word into Group 2.
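
    To see why the skipped branch is needed, one can run the last alternative on its own. This is only a sketch: the names target, match_only and found are introduced here for illustration. Without the (*SKIP)(*F) branch, the word is extracted from the negated sentences as well:

```r
x <- c("lv function is reduced", "lv function is not reduced",
       "reduced lv function", "no evidence of reduced lv function")

cap <- "reduced|depressed|normal"
target <- "\\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\\b"

# Only the final (capturing) alternative, with the (*SKIP)(*F) branch left out
match_only <- paste0("(?:\\b(", cap, ")\\b\\D*?)?", target,
                     "(?:\\D*?\\b(", cap, ")\\b)?")

m <- regmatches(x, regexec(match_only, x, perl = TRUE))

# Join Group 1 and Group 2; the group that did not participate is empty
found <- vapply(m, function(g) paste(g[-1], collapse = ""), character(1))
found
# [1] "reduced" "reduced" "reduced" "reduced"
```

    Sentences 2 and 4 are wrongly reported here, which is exactly what the first, skipped alternative prevents.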

    There are many repeated parts, so it makes sense to build the pattern from variables:

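    # Test sentences from the question
    x <- c("lv function is reduced", "lv function is not reduced", "reduced lv function", "no evidence of reduced lv function")
    
    # Words to extract
    cap <- "reduced|depressed|normal"
    # Opening of the skipped branch: a negation word followed by one of the words above
    negate_prefix <- paste0("(?:\\b(?:no|not|none)\\b\\D*?\\b(?:",cap,")\\b\\D*?")
    # Nouns of interest
    match <- "\\b(?:lv function|lv|systolic function|left ventricular ejection fraction)\\b"
    # Skipped alternatives, then (*SKIP)(*F), then the alternative that captures the word
    regex <- paste0(negate_prefix,
       match, "|", match, "\\D*?\\bnot\\b\\D*?\\b(?:",cap,")\\b)(*SKIP)(*F)|(?:\\b(",cap,")\\b\\D*?)?",match,"(?:\\D*?\\b(",cap,")\\b)?")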
    

    So all we need are the captured substrings, as this R demo shows:

    results <- regmatches(x, regexec(regex, x, perl=TRUE))
    unlist(lapply(results, function(x) paste(x[-1], collapse="")))
    ## => [1] "reduced" ""        "reduced" ""
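
    If the word lists are likely to change, the same pattern could be wrapped in a small function. The helper below is a hypothetical sketch: extract_status and its argument names are not from the original answer, and it assumes the terms are plain words with no regex metacharacters. It assembles the same regex from term vectors and returns NA where nothing is extracted:

```r
# Hypothetical helper: builds the pattern above from term vectors.
extract_status <- function(sentences,
                           status   = c("reduced", "depressed", "normal"),
                           targets  = c("lv function", "lv", "systolic function",
                                        "left ventricular ejection fraction"),
                           negators = c("no", "not", "none")) {
  cap <- paste(status, collapse = "|")
  tgt <- paste0("\\b(?:", paste(targets, collapse = "|"), ")\\b")
  neg <- paste0("\\b(?:", paste(negators, collapse = "|"), ")\\b")

  pattern <- paste0(
    # Branch to skip: negator ... status ... target, or target ... not ... status
    "(?:", neg, "\\D*?\\b(?:", cap, ")\\b\\D*?", tgt,
    "|", tgt, "\\D*?\\bnot\\b\\D*?\\b(?:", cap, ")\\b)(*SKIP)(*F)",
    # Branch to keep: status captured before (Group 1) or after (Group 2) the target
    "|(?:\\b(", cap, ")\\b\\D*?)?", tgt, "(?:\\D*?\\b(", cap, ")\\b)?"
  )

  m <- regmatches(sentences, regexec(pattern, sentences, perl = TRUE))
  out <- vapply(m, function(g) paste(g[-1], collapse = ""), character(1))
  out[!nzchar(out)] <- NA_character_  # nothing captured: report NA
  out
}

extract_status(c("systolic function is not depressed",
                 "normal left ventricular ejection fraction"))
# [1] NA       "normal"
```

    The order of targets matters: longer phrases such as "lv function" should come before their prefixes such as "lv", as in the original alternation.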