
Regex includes Lookahead strings in selection


I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (diastolic dysfunction) from a large number of echo reports.

Here is the link to the sample Excel file with 2 of those echo reports.

The lines are usually expressed like this: "Mild LV diastolic dysfunction" or "Mild diastolic dysfunction". Here, "Mild" is what I want to extract.

I wrote the following pattern:

pattern <- regex("(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
                               ignore_case = FALSE)

Now, let's look at the results (remember I want the "Mild" part not the "LV" part):

str_view_all(df$echo, pattern)

As you can see, in strings like "Mild diastolic dysfunction" the pattern correctly selects "Mild", but in "Mild LV diastolic dysfunction" it selects "LV", even though I put the lv inside the positive lookahead, as (?= (lv )?...).

Does anyone know what I'm doing wrong?


Solution

  • The problem is that \w+ matches any one or more word chars, and the lookahead does not consume the chars it matches (the regex index remains where it was).

    So, LV gets matched with \w+ because diastolic dysfunction comes right after it, and (lv )? is an optional group (there may be no lv plus space right before diastolic dysfunction), so the lookahead still succeeds when \w+ itself matches LV.

    If you do not want to match LV, add a negative lookahead to restrict what \w+ matches:

    \b(?!lv\b)\w+\b(?=(?:\s+lv)?\s+d(?:[iy]a|i)stolic d[yi]sfunction)
    

    See the regex demo

    Also, note that [iy] is a better way to write (i|y).

    In R, you may define it as

    pattern <- regex(
        "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)",
        ignore_case = TRUE  # needed so the lowercase lv in the pattern also matches "LV"
    )
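
To see the difference in practice, here is a minimal sketch that runs the question's pattern and the restricted pattern side by side. The `reports` vector is made-up sample data standing in for `df$echo`, and the names `original`, `fixed`, `before_fix` and `after_fix` are only illustrative. It assumes the stringr package is installed, and it assumes case-insensitive matching for the restricted pattern, since a lowercase `lv` would otherwise not exclude "LV".

```r
library(stringr)

# Hypothetical sample lines; the real reports live in df$echo.
reports <- c(
  "Mild LV diastolic dysfunction",
  "Mild diastolic dysfunction",
  "Severe dyastolic disfunction",
  "Normal LV systolic function"
)

# Pattern from the question: nothing stops \w+ from matching "LV" itself.
original <- regex(
  "(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
  ignore_case = FALSE
)

# Restricted pattern: (?!lv\b) forbids "lv" as the captured word.
# ignore_case = TRUE is an assumption so that "lv" also covers "LV".
fixed <- regex(
  "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)",
  ignore_case = TRUE
)

# str_extract() returns the first match per string, or NA if there is none.
before_fix <- str_extract(reports, original)
after_fix  <- str_extract(reports, fixed)

print(data.frame(reports, before_fix, after_fix))
```

On this sample, the first report should give "LV" with the original pattern and "Mild" with the restricted one, while the last report, which does not mention diastolic dysfunction, gives NA in both columns.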