Tags: r, regex, regex-lookarounds, regex-group

Characters are still being captured and highlighted despite using non-capturing groups


I'm using Regex in R and I am trying to capture a specific measurement of cardiac wall thickness that has 1-2 digits and 0-2 decimals as in:

"maximum thickness of lv wall= 1.5"

However, I want to exclude instances where (after|myectomy|resection) appears somewhere after the word "thickness".

So I wrote the following regex code:

pattern <- "(?i)(?<=thickness)(?!(\\s{0,10}[[:alpha:]]{1,100}){0,8}\\s{0,10}(after|myectomy|resection))(?:(?:\\s{0,10}[[:alpha:]]{0,100}){0,8}\\s{0,10}[:=\\(]?)\\d{1,3}\\.?\\d{0,3}"

You can test it against this sample data frame (every measurement in this example should match, except the last one):

library(tibble)  # provides tibble()

df <- tibble(
  test = c("maximum size of thickness in base to mid of anteroseptal wall(1.7cm)",
           "(anterolateral and inferoseptal wall thickness:1.6cm)",
           "hypertrophy in apical segments maximom thickness=1.6cm with sparing of posterior wall",
           "septal thickness=1cm",
           "LV apical segments with maximal thickness 1.7 cm and dynamic",
           "septal thickness after myectomy=1cm")
)

This regex matches what I want. The problem is that I want to capture the measurements only, yet the text preceding each measurement is also being captured, even though I put it in non-capturing groups (?:...).

[Screenshot: result of stringr::str_view(df$test, pattern)]


Solution

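  • A non-capturing group (?:...) only keeps the group from being stored separately; the text it matches is still consumed and is part of the overall match, which is what str_view highlights. Move everything that precedes the number into the lookbehind instead. You can use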

    library(stringr)  # provides str_view()

    pattern <- "(?i)(?<=\\bthickness(?:\\s{1,10}(?!(?:after|myectomy|resection)\\b)[a-zA-Z]{1,100}){0,8}\\s{0,10}[:=(]?)\\d{1,3}(?:\\.\\d{1,3})?"
    str_view(df$test, pattern)
    

    Output: [screenshot of the str_view result, not reproduced here]

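    See the regex demo (the JavaScript engine in modern browsers supports unlimited-length lookbehind). The ICU engine behind stringr needs a lookbehind of bounded length, which is why the pattern uses limiting quantifiers such as {0,8} and {1,100} rather than * or +.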
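The difference between a non-capturing group and a lookbehind is easy to check on a single string. The following is a minimal sketch; the three short patterns are simplified assumptions made up for illustration, not the full pattern above, and the variable names are hypothetical.

```r
library(stringr)

x <- "septal thickness=1cm"

# Non-capturing group: nothing is stored for the group, but the text it
# matches is still part of the overall match.
whole <- str_extract(x, "(?i)thickness(?:\\s*[=:(]?)\\d+")

# Lookbehind: the prefix is only checked, not consumed, so the match is
# the number alone.
behind <- str_extract(x, "(?i)(?<=thickness[=:(]?)\\d+")

# Alternative: keep the prefix in the match, wrap the number in a
# capturing group, and read column 2 of the str_match() result.
captured <- str_match(x, "(?i)thickness\\s*[=:(]?(\\d+(?:\\.\\d+)?)")[, 2]

whole     # "thickness=1"
behind    # "1"
captured  # "1"
```

If highlighting with str_view is not needed, the capturing-group variant avoids the bounded-length restriction on lookbehinds altogether.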

    Details:

    • (?<= - start of the positive lookbehind that requires the following sequence of patterns to match immediately to the left of the current location:
      • \bthickness - whole word thickness
      • (?:\s{1,10}(?!(?:after|myectomy|resection)\b)[a-zA-Z]{1,100}){0,8} - zero to eight occurrences of
        • \s{1,10} - one to ten whitespaces
        • (?!(?:after|myectomy|resection)\b) - a negative lookahead that fails the match if the whole word after, myectomy or resection appears immediately to the right of the current location
        • [a-zA-Z]{1,100} - 1 to 100 ASCII letters
      • \s{0,10} - zero to ten whitespaces
      • [:=(]? - an optional :, = or ( char
    • ) - end of the positive lookbehind
    • \d{1,3} - one to three digits
    • (?:\.\d{1,3})? - an optional sequence of a . and then one to three digits.
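
To get the measurements as values rather than viewing them, the same pattern can be passed to str_extract. This is a sketch that assumes the sample strings from the question; the names x and thickness are hypothetical, and a plain character vector is used instead of the tibble to keep the example short.

```r
library(stringr)

# Sample strings from the question
x <- c(
  "maximum size of thickness in base to mid of anteroseptal wall(1.7cm)",
  "(anterolateral and inferoseptal wall thickness:1.6cm)",
  "hypertrophy in apical segments maximom thickness=1.6cm with sparing of posterior wall",
  "septal thickness=1cm",
  "LV apical segments with maximal thickness 1.7 cm and dynamic",
  "septal thickness after myectomy=1cm"
)

# Pattern from the answer, unchanged
pattern <- "(?i)(?<=\\bthickness(?:\\s{1,10}(?!(?:after|myectomy|resection)\\b)[a-zA-Z]{1,100}){0,8}\\s{0,10}[:=(]?)\\d{1,3}(?:\\.\\d{1,3})?"

# str_extract() returns the first match in each string, or NA when there is none
thickness <- as.numeric(str_extract(x, pattern))
thickness
```

The first five strings should give 1.7, 1.6, 1.6, 1 and 1.7, and the last one NA, because "after" directly follows "thickness" there.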