Tags: r, regex, tidyverse, regex-lookarounds

Lookaround pattern occasionally doesn't work


I'm using regex in R and working on an echocardiographic dataset. I want to detect cases where a phenomenon called "SAM" is seen, and I obviously want to exclude cases like "no SAM".

So I wrote these lines:

library(tidyverse)  # attaches stringr (regex, str_view_all) and tibble
pattern_sam <- regex("(?<!no )sam", ignore_case = TRUE)
str_view_all(echo_1_lvot$description_echo, pattern_sam, match = TRUE)

It effectively removes 99.9% of cases with "no SAM", yet for some reason I still get 3 cases of "no SAM" matched.

Now the weird thing is that if I simply copy and paste these strings into a new dataset, the problem goes away:

sam_test <- tibble(description_echo = c(
  "There is asymmetric septal hypertrophy severe LVH in all myocardial segements with spared basal posterior wall with asymmetric septal hypertrophy(anteroseptal thickness=2cm,PWD=0.94cm),no SAM nor LVOT obstruction which is compatible with type III HCM",
  "-Normal LV size with mild to moderate systolic dysfunction,EF=45%,severe LVH in all myocardial segements with spared basal posterior wall with asymmetric septal hypertrophy(anteroseptal thickness=2cm,PWD=0.94cm),no SAM nor LVOT obstruction which is compa"
))

str_view_all(sam_test$description_echo, pattern_sam)

The same thing happens when I try to detect other patterns.

Does anyone have any idea what the underlying problem is and how it can be fixed?

P.S.: Here is the .xls file (I only included the problematic string), if you want to see for yourself.

The funny thing is that when I manually remove the "No SAM" from the .xls and retype it in the exact same place, the problem goes away. I still have no idea what is wrong. Could it be the text format?


Solution

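  • The character between "no" and "SAM" in those cells is probably not a regular ASCII space (for example, it could be a non-breaking space, U+00A0), which would explain why retyping the text makes the problem go away. You can match any whitespace, including Unicode whitespace, with \s, since you are using the ICU regex flavor (it is used by all stringr/stringi regex functions):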

    pattern_sam <- regex("(?<!no\\s)sam", ignore_case = TRUE)
    

    To match any non-word character, including some non-printable ones, use:

    regex("(?<!no\\W)sam", ignore_case = TRUE)
    

    If there can be several such characters between the two words, you can use a constrained-width lookbehind (supported in ICU and Java):

    pattern_sam <- regex("(?<!no\\s{1,10})sam", ignore_case = TRUE)
    pattern_sam <- regex("(?<!no\\W{1,10})sam", ignore_case = TRUE)
    

    Here, 1 to 10 whitespace (or non-word) characters may appear between no and sam.

    And if you need to match whole words only, add \b (word boundary) on both sides of sam and before no:

    pattern_sam <- regex("(?<!\\bno\\s{1,10})\\bsam\\b", ignore_case = TRUE)
    pattern_sam <- regex("(?<!\\bno\\W{1,10})\\bsam\\b", ignore_case = TRUE)
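
To check whether this is really what is going on, here is a minimal sketch. It assumes the problematic cells contain a non-breaking space (U+00A0) between "no" and "SAM"; that is a guess, since the actual character in the .xls file may be a different one. The strings in `x` are made-up stand-ins for the real data, and the names `literal_space`, `any_space`, `code_points` and `x_clean` are only used for illustration.

```r
library(stringr)

# Hypothetical test strings. The second one uses a non-breaking space (U+00A0)
# between "no" and "SAM", which looks just like a regular space on screen.
x <- c(
  regular = "hypertrophy, no SAM nor LVOT obstruction",
  nbsp    = "hypertrophy, no\u00a0SAM nor LVOT obstruction",
  present = "SAM of the anterior mitral leaflet"
)

# Capture the single character between "no" and "sam" and look at its code
# point: 32 is a plain space, 160 is U+00A0, NA means "no ... sam" was not found.
between <- str_match(x, regex("no(.)sam", ignore_case = TRUE))[, 2]
code_points <- vapply(
  between,
  function(ch) if (is.na(ch)) NA_integer_ else utf8ToInt(ch),
  integer(1),
  USE.NAMES = FALSE
)
code_points

# The original pattern only rules out a literal ASCII space after "no".
literal_space <- regex("(?<!no )sam", ignore_case = TRUE)
# With \s, the lookbehind also rules out Unicode whitespace such as U+00A0.
any_space <- regex("(?<!no\\s)sam", ignore_case = TRUE)

str_detect(x, literal_space)  # the "nbsp" string is still (wrongly) detected
str_detect(x, any_space)      # only the "present" string is detected

# Alternative: normalise every whitespace character to a plain space first,
# after which the original literal-space pattern works as intended.
x_clean <- str_replace_all(x, "\\s", " ")
str_detect(x_clean, literal_space)
```

If `code_points` shows something other than 32 for your three stubborn rows, that confirms the hidden-character explanation. Normalising the column once up front is handy when you have many patterns to run, since you do not have to remember to write \s in each of them.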