Tags: r, regex, grepl

R regex: match strings with two words near each other with exception


What I did

I wrote a regex that matches all text strings containing "ART" and "BV" with 0-10 words between them, following this tutorial: https://www.regular-expressions.info/near.html

df<- data.frame(text=c("ART 6 dasd asd NOT art 2 BV","NOT ART 6 ds as dd BV","ART 6 NO BV"),
                id=c(1,2,3))



subset(df, grepl("(ART)(?:\\W+\\w+){0,10}?\\W+(\\bBV\\b)",
                   perl=TRUE,
                   ignore.case = TRUE,
                   text))


                         text id
1 ART 6 dasd asd NOT art 2 BV  1
2       NOT ART 6 ds as dd BV  2
3                 ART 6 NO BV  3

What I am trying to get

Now I would like to rewrite the regex so that it does not match if any word from a list (e.g. NOT and NO in the example data) occurs among the 0-10 words between "ART" and "BV".

So the result would look like:

subset(df, grepl("NEWREGEX",
                   perl=TRUE,
                   ignore.case = TRUE,
                   text))


                   text id
2 NOT ART 6 ds as dd BV  2

I think I could use something like a negative lookahead, (?!...), but I could not figure it out.


Solution

  • Thanks to akrun we have a really nice solution:

    library(stringr)
    # Extract the "A... BV" span, flag spans containing NO/NOT, keep the rest.
    str_extract(df$text, "(A\\w+\\b.*\\bBV\\b)") %>%
      str_detect("NOT?") %>% `!` %>%
      magrittr::extract(df, ., )
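
  • For a single-regex variant along the lines of the ?! idea from the question, here is a minimal sketch, not a tested general solution. It guards each of the 0-10 in-between words with a negative lookahead. Two assumptions that are not in the original: the `blocked` vector is just an illustrative name for the word list, and matching is case-sensitive (ignore.case is dropped), because otherwise the lowercase "art 2 BV" at the end of row 1 would match on its own.

```r
# Sample data from the question.
df <- data.frame(
  text = c("ART 6 dasd asd NOT art 2 BV",
           "NOT ART 6 ds as dd BV",
           "ART 6 NO BV"),
  id = c(1, 2, 3)
)

# Words that must not appear between ART and BV (hypothetical name).
blocked <- c("NOT", "NO")

# ART, then 0-10 words, none of which may be a blocked word, then BV.
# The lookahead (?!(?:NOT|NO)\b) is checked at the start of every
# in-between word, so a blocked word stops the match from that ART.
pattern <- paste0(
  "\\bART\\b",
  "(?:\\W+(?!(?:", paste(blocked, collapse = "|"), ")\\b)\\w+){0,10}?",
  "\\W+BV\\b"
)

# Case-sensitive on purpose (assumption, see the note above).
result <- subset(df, grepl(pattern, text, perl = TRUE))
print(result)
```

    With the sample data this should keep only the row with id 2. Unlike the str_extract approach, it also enforces the 0-10 word limit and treats NOT and NO as whole words.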