What I did
I wrote a regex that matches all text strings containing "A" and "BV" with 0-10 words between them, using this tutorial: https://www.regular-expressions.info/near.html
df <- data.frame(text = c("ART 6 dasd asd NOT art 2 BV",
                          "NOT ART 6 ds as dd BV",
                          "ART 6 NO BV"),
                 id = c(1, 2, 3))
subset(df, grepl("(ART)(?:\\W+\\w+){0,10}?\\W+(\\bBV\\b)",
                 text,
                 perl = TRUE,
                 ignore.case = TRUE))
                         text id
1 ART 6 dasd asd NOT art 2 BV  1
2       NOT ART 6 ds as dd BV  2
3                 ART 6 NO BV  3
What I am trying to get
Now I would like to rewrite the regex so that it does not match if any word from a list (NOT and NO in the example data) occurs among the 0-10 words between "A" and "BV".
The result should look like this:
subset(df, grepl("NEWREGEX",
                 text,
                 perl = TRUE,
                 ignore.case = TRUE))
                   text id
1 NOT ART 6 ds as dd BV  2
I think I could use something like a negative lookahead (?!...), but I could not figure it out.
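For reference, the negative-lookahead idea can be sketched by putting the lookahead in front of every intermediate word, so that a word from the stop list ends the match attempt. This is only a sketch under two assumptions that are not in the original regex: the stop list is hard-coded as NOT|NO, and matching is case-sensitive with word boundaries around ART. With ignore.case = TRUE, the lowercase "art 2 BV" in the first row would start a fresh match after NOT, so that row would still be returned.

```r
df <- data.frame(
  text = c("ART 6 dasd asd NOT art 2 BV",
           "NOT ART 6 ds as dd BV",
           "ART 6 NO BV"),
  id = c(1, 2, 3)
)

# (?!(?:NOT|NO)\b) is checked at the start of every word between ART and BV.
# If that word is exactly NOT or NO, the attempt starting at this ART fails.
# Assumptions: case-sensitive matching, stop list hard-coded for illustration.
pattern <- "\\bART\\b(?:\\W+(?!(?:NOT|NO)\\b)\\w+){0,10}?\\W+BV\\b"

hits <- grepl(pattern, df$text, perl = TRUE)
result <- subset(df, hits)
```

On the example data this keeps only the row with id 2. Because of the trailing \b inside the lookahead, longer words such as NOTE are not treated as stop words.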
Thanks to akrun we have a really nice solution:
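library(stringr)
# Extract the span from the first word starting with "A" to the last "BV",
# flag spans containing NO or NOT, negate, and keep the remaining rows.
str_extract(df$text, "(A\\w+\\b.*\\bBV\\b)") %>%
  str_detect("NOT?") %>%
  `!` %>%
  magrittr::extract(df, ., )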
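One caveat: the pattern "NOT?" also matches inside longer words such as NOTE or NORTH. A minimal base R sketch of the same two-step idea, with word boundaries around the stop words, might look like the following. The variable names span and has_stop are made up for illustration, and like the pipeline above this sketch does not enforce the 0-10 word limit.

```r
df <- data.frame(
  text = c("ART 6 dasd asd NOT art 2 BV",
           "NOT ART 6 ds as dd BV",
           "ART 6 NO BV"),
  id = c(1, 2, 3)
)

# Step 1: extract the span from the first word starting with "A"
# up to the last "BV" (NA where there is no such span).
m <- regexpr("A\\w+\\b.*\\bBV\\b", df$text, perl = TRUE)
span <- rep(NA_character_, nrow(df))
span[m > 0] <- regmatches(df$text, m)

# Step 2: flag spans that contain NO or NOT as whole words.
has_stop <- grepl("\\bNOT?\\b", span, perl = TRUE)

# Keep rows that have a span and no stop word inside it.
result <- df[!is.na(span) & !has_stop, ]
```

On the example data, result contains only the row with id 2.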