I am trying to detect all occurrences of a certain string that is not surrounded by certain other strings (using regex lookarounds), e.g. all occurrences of "African" except those inside "South African Society". See a simplified example below.
#My example text:
text <- c("South African Society", "South African",
"African Society", "South African Society and African Society")
#My code examples:
library(stringr)
str_detect(text, "(?<!South )African(?! Society)")
#or
grepl("(?<!South )African(?! Society)", text, perl = TRUE)
#I need:
[1] FALSE TRUE TRUE TRUE
#instead of:
[1] FALSE FALSE FALSE FALSE
The problem seems to be that regex evaluates the lookbehind and the lookahead separately and not as a whole. A match should be rejected only when both conditions hold, not when just one of them does.
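The (?<!South )African(?! Society) pattern matches African only when it is neither preceded by "South " nor followed by " Society". The two lookarounds are independent checks that must both pass, so if either "South " comes before African or " Society" comes after it, there is no match.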
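To see why every element comes back FALSE, the two lookarounds can be tested one at a time. This is only a small sketch in base R using the question's example vector; the variable names (no_south, no_society, both) are made up for illustration.

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# Lookbehind only: African not preceded by "South "
no_south <- grepl("(?<!South )African", text, perl = TRUE)
no_south
# [1] FALSE FALSE  TRUE  TRUE

# Lookahead only: African not followed by " Society"
no_society <- grepl("African(?! Society)", text, perl = TRUE)
no_society
# [1] FALSE  TRUE FALSE FALSE

# Both lookarounds in one pattern: each occurrence of African must pass
# both checks, and no occurrence in the example does
both <- grepl("(?<!South )African(?! Society)", text, perl = TRUE)
both
# [1] FALSE FALSE FALSE FALSE
```

Every occurrence of African in the sample fails at least one of the two checks, which is why the combined pattern never matches.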
There are several solutions.
African(?<!South African(?= Society))
See the regex demo. Here, African is matched first, and the negative lookbehind then rejects the match only if the text ending at that position is South African and is immediately followed by a space and Society. Placing the check after African is more efficient on longer non-matching strings than placing it before the word, because the lookbehind is then only evaluated where African has already matched rather than at every position (see the (?<!South (?=African Society))African regex demo).
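Applied to the vector from the question, this could look like the sketch below. It uses base R grepl with perl = TRUE (the PCRE engine); whether the same pattern behaves identically in stringr's ICU engine is not checked here. The names pattern_after, pattern_before, res_after and res_before are illustrative only.

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# Check placed after the word: the lookbehind runs only once African has matched
pattern_after <- "African(?<!South African(?= Society))"
res_after <- grepl(pattern_after, text, perl = TRUE)
res_after
# [1] FALSE  TRUE  TRUE  TRUE

# Check placed before the word: same result on this sample
pattern_before <- "(?<!South (?=African Society))African"
res_before <- grepl(pattern_before, text, perl = TRUE)
identical(res_after, res_before)
# [1] TRUE
```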
Alternatively, you may use a SKIP-FAIL technique:
South African Society(*SKIP)(*F)|African
See another regex demo. Here, South African Society is tried first; when it matches, (*SKIP)(*F) makes that match fail and resumes the search right after it, so African is matched in all contexts other than South African Society.
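A minimal sketch of the SKIP-FAIL variant in base R follows. (*SKIP) and (*F) are PCRE verbs, so this assumes perl = TRUE; they are not expected to work with str_detect, which uses the ICU engine. The names skip_fail, detected and n_matches are illustrative.

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

skip_fail <- "South African Society(*SKIP)(*F)|African"

# Does each element contain an African outside "South African Society"?
detected <- grepl(skip_fail, text, perl = TRUE)
detected
# [1] FALSE  TRUE  TRUE  TRUE

# Number of surviving matches per element
n_matches <- lengths(regmatches(text, gregexpr(skip_fail, text, perl = TRUE)))
n_matches
# [1] 0 1 1 1
```

In the last element only the second African survives, since the first one is consumed and discarded as part of South African Society.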