
Regular expression (regex lookarounds) to detect a certain string not between certain strings (lookahead & lookbehind, word not surrounded by words)


I am trying to detect all occurrences of a certain string that is not surrounded by certain other strings (using regex lookarounds), e.g. all occurrences of "African" except those inside "South African Society". See a simplified example below.

#My example text:
text <- c("South African Society", "South African", 
"African Society", "South African Society and African Society")

#My code examples:
library(stringr)
str_detect(text, "(?<!South )African(?! Society)")
#or
grepl("(?<!South )African(?! Society)", text, perl = TRUE)

#I need:
[1] FALSE TRUE TRUE TRUE 

#instead of:
[1] FALSE FALSE FALSE FALSE

The problem seems to be that regex evaluates the lookbehind and the lookahead separately and not as a whole. A match should be rejected only when both conditions hold (preceded by "South " and followed by " Society"), not when just one of them does.


Solution

  • The (?<!South )African(?! Society) pattern matches African only when it is neither preceded by "South " nor followed by " Society". If either one is present, there is no match.

    There are several solutions.

     African(?<!South African(?= Society))
    

    See the regex demo. Here, African is matched first, and the negative lookbehind then inspects the text ending at that position: the match is rejected only if that text is South African and it is immediately followed by a space and Society. Placing this check after African is more efficient than placing it before the word, especially in longer strings with many non-matching positions, because the lookbehind is attempted only where African has already matched (compare the (?<!South (?=African Society))African regex demo).
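To make the behaviour of the two placements concrete, here is a minimal sketch in base R that runs both patterns on the example text from the question. The variable names `after_check` and `before_check` are only illustrative, and `perl = TRUE` is assumed so that the PCRE engine handles the nested lookarounds.

```r
# Example text from the question
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# Lookbehind placed after the word: checked only once "African" has matched
after_check <- grepl("African(?<!South African(?= Society))", text, perl = TRUE)

# Lookbehind placed before the word: same result, different check position
before_check <- grepl("(?<!South (?=African Society))African", text, perl = TRUE)

print(after_check)
# [1] FALSE  TRUE  TRUE  TRUE

# Both placements should agree on every element
print(identical(after_check, before_check))
# [1] TRUE
```

Both variants reject a match only when "South " precedes and " Society" follows at the same time, which is the combined condition the question asks for.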

    Alternatively, you may use a SKIP-FAIL technique:

    South African Society(*SKIP)(*F)|African
    

    See another regex demo. Here, the first alternative matches South African Society, and (*SKIP)(*F) then discards that match and makes the engine resume the search right after it, so the second alternative matches African in every context other than South African Society.
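As a rough illustration of the SKIP-FAIL pattern in R, the sketch below applies it to the example text from the question. (*SKIP) and (*F) are PCRE verbs, so this assumes base R functions with `perl = TRUE`; stringr's ICU engine does not support them. The names `skip_fail`, `detected`, and `counts` are hypothetical.

```r
# Example text from the question
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# Consume and discard "South African Society", otherwise match "African"
skip_fail <- "South African Society(*SKIP)(*F)|African"

# Does each string contain an allowed "African"?
detected <- grepl(skip_fail, text, perl = TRUE)
print(detected)
# [1] FALSE  TRUE  TRUE  TRUE

# How many allowed occurrences does each string contain?
counts <- lengths(regmatches(text, gregexpr(skip_fail, text, perl = TRUE)))
print(counts)
```

This form is convenient when the excluded context is a longer phrase, since the phrase is written out once instead of being split across a lookbehind and a lookahead.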