Negative lookahead in strsplit puzzling behaviour


I am confused by a simple lookahead behaviour in strsplit in R v3.6.2: when I try to match a space (' ') not followed by a forward slash (/), the regex behaves oddly.

The attempt below correctly does not consume the forward slash, but it still splits at the space after it. The output is the same with the patterns ' (?!/ )' and ' (?!/ *)', and also with the other wildcards . and ?.

strsplit(c("foo1 foo2", "foo1 / foo2", "foo1/foo2"), ' (?!/)', perl = TRUE)
[[1]]
[1] "foo1" "foo2"

[[2]]
[1] "foo1 /" "foo2"  

[[3]]
[1] "foo1/foo2"

This is all the more confusing because if I negate a positive lookahead, strsplit simply won't split anything. This persists with different patterns as above.

strsplit(c("foo1 foo2", "foo1 / foo2", "foo1/foo2"), ' ^(?=/)', perl = TRUE)
[[1]]
[1] "foo1 foo2"

[[2]]
[1] "foo1 / foo2"

[[3]]
[1] "foo1/foo2"

Escaping the forward-slash (that shouldn't be a special character anyways) yields the same results.

The desired output should look like this:

[[1]]
[1] "foo1" "foo2"

[[2]]
[1] "foo1 / foo2"  

[[3]]
[1] "foo1/foo2"

Apologies if this is very basic, but I couldn't find an explanation for this specific behaviour.


Solution

  • Your original regex does not work because the space after / is still matched. ' (?!/)' matches any space that is not directly followed by a /, and it does not check whether that space is preceded by a /.

    You might try (?<!/) (?!/), but this will still match spaces that come before ' /' or after '/ ', for example when several spaces surround the /.
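
A minimal sketch of where the two lookaround patterns actually match may make this easier to see. It uses base R's gregexpr() on the whole string instead of strsplit(); the two-space input `wide` and the helper name `match_starts` are illustrative assumptions, not part of the question.

```r
# Sample inputs: the one from the question and a hypothetical variant
# with two spaces on each side of the slash.
narrow <- "foo1 / foo2"
wide   <- "foo1  /  foo2"

# Helper (illustrative name): 1-based start positions of all matches,
# or -1 when the pattern does not match at all.
match_starts <- function(pattern, x) {
  as.integer(gregexpr(pattern, x, perl = TRUE)[[1]])
}

# Lookahead only: the space after "/" still matches.
ahead_narrow <- match_starts(" (?!/)", narrow)

# Lookbehind plus lookahead: no match when single spaces hug the "/" ...
both_narrow <- match_starts("(?<!/) (?!/)", narrow)

# ... but the outer spaces match once there is extra whitespace.
both_wide <- match_starts("(?<!/) (?!/)", wide)

print(ahead_narrow)  # [1] 7
print(both_narrow)   # [1] -1
print(both_wide)     # [1] 5 9
```

Position 7 is the space after the slash, which is why the original pattern splits "foo1 / foo2" into "foo1 /" and "foo2".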

    To match one or more whitespace characters, except where the whitespace surrounds a / character, you may use

    strsplit(c("foo1 foo2", "foo1 / foo2", "foo1/foo2"), '\\s*/\\s*(*SKIP)(*F)|\\s+', perl=TRUE)
    

    The \s*/\s*(*SKIP)(*F)|\s+ pattern matches

    • \s*/\s*(*SKIP)(*F) - consumes 0+ whitespaces, a /, and then 0+ whitespaces; (*SKIP)(*F) then discards the match and resumes the search after the consumed text
    • | - or
    • \s+ - consumes 1+ whitespaces
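
As a quick check of the pattern described above, the sketch below runs it on the question's inputs plus two extra strings. The extra strings and the variable names are illustrative assumptions, not part of the original question.

```r
# Pattern from the answer: skip any "/" together with its surrounding
# whitespace, otherwise split on runs of whitespace.
skip_slash <- "\\s*/\\s*(*SKIP)(*F)|\\s+"

# Inputs from the question.
x <- c("foo1 foo2", "foo1 / foo2", "foo1/foo2")
res <- strsplit(x, skip_slash, perl = TRUE)
print(res)

# Hypothetical extra inputs: wider spacing around the slash, and a
# string that mixes plain spaces with a spaced slash.
extra <- c("foo1  /  foo2", "a b / c d")
res_extra <- strsplit(extra, skip_slash, perl = TRUE)
print(res_extra)
```

With these inputs, every string that contains a slash keeps its slash and the surrounding whitespace in one piece, while "a b / c d" is split only at the plain spaces, giving "a", "b / c" and "d".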