I am confused by a simple lookahead behaviour in strsplit in R v3.6.2: when I try to match a space ( ) not followed by a forward slash (/), the regex behaves oddly.
The attempt below correctly leaves the forward slash unconsumed, but it still splits at the space after it. The output is the same with the patterns ' (?!/ )' and ' (?!/ *)', and also with the other wildcards . and ?.
strsplit(c("foo1 foo2", "foo1 / foo2", "foo1/foo2"), ' (?!/)', perl = T)
[[1]]
[1] "foo1" "foo2"
[[2]]
[1] "foo1 /" "foo2"
[[3]]
[1] "foo1/foo2"
This is all the more confusing because if I negate a positive lookahead, strsplit simply won't split anything. This persists with different patterns as above.
strsplit(c("foo1 foo2", "foo1 / foo2", "foo1/foo2"), ' ^(?=/)', perl = T)
[[1]]
[1] "foo1 foo2"
[[2]]
[1] "foo1 / foo2"
[[3]]
[1] "foo1/foo2"
Escaping the forward-slash (that shouldn't be a special character anyways) yields the same results.
The desired output should look like this:
[[1]]
[1] "foo1" "foo2"
[[2]]
[1] "foo1 / foo2"
[[3]]
[1] "foo1/foo2"
Apologies if this is very basic, but I couldn't find an explanation for this specific behaviour.
Your original regex does not work because the space after / is still matched: ' (?!/)' matches any space that is not directly followed by a /, and it does not check whether the space is preceded by a /.
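To see which spaces the original pattern actually matches, one can list the match positions with base R's gregexpr. This is only an illustrative sketch; the variable names x and hits are made up for the example.

```r
# Illustrative check: which characters does ' (?!/)' match in each input?
x <- c("foo1 foo2", "foo1 / foo2", "foo1/foo2")

# gregexpr() returns, per string, the 1-based start positions of all matches
# (or -1 when there is no match); as.integer() drops the extra attributes.
hits <- lapply(gregexpr(" (?!/)", x, perl = TRUE), as.integer)
names(hits) <- x
print(hits)
```

In the second string the only hit should be position 7, the space after the /, which is why the split happens there.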
You might try '(?<!/) (?!/)' (see this regex demo), but this will still match spaces that are not directly adjacent to the /, such as the first of two spaces before a / or the second of two spaces after it.
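A quick way to try the lookbehind plus lookahead variant is sketched below. The input with doubled spaces is a hypothetical addition, not part of the question; it is there only to show where this approach stops being reliable.

```r
x <- c("foo1 foo2", "foo1 / foo2", "foo1/foo2")

# Split on a space that is neither preceded nor followed by "/".
res_single <- strsplit(x, "(?<!/) (?!/)", perl = TRUE)
print(res_single)

# Hypothetical input with two spaces on each side of the slash:
# the outer spaces do not touch "/", so both lookarounds let them match.
res_double <- strsplit("foo1  /  foo2", "(?<!/) (?!/)", perl = TRUE)
print(res_double)
```

The three inputs from the question come out as desired, while the doubled-space input is still split around the slash.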
In order to match any 1+ whitespace chars but the cases when the whitespaces enclose a /
char, you may use
strsplit(c("foo1 foo2", "foo1 / foo2", "foo1/foo2"), '\\s*/\\s*(*SKIP)(*F)|\\s+', perl=TRUE)
The \s*/\s*(*SKIP)(*F)|\s+ pattern (see its online demo) matches:
\s*/\s*(*SKIP)(*F) - consumes 0+ whitespaces, a /, and then 0+ whitespaces, and discards the match
| - or
\s+ - consumes 1+ whitespaces
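As a sanity check, the pattern can be wrapped in a small helper and compared against the desired output from the question. The helper name split_outside_slash and the extra test string are assumptions made for this sketch only.

```r
# Hypothetical helper (name invented for this example): split on whitespace,
# except whitespace that surrounds a "/".
split_outside_slash <- function(x) {
  strsplit(x, "\\s*/\\s*(*SKIP)(*F)|\\s+", perl = TRUE)
}

x <- c("foo1 foo2", "foo1 / foo2", "foo1/foo2")
desired <- list(c("foo1", "foo2"), "foo1 / foo2", "foo1/foo2")

res <- split_outside_slash(x)
print(identical(res, desired))

# Extra whitespace around the slash is skipped as a whole,
# while the ordinary space before "bar" still splits.
res_wide <- split_outside_slash("foo1  /  foo2 bar")
print(res_wide)
```

The comparison should print TRUE for the three inputs from the question.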