Say I want to split a string at every occurrence of a separator character, except escaped ones. I can usually do this with a negative lookbehind and string.split(regex).
For example:
const regex = /(?<!\\)\,/;
'abc,def'.split(regex);
'abc\\,def'.split(regex);
This splits at the comma in abc,def but not at the escaped comma in abc\,def. This is fine!
But if the separator character is itself a backslash, the negative lookbehind does not seem to work as expected:
const regex = /(?<!\\)\\/;
'abc\\def'.split(regex);
'abc\\\\def'.split(regex);
This splits at the first \ both in abc\def AND in abc\\def.
Naively, I would have expected that the negative lookbehind would not match a \ preceded by a \.
See: https://regex101.com/r/ozkZR1/1
How can I split a string with string.split(regex) at any non-escaped separator character, in a way that doesn't fall apart when the separator is a special character like a backslash or a line break (one should be able to escape those too)?
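For reference, here is a small sketch that reproduces the behaviour described above (the variable names are only illustrative). The lookbehind inspects just the one character before each candidate backslash. In abc\\def the first backslash is preceded by c, so it still matches, and only the second backslash is rejected:

```javascript
const backslashSplit = /(?<!\\)\\/;

// 'abc\\def' is the 7-character string abc\def
const single = 'abc\\def'.split(backslashSplit);
// 'abc\\\\def' is the 8-character string abc\\def
const double = 'abc\\\\def'.split(backslashSplit);

console.log(single); // [ 'abc', 'def' ]
console.log(double); // [ 'abc', '\\def' ]  (split at the FIRST backslash of the pair)
```

So the regex has no way of knowing that the first backslash of the pair is meant as an escape rather than as a separator.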
The solution was to reverse the operation: instead of looking for the delimiters, I look for the delimited character sequences. So in the case of a , delimiter I would look for: ((\\,)|[^,])([^,]*?(\\,)?)*
That is: either an escaped comma or a non-comma character, followed by any number (possibly zero) of groups, each consisting of a run of non-commas (reluctant, so it doesn't swallow the \ of an escape) followed by an optional escaped comma.
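let separator = ','; // get from sanitized input
// a backslash separator must itself be escaped inside the RegExp source
separator = separator === '\\' ? '\\\\' : separator;
const groups = new RegExp(`((\\\\${separator})|[^${separator}])([^${separator}]*?(\\\\${separator})?)+`, 'g');
// line is the input string to split; match() returns null if nothing matches
let columns = line.match(groups);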
This works for , as well as for \ as separators and will not split on \, and \\ respectively.
The hardest part of that expression was to get all the escapes right.
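To make the escaping a bit easier to check, here is a minimal sketch that wraps the same expression in a function. The names splitUnescaped and escapeForRegExp are hypothetical helpers, not part of any library, and the generic escaping step is an assumption of mine that generalizes the backslash special case so that separators such as . or | are also treated literally:

```javascript
// Hypothetical helper: escape regex metacharacters so the separator is
// treated literally, both inside and outside a character class.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return the fields of `line`, treating `separator`
// as a delimiter unless it is preceded by a backslash.
// Assumes `separator` is a single character.
function splitUnescaped(line, separator) {
  const sep = escapeForRegExp(separator);
  const groups = new RegExp(
    `((\\\\${sep})|[^${sep}])([^${sep}]*?(\\\\${sep})?)+`,
    'g'
  );
  // match() returns null when nothing matches, so fall back to an empty array
  return line.match(groups) || [];
}

// abc\,def,ghi with , as separator
const byComma = splitUnescaped('abc\\,def,ghi', ',');
// abc\\def\ghi with \ as separator
const byBackslash = splitUnescaped('abc\\\\def\\ghi', '\\');

console.log(byComma);     // [ 'abc\\,def', 'ghi' ]
console.log(byBackslash); // [ 'abc\\\\def', 'ghi' ]
```

One caveat, as far as I can tell: because the expression matches fields rather than delimiters, empty fields are dropped, so a,,b yields only a and b, whereas split would also return an empty string between them.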