javascript regex typescript escaping negative-lookbehind

Negative lookbehind to not match escaped characters, fails on escaped backslash


Say I want to split a string at every separator character except the escaped ones. I can usually do that with a negative lookbehind and string.split(regex).

For example:

const regex = /(?<!\\)\,/;
'abc,def'.split(regex);   // ['abc', 'def']
'abc\\,def'.split(regex); // ['abc\\,def']

splits at the , in abc,def, but not in abc\,def. This is fine!

But if the separator character is itself a backslash, the negative lookbehind does not seem to work as expected:

const regex = /(?<!\\)\\/;
'abc\\def'.split(regex);   // ['abc', 'def']
'abc\\\\def'.split(regex); // ['abc', '\\def']

splits at the first \ in both abc\def AND abc\\def.

Naively, I would have expected the negative lookbehind to keep the pattern from matching a \ that is preceded by a \.
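As a rough sketch of what the engine does here: the lookbehind only inspects the single character before each candidate backslash. In abc\\def the first backslash follows c, so it is accepted, and only the second one (which does follow a backslash) is rejected. The `g` flag and the helper name `matchIndices` are additions for illustration and are not part of the original pattern.

```javascript
// Same pattern as above, with the g flag added so matchAll can be used.
const regex = /(?<!\\)\\/g;

// Hypothetical helper: index of every backslash the pattern accepts.
const matchIndices = (text) => [...text.matchAll(regex)].map((m) => m.index);

matchIndices('abc\\def');   // [3]
matchIndices('abc\\\\def'); // [3] (the backslash at index 4 is rejected, the one at index 3 is not)
```

So the lookbehind cannot tell an escaping backslash from a separator backslash; it only knows what the previous character was.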

See: https://regex101.com/r/ozkZR1/1

How can I use string.split(regex) to split at any non-escaped separator in a way that doesn't fall apart for special characters like a backslash or a line break (one should be able to escape them too)?


Solution

  • The solution was to reverse the operation:

    Instead of looking for the delimiters, I could look for the delimited character sequences. So for a , delimiter I would look for ((\\,)|[^,])([^,]*?(\\,)?)+: either an escaped comma or a non-comma character, followed by repetitions of a group made up of a (potentially empty) run of non-commas (reluctant, so it doesn't consume the \ of an escape) and an optional escaped comma.

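    let separator = ','; // get from sanitized input
    // A backslash separator must itself be escaped inside the pattern.
    separator = separator === '\\' ? '\\\\' : separator;
    // Match the fields (including escaped separators) rather than the separators.
    const groups = new RegExp(`((\\\\${separator})|[^${separator}])([^${separator}]*?(\\\\${separator})?)+`, 'g');
    const columns = line.match(groups); // line is the string to split; the result is null if nothing matches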
    

    This works for , as well as for \ as separators and will not split on \, and \\ respectively.

    The hardest part of that expression was to get all the escapes right.
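To make the escaping less error-prone, the same pattern could be wrapped in a small helper that escapes the separator before building the expression. This is only a sketch: `escapeRegExp` and `splitUnescaped` are hypothetical names, the separator is assumed to be a single character, and non-capturing groups are swapped in for the capturing ones since `match` with the `g` flag ignores captures anyway.

```javascript
// Hypothetical helper: backslash-escape characters that are special in a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\\-]/g, '\\$&');
}

// Hypothetical wrapper around the pattern above. Returns the fields of `line`,
// treating a backslash directly in front of the separator as an escape.
// Assumes `separator` is exactly one character.
function splitUnescaped(line, separator) {
  const sep = escapeRegExp(separator);
  const groups = new RegExp(
    `(?:\\\\${sep}|[^${sep}])(?:[^${sep}]*?(?:\\\\${sep})?)+`,
    'g'
  );
  // match() returns null when nothing matches, so fall back to an empty array.
  return line.match(groups) || [];
}

splitUnescaped('abc\\,def,ghi', ',');     // ['abc\\,def', 'ghi']
splitUnescaped('abc\\\\def\\ghi', '\\');  // ['abc\\\\def', 'ghi']
splitUnescaped('a\\|b|c', '|');           // ['a\\|b', 'c']
```

Because this matches fields rather than separators, empty fields are dropped: 'a,,b' yields two fields, not three. The escape sequences are also left in the returned fields as they are.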