
Splitting a string while keeping the delimiters except escaped ones (regex)


If I have a String which is delimited by a character, let's say this:

a-b-c

and I want to keep the delimiters, I can use look-behind and look-ahead to keep the delimiters themselves, like:

string.split("((?<=-)|(?=-))");

which results in

  • a
  • -
  • b
  • -
  • c

Now, if one of the delimiters is escaped, like this:

a-b\-c

and I want to honor the escape, I figured out that I can use a regex like this:

((?<=-(?!(?<=\\-)))|(?=-(?!(?<=\\-))))

ergo

string.split("((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))");

Now, this works and results in:

  • a
  • -
  • b\-c

(I'd later remove the backslash with string.replace("\\", ""); I haven't found a way to include that in the regex.)
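For reference, the split plus the later backslash removal can be sketched as below. This is only a minimal sketch: the class name `EscapedSplitDemo` and the method `splitAndUnescape` are hypothetical names chosen for illustration, and the `replace` call removes every backslash in a token, not only those that escape a `-`.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedSplitDemo {
    // Regex from the question, written as a Java string literal.
    static final String ESCAPE_AWARE =
            "((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))";

    // Splits around unescaped '-' (keeping the delimiters as tokens),
    // then drops all backslashes from each token.
    static List<String> splitAndUnescape(String input) {
        List<String> tokens = new ArrayList<>();
        for (String token : input.split(ESCAPE_AWARE)) {
            tokens.add(token.replace("\\", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Java literal "a-b\\-c" is the text a-b\-c
        System.out.println(splitAndUnescape("a-b\\-c")); // prints [a, -, b-c]
    }
}
```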

My problem is one of understanding.
The way I understood it, the regex would be, in words,

split ((if '-' is before (unless ('\-' is before))) or (if '-' is after (unless ('\-' is before))))

Why shouldn't the last part be "unless \ is before"? If '-' is after, that means we're between '\' and '-', so only \ should be before, not \\-, but it doesn't work if I change the regex to reflect that like this:

((?<=-(?!(?<=\\-)))|(?=-(?!(?<=\\))))

Result: a, -, b\, -c

What is the reason for this? Where is my error in reasoning?


Solution

  • Why shouldn't the last part be "unless \ is before"?

    In

    (?=-(?!(?<=\\-)))
        ^here
    

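    Here the cursor has already moved past the -, so the character directly before it is always - and never \. A look-behind for \ alone at that position can therefore never match, the negative look-ahead always succeeds, and the escape is ignored. That is why the look-behind has to be \\- (a backslash followed by -).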
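The effect of the cursor position can be checked in isolation by splitting with only the look-ahead branch. The sketch below is illustrative; the class name `LookbehindPositionDemo` and the constant names are hypothetical, and the input is the `a-b\-c` example from the question.

```java
import java.util.Arrays;

class LookbehindPositionDemo {
    // Look-ahead branch from the question: the inner look-behind is
    // evaluated after '-' has been matched, so it must look for "\-".
    static final String CHECKS_BACKSLASH_DASH = "(?=-(?!(?<=\\\\-)))";

    // Modified branch: looks for '\' directly before the cursor, but the
    // cursor already sits after '-', so the look-behind never matches.
    static final String CHECKS_BACKSLASH_ONLY = "(?=-(?!(?<=\\\\)))";

    static String[] split(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String input = "a-b\\-c"; // the text a-b\-c
        // prints [a, -b\-c]: the escaped '-' is left alone
        System.out.println(Arrays.toString(split(input, CHECKS_BACKSLASH_DASH)));
        // prints [a, -b\, -c]: the escape is ignored
        System.out.println(Arrays.toString(split(input, CHECKS_BACKSLASH_ONLY)));
    }
}
```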


    A simpler regex (written here as a Java string literal, hence the doubled backslashes) might be

    (?<=(?<!\\\\)-)|(?=(?<!\\\\)-)

    • (?<=(?<!\\\\)-) checks that we are right after a - that has no \ before it.
    • (?=(?<!\\\\)-) checks that we are right before a - that has no \ before it.
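As a quick sanity check, the simpler regex can be compared with the one from the question on a few inputs. This is a sketch under assumptions: the class name `SimplerRegexDemo`, the method names, and the sample inputs other than `a-b-c` and `a-b\-c` are made up for illustration, and agreement on these samples is not a proof of equivalence.

```java
import java.util.Arrays;

class SimplerRegexDemo {
    // Regex from the question.
    static final String ORIGINAL =
            "((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))";
    // Simpler regex from the answer: the negative look-behind for '\'
    // is placed directly in front of the '-' it guards.
    static final String SIMPLER = "(?<=(?<!\\\\)-)|(?=(?<!\\\\)-)";

    static String[] splitOriginal(String input) {
        return input.split(ORIGINAL);
    }

    static String[] splitSimpler(String input) {
        return input.split(SIMPLER);
    }

    public static void main(String[] args) {
        // Java literals for the texts a-b-c, a-b\-c and a\-b-c
        String[] samples = {"a-b-c", "a-b\\-c", "a\\-b-c"};
        for (String sample : samples) {
            String[] expected = splitOriginal(sample);
            String[] actual = splitSimpler(sample);
            System.out.println(sample + " -> " + Arrays.toString(actual)
                    + " same as original: " + Arrays.equals(expected, actual));
        }
    }
}
```

For `a-b\-c` both regexes yield the tokens a, -, and b\-c.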