
Java-8 regex negative lookbehind with `\R`


While answering another question, I wrote a regex to match all whitespace up to and including at most one newline. I did this using a negative lookbehind on the \R linebreak matcher:

((?<!\R)\s)*
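
To make the intent concrete, here is a minimal sketch of what the pattern consumes on inputs that contain only \n linebreaks, where there is no ambiguity. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingWhitespaceDemo {
    // The pattern from the question: a whitespace character is accepted
    // only if no linebreak sits directly behind it.
    static final Pattern WS_UP_TO_ONE_NEWLINE = Pattern.compile("((?<!\\R)\\s)*");

    /** Returns the run of whitespace the pattern consumes at the start of the input. */
    static String leadingMatch(String input) {
        Matcher m = WS_UP_TO_ONE_NEWLINE.matcher(input);
        // The star quantifier can match the empty string, so lookingAt() always succeeds.
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Two spaces and the first LF are consumed; the second LF is left alone.
        System.out.println(leadingMatch("  \n\n  next").equals("  \n")); // true
        // Without any linebreak, all leading whitespace is consumed.
        System.out.println(leadingMatch("   next").equals("   "));       // true
    }
}
```

The inputs deliberately avoid \r\n, because that is exactly the case the rest of the question is about.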

Afterwards I was thinking about it and I said, oh no what if there is a \r\n? Surely it will grab the first linebreakish character \r and then I will be stuck with a spurious \n on the front of my next string, right?

So I went back to test (and presumably fix) it. However, when I tested the pattern, it matched an entire \r\n. It does not match only the \r leaving \n as one might expect.

"\r\n".matches("((?<!\\R)\\s)*"); // true, expected false

However, when I use the "equivalent" pattern mentioned in the documentation for \R, it returns false. So is that a bug with Java, or is there a valid reason why it matches?


Solution

  • Realization #1. The documentation is wrong

    Source: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

    Here it says:

    Linebreak matcher

    ...is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

    However, when we try using the "equivalent" pattern, it returns false:

    String _R_ = "\\R";
    System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true
    
    // using "equivalent" pattern
    _R_ = "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";
    System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // false
    
    // now make it atomic, as per sln's answer
    _R_ = "(?>"+_R_+")";
    System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true
    

    So the Javadoc should really say:

    ...is equivalent to (?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

    Update March 9, 2017 per Sherman at Oracle JDK-8176029:

    "api doc is NOT wrong, the implementation is wrong (which fails to backtracking "0x0d+next.match()" when "0x0d+0x0a + next.match()" fails)"
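
The backtracking that the quote refers to can be seen without any lookbehind. The sketch below (class and method names are hypothetical) asks for one linebreak followed by a literal \n against the input \r\n. The documented alternation can give back the \n and match \r alone, while an atomic group cannot. The result for \R itself is printed but not annotated, since it depends on whether the running JDK behaves like the documentation or like the implementation described in the quote.

```java
import java.util.regex.Pattern;

class LinebreakBacktrackDemo {
    // The alternation the Javadoc gives as the equivalent of the linebreak matcher.
    static final String DOCUMENTED =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    /** Does "one linebreak, then a literal LF" match the two-character input CR LF? */
    static boolean crlfMatches(String linebreak) {
        return Pattern.matches(linebreak + "\\n", "\r\n");
    }

    public static void main(String[] args) {
        // Plain alternation: CR LF is tried first, fails, then CR alone succeeds.
        System.out.println(crlfMatches("(?:" + DOCUMENTED + ")")); // true
        // Atomic group: once CR LF is taken, it is never given back.
        System.out.println(crlfMatches("(?>" + DOCUMENTED + ")")); // false
        // The built-in linebreak matcher: outcome depends on the JDK in use.
        System.out.println(System.getProperty("java.version") + ": " + crlfMatches("\\R"));
    }
}
```

If the last line prints false, the runtime treats \R like the atomic form; if it prints true, it follows the documented alternation.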


  • Realization #2. Lookbehinds don't only look backwards

    Despite the name, a lookbehind is not only able to look backwards, but can include and even jump over the current position.

    Consider the following example (from rexegg.com):

    "_12_".replaceAll("(?<=_(?=\\d{2}_))\\d+", "##"); // _##_
    

    "This is interesting for several reasons. First, we have a lookahead within a lookbehind, and even though we were supposed to look backwards, this lookahead jumps over the current position by matching the two digits and the trailing underscore. That's acrobatic."
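
A small sketch that wraps the same replacement in a helper (the class and method names are made up) shows that the nested lookahead really does inspect text past the current position: changing only what comes after the digits changes whether the lookbehind succeeds.

```java
class LookaheadInLookbehindDemo {
    /** Applies the rexegg-style pattern: digits are masked only when the lookbehind succeeds. */
    static String mask(String input) {
        return input.replaceAll("(?<=_(?=\\d{2}_))\\d+", "##");
    }

    public static void main(String[] args) {
        // Underscore behind, exactly two digits and an underscore ahead: replaced.
        System.out.println(mask("_12_"));  // _##_
        // Three digits ahead: the lookahead inside the lookbehind fails, nothing is replaced.
        System.out.println(mask("_123_")); // _123_
        // No trailing underscore: again the text ahead decides, nothing is replaced.
        System.out.println(mask("_12"));   // _12
    }
}
```

In all three inputs the character behind the digits is the same underscore, so the differing results come entirely from what lies ahead.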

    What this means for our example of \R is that even though our current position may be \n, that will not stop the lookbehind from recognizing that its \r is followed by \n, then binding the two together as an atomic group, and consequently refusing to recognize the \r part behind the current position as a separate match.
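
The same effect can be reproduced with the explicit alternation instead of \R. In this sketch (helper names are hypothetical) the lookbehind is evaluated just before the \n of a \r\n pair. With the plain alternation, the \r behind counts as a linebreak and the negative lookbehind rejects the position. With the atomic group, \r\n is taken as a unit that overshoots the position, nothing is given back, and the \n is accepted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindPositionDemo {
    // The alternation the Javadoc gives as the equivalent of the linebreak matcher.
    static final String DOCUMENTED =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    /** Index of the first LF that is not preceded by the given linebreak pattern, or -1. */
    static int firstAcceptedLf(String linebreak, String input) {
        Matcher m = Pattern.compile("(?<!" + linebreak + ")\\n").matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Plain alternation: CR alone ends right behind the LF, so the LF is rejected.
        System.out.println(firstAcceptedLf(DOCUMENTED, "\r\n"));              // -1
        // Atomic group: CR LF is bound together and cannot end behind the LF.
        System.out.println(firstAcceptedLf("(?>" + DOCUMENTED + ")", "\r\n")); // 1
    }
}
```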

    Note: for simplicity's sake I have used phrases such as "our current position is \n". This is not an exact description of what happens internally.