
Error compiling a verbose Java regex with character class and word boundary


Why does this pattern fail to compile:

Pattern.compile("(?x)[ ]\\b");

Error

ERROR java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 8
(?x)[ ]\b
        ^
at java_util_regex_Pattern$compile.call (Unknown Source)

While the following equivalent ones work?

Pattern.compile("(?x)\\ \\b");
Pattern.compile("[ ]\\b");
Pattern.compile(" \\b");

Is this a bug in the Java regex compiler, or am I missing something? I like to use [ ] in verbose regex instead of backslash-backslash-space because it saves some visual noise. But apparently they are not the same!

PS: this issue is not about backslashes. It's about escaping spaces in a verbose regex using a character class containing a single space [ ] instead of using a backslash.

Somehow the combination of verbose regex (?x) and a character class containing a single space [ ] throws the compiler off and makes it fail to recognize the word boundary escape \b.


Tested with Java up to 1.8.0_151


Solution

  • This is a bug in the peekPastWhitespace() method of Java's Pattern class. To trace the issue down, I looked at the Pattern implementation in OpenJDK 8-b132. Following the calls from the top:

    1. compile() calls expr() on line 1696
    2. expr() calls sequence() on line 1996
    3. sequence() calls clazz() on line 2063 because the [ case is hit
    4. clazz() calls peek() on line 2509
    5. peek() calls peekPastWhitespace() on line 1830 because if(has(COMMENTS)) evaluates to true (the (?x) flag at the beginning of the pattern enables COMMENTS mode)
    6. peekPastWhitespace() (posted below) skips all whitespace in the pattern, including the space inside the character class.

    peekPastWhitespace()

    private int peekPastWhitespace(int ch) {
        while (ASCII.isSpace(ch) || ch == '#') {
            while (ASCII.isSpace(ch))
                ch = temp[++cursor];
            if (ch == '#') {
                ch = peekPastLine();
            }
        }
        return ch;
    }
    

    The same bug exists in the parsePastWhitespace() method.

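    Because the space is skipped, your regex is interpreted as []\\b. That is the cause of your error: \b is not supported inside a character class in Java. Moreover, once you fix the \b issue, the character class still has no closing ], since the ] that directly follows [ is read as a literal member of the class rather than as its terminator.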
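
A quick way to see the whitespace skipping in isolation is a character class with a space between two ordinary characters. This is only an illustrative sketch: the class name CommentsModeClassDemo, the helper fullMatch(), and the sample patterns are made up for the demonstration and do not come from the JDK source. If the space is dropped from the class in comments mode, the class should stop matching a space.

```java
import java.util.regex.Pattern;

public class CommentsModeClassDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without (?x) the space is an ordinary member of the class.
        System.out.println("[a b] vs space:     " + fullMatch("[a b]", " "));
        // With (?x) the parser skips the space while reading the class,
        // so the class behaves like [ab].
        System.out.println("(?x)[a b] vs space: " + fullMatch("(?x)[a b]", " "));
        System.out.println("(?x)[a b] vs a:     " + fullMatch("(?x)[a b]", "a"));
    }
}
```

The first call uses the class as written; the other two show that, under (?x), the same class still matches its letters but not the space.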

    What you can do to fix this problem:

    1. \\ followed by a space: as the OP mentioned, simply escape the space with a double backslash
    2. [\\ ]: escape the space within the character class so that it is interpreted literally
    3. [ ](?x)\\b: place the inline modifier after the character class
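
The three workarounds can be checked side by side with a small program. This is a minimal sketch: the class name VerboseSpaceWorkarounds, the helpers compiles() and finds(), and the test input " a" are assumptions chosen for illustration. The failing pattern from the question is only probed for whether it compiles, since that result may depend on the Java version.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class VerboseSpaceWorkarounds {
    // Hypothetical helper: true if the regex compiles without error.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Hypothetical helper: true if the regex matches somewhere in the input.
    public static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // The pattern from the question, reported to fail on Java 8.
        System.out.println("(?x)[ ]\\b compiles: " + compiles("(?x)[ ]\\b"));

        String[] workarounds = {
            "(?x)\\ \\b",    // 1. backslash-escaped space
            "(?x)[\\ ]\\b",  // 2. escaped space inside the class
            "[ ](?x)\\b"     // 3. (?x) placed after the class
        };
        for (String regex : workarounds) {
            // " a" has a space directly before a word character.
            System.out.println(regex + " finds in \" a\": " + finds(regex, " a"));
        }
    }
}
```

Each workaround should both compile and match the space in front of the word character, which is what the original [ ]\\b pattern does without (?x).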