Why does this pattern fail to compile:
Pattern.compile("(?x)[ ]\\b");
Error
ERROR java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 8
(?x)[ ]\b
        ^
at java_util_regex_Pattern$compile.call (Unknown Source)
While the following equivalent ones work?
Pattern.compile("(?x)\\ \\b");
Pattern.compile("[ ]\\b");
Pattern.compile(" \\b");
Is this a bug in the Java regex compiler, or am I missing something? I like to use [ ] in verbose regex instead of backslash-backslash-space because it saves some visual noise. But apparently they are not the same!
PS: this issue is not about backslashes. It's about escaping spaces in a verbose regex using a character class containing a single space [ ] instead of using a backslash.
Somehow the combination of verbose regex (?x) and a character class containing a single space [ ] throws the compiler off and makes it not recognize the word boundary escape \b.
Tested with Java up to 1.8.0_151
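For reference, the observations above can be checked with a small harness along these lines. This is only a sketch: the class name VerboseSpaceRepro and the helper compiles are made up for illustration, and the comments repeat the results reported above rather than a guaranteed outcome on every JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical harness; the names are illustrative and not part of any library.
final class VerboseSpaceRepro {

    // Returns true if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "(?x)[ ]\\b",  // reported above as failing
            "(?x)\\ \\b",  // reported above as working
            "[ ]\\b",      // reported above as working
            " \\b"         // reported above as working
        };
        for (String regex : candidates) {
            String result = compiles(regex) ? "compiles" : "PatternSyntaxException";
            System.out.println(regex + " -> " + result);
        }
    }
}
```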
This is a bug in Java's peekPastWhitespace() method in the Pattern class. To trace the issue down, I took a look at OpenJDK 8-b132's Pattern implementation. Let's start hammering this down from the top:
compile() calls expr() on line 1696
expr() calls sequence() on line 1996
sequence() calls clazz() on line 2063 since the case of [ was met
clazz() calls peek() on line 2509
peek() calls peekPastWhitespace() on line 1830 since if(has(COMMENTS)) evaluates to true (due to having added the x flag (?x) at the beginning of the pattern)
peekPastWhitespace() (posted below) skips all spaces in the pattern.

private int peekPastWhitespace(int ch) {
    while (ASCII.isSpace(ch) || ch == '#') {
        while (ASCII.isSpace(ch))
            ch = temp[++cursor];
        if (ch == '#') {
            ch = peekPastLine();
        }
    }
    return ch;
}
The same bug exists in the parsePastWhitespace() method.
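Your regex is being interpreted as []\\b. Because the space is skipped, the ] comes first inside the class and is read as a literal, so the \b that follows is still parsed as part of the class. That is the cause of your error, because \b is not supported in a character class in Java. Moreover, once you fix the \b issue, your character class also doesn't have a closing ].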
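One way to check this reading is to compile, without the x flag, the pattern the parser effectively sees once the space has been skipped. The following is a rough sketch: SkippedSpaceDemo and its describe helper are illustrative names only, and the exact message text may differ between JDK versions.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of the JDK.
final class SkippedSpaceDemo {

    // Returns "ok" if the regex compiles, otherwise the parser's error description.
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return "ok";
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // What the verbose pattern looks like after the space is skipped.
        System.out.println("[]\\b  -> " + describe("[]\\b"));
        // The same class with the \b removed: the class is never closed.
        System.out.println("[]    -> " + describe("[]"));
        // The original class without the x flag, for comparison.
        System.out.println("[ ]\\b -> " + describe("[ ]\\b"));
    }
}
```

If the first two lines report errors while the third prints ok, that matches the explanation above: the space inside the brackets is what keeps the class well-formed.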
What you can do to fix this problem:
"\\ ": As the OP mentioned, simply use double backslash and space.
"[\\ ]": Escape the space within the character class so that it gets interpreted literally.
"[ ](?x)\\b": Place the inline modifier after the character class.
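The three options can be compared side by side with something like the sketch below. WorkaroundDemo, findsMatch and the sample input "foo bar" are assumptions made for illustration; the first two patterns are written with the (?x) flag in front, which is how they would be used in a verbose regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample input are illustrative only.
final class WorkaroundDemo {

    // Compiles the regex and reports whether it matches anywhere in the input.
    static boolean findsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "foo bar"; // the space is followed by a word character
        String[] workarounds = {
            "(?x)\\ \\b",    // backslash-escaped space
            "(?x)[\\ ]\\b",  // escaped space inside the character class
            "[ ](?x)\\b"     // inline modifier placed after the character class
        };
        for (String regex : workarounds) {
            System.out.println(regex + " -> " + findsMatch(regex, input));
        }
    }
}
```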