Java 21 Regex Word-boundary matcher Unicode change


I noticed that the semantics of the Java regex word-boundary matcher \b changed significantly with Java 21. Up to (at least) Java 17 it was Unicode-aware: ß counted as a word character, so there was no boundary between a and ß, and the regex a\b.* DID NOT match the string "aß".

Apparently with Java 21 it is now defined in terms of the \w character class, which by default is not Unicode-enabled. So now a\b.* suddenly DOES match "aß". The only way I can see to "fix" \b is to enable the UNICODE_CHARACTER_CLASS flag, but that of course changes ALL the character classes, which is also different from the pre-Java-21 behavior.
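
To make that trade-off concrete, here is a minimal sketch of what UNICODE_CHARACTER_CLASS does to both \b and \w. The class and method names are made up for illustration. The checks deliberately avoid the default \b semantics, so they should give the same results on Java 17 and Java 21.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
final class BoundaryFlagDemo {

    // a\b.* compiled with UNICODE_CHARACTER_CLASS, so \b uses the Unicode word definition.
    static boolean boundaryWithFlag(String input) {
        return Pattern.compile("a\\b.*", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input)
                      .matches();
    }

    // Checks whether the whole input is a single \w character, with or without the flag.
    static boolean isWordChar(String input, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\w", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eszett = "\u00DF"; // "ß", written as an escape to avoid source-encoding issues

        // With the flag, ß is a word character again, so there is no boundary after "a".
        System.out.println(boundaryWithFlag("a" + eszett)); // false

        // The side effect: \w itself also changes meaning under the flag.
        System.out.println(isWordChar(eszett, false));      // false
        System.out.println(isWordChar(eszett, true));       // true
    }
}
```

The last two lines are the part that worries me: the flag restores the old \b result, but \w (and the other predefined classes) no longer behave as they did before.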

Weirdly, I cannot find any information on this breaking change. Nothing in the Java 21 release notes, and various googling attempts did not yield anything helpful. For such breaking changes of essential core libs I would at least expect a big fat warning and also a feature flag to re-enable the old behavior. Anyone know anything about that?

MWE:

echo 'System.out.println("aß".matches("a\\b.*"))' | /usr/lib/jvm/java-17-openjdk/bin/jshell
Output: false

vs.

echo 'System.out.println("aß".matches("a\\b.*"))' | /usr/lib/jvm/java-21-openjdk/bin/jshell -q
Output: true


Solution

  • Thanks to @Sweeper for digging it out. Here is the original bug report:

    https://bugs.openjdk.org/browse/JDK-8282129

    So the change actually shipped with Java 19, not Java 21 (Java 21 is just the first LTS release after 17 to include it), and it is in fact mentioned in the release notes: https://www.oracle.com/java/technologies/javase/19all-relnotes.html

    From what I gather there, there is no feature flag for it because the bug's compatibility risk was classified as "low", due to the following opinion:

    The existing behavior of the \b metacharacter in Java regex strings is longstanding and changing it may impact existing regular expressions that rely on this inconsistent (with respect to Unicode characters) behavior. However, the use of \b is less common and code that focuses on ASCII-encoded data or similar will be unaffected.

    So effectively everyone who depends on the old (admittedly inconsistent) behavior is left to deal with the fallout on their own :-/
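
If switching on UNICODE_CHARACTER_CLASS for the whole pattern is too broad, two possible workarounds are sketched below. Both are my own assumptions, not something taken from the bug report. The first scopes the flag to the boundary itself with the inline group (?U:\b), which should leave \w, \d etc. elsewhere in the pattern ASCII-only. The second spells the boundary out with lookarounds over [\p{L}\p{Nd}_], which roughly mirrors the old "letter, digit or underscore" test but, as far as I can tell, ignores the special handling of combining marks that the old \b had. Class and constant names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not part of any JDK API.
final class LegacyBoundaryWorkaround {

    // Option 1: enable UNICODE_CHARACTER_CLASS only for the \b itself.
    static final Pattern SCOPED = Pattern.compile("a(?U:\\b).*");

    // The same scoping, followed by a \w that sits outside the (?U:...) group.
    static final Pattern SCOPED_THEN_W = Pattern.compile("(?U:\\b)\\w");

    // Option 2: approximate the old default with explicit Unicode categories.
    // Assumption: "word character" = letter, decimal digit or underscore.
    static final String WORD = "[\\p{L}\\p{Nd}_]";
    static final String LEGACY_BOUNDARY =
            "(?:(?<=" + WORD + ")(?!" + WORD + ")"      // word char on the left only
            + "|(?<!" + WORD + ")(?=" + WORD + "))";    // word char on the right only
    static final Pattern EMULATED = Pattern.compile("a" + LEGACY_BOUNDARY + ".*");

    public static void main(String[] args) {
        String s = "a\u00DF"; // "aß"

        System.out.println(SCOPED.matcher(s).matches());       // false
        System.out.println(SCOPED.matcher("a b").matches());   // true

        // \w outside the scoped group is still ASCII-only, so ß is rejected.
        System.out.println(SCOPED_THEN_W.matcher("\u00DF").matches()); // false

        System.out.println(EMULATED.matcher(s).matches());     // false
        System.out.println(EMULATED.matcher("a b").matches()); // true
    }
}
```

Neither variant is guaranteed to reproduce the pre-19 behavior for every input, so it is worth running them against your own test strings before relying on either.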