java regex java-11 java-16

Regex \p{Cs} not matching symbol in Java 16


Does anyone know why the regex \p{Cs} does not match the symbol 񡠼 in Java 16? It used to match it in Java 11.

Java 11

jshell 
|  Welcome to JShell -- Version 11.0.7
|  For an introduction type: /help intro

jshell> import java.util.regex.*

jshell> var text = new StringBuilder().appendCodePoint(55622).appendCodePoint(56380)
text ==> 񡠼

jshell> Pattern.compile("\\p{Cs}").matcher(text).find()
$3 ==> true

Java 16

INFO: Created user preferences directory.
|  Welcome to JShell -- Version 16.0.1
|  For an introduction type: /help intro

jshell> import java.util.regex.*

jshell> var text = new StringBuilder().appendCodePoint(55622).appendCodePoint(56380)
text ==> 񡠼

jshell> Pattern.compile("\\p{Cs}").matcher(text).find()
$3 ==> false

Solution

  • Your “symbol 񡠼” is the codepoint 399420 (U+6183C), which the Unicode standard has not assigned (yet), so if you see something meaningful rendered here, that is non-standard behavior of your system.

    The way you construct the string is semantically incorrect, but it happens to produce the intended string. For historical reasons, Java’s API is centered around a UTF-16 representation.

    When you define the symbol using two surrogate characters, i.e.

    var text = "\uD946\uDC3C";
    System.out.println(text.codePointAt(0));
    

    you’ll get

    399420
    

    On the other hand, when you use

    var text = new StringBuilder().appendCodePoint(399420);
    text.chars().forEach(c -> System.out.printf("\\u%04X", c));
    System.out.println();
    

    you’ll get

    \uD946\uDC3C
    

    In other words, the sequence of the two UTF-16 surrogate char units \uD946, \uDC3C is equivalent to the single codepoint 399420. Conceptually, the string consists of that single codepoint, so

    System.out.println(text.codePointCount(0, text.length()) + " codepoint(s)");
    System.out.println(text.codePointAt(0));
    System.out.println("type " + Character.getType(text.codePointAt(0)));
    

    will print

    1 codepoint(s)
    399420
    type 0
    

    in either case. The type 0 (Character.UNASSIGNED) indicates that this codepoint is unassigned.
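
    The equivalence between the two char units and the single codepoint follows from the UTF-16 surrogate formula. The following is a minimal sketch of that arithmetic; the class name SurrogateMath and the helper combine are made up for illustration, while Character.toCodePoint and Character.toChars are the standard JDK methods for the two directions.

```java
public class SurrogateMath {
    // Combines a high and a low surrogate using the UTF-16 formula.
    // (Hypothetical helper; Character.toCodePoint does the same job.)
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = '\uD946', low = '\uDC3C';
        System.out.println(combine(high, low));               // prints 399420
        System.out.println(Character.toCodePoint(high, low)); // prints 399420

        // Reverse direction: split the codepoint into its two char units.
        for (char c : Character.toChars(399420)) {
            System.out.printf("\\u%04X", (int) c);
        }
        System.out.println();                                 // line reads \uD946\uDC3C
    }
}
```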

    You are calling appendCodePoint to append two UTF-16 units to the StringBuilder. Since this method appends any value in the BMP range as the single char with the same value, and the surrogate values lie within that range, it happens to construct the same string.
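
    To see that both ways of building the string end up with the same content, the two variants can be compared directly. This is only a sketch; the class and method names are invented for the comparison.

```java
public class BuildString {
    // What the question did: two "codepoints" that are really surrogate values.
    static String viaSurrogateValues() {
        return new StringBuilder()
                .appendCodePoint(55622)   // 0xD946, high surrogate
                .appendCodePoint(56380)   // 0xDC3C, low surrogate
                .toString();
    }

    // The semantically correct form: one supplementary codepoint.
    static String viaCodePoint() {
        return new StringBuilder().appendCodePoint(399420).toString();
    }

    public static void main(String[] args) {
        String a = viaSurrogateValues();
        String b = viaCodePoint();
        System.out.println(a.equals(b));                      // prints true
        System.out.println(a.length());                       // prints 2 (char units)
        System.out.println(a.codePointCount(0, a.length()));  // prints 1 (codepoint)
    }
}
```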

    Since the category of the codepoint is “unassigned”, it cannot be “surrogate” at the same time, so \p{Cs} should never find a match here. When processing a valid Unicode string, you should never encounter this category, as it can only match dangling surrogate characters which cannot be interpreted as part of a codepoint outside the BMP.

    But there is the bug JDK-8247546, “Pattern matching does not skip correctly over supplementary characters”. Before Java 16, the regex engine processed the codepoint at location zero correctly, but then advanced by only one char position, so it found a dangling surrogate character when looking at char position 1 alone.
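
    For contrast, here is a small sketch of the case \p{Cs} is meant for: strings where a surrogate has no partner. The class name and the test strings are made up for the illustration. The correctly paired string from the question is deliberately left out, because its result depends on the JDK version.

```java
import java.util.regex.Pattern;

public class DanglingSurrogate {
    static final Pattern CS = Pattern.compile("\\p{Cs}");

    // True if the regex engine finds a character of category "surrogate".
    static boolean hasCsMatch(String s) {
        return CS.matcher(s).find();
    }

    public static void main(String[] args) {
        String loneHigh = "a\uD946b"; // high surrogate not followed by a low one
        String loneLow  = "a\uDC3Cb"; // low surrogate not preceded by a high one
        System.out.println(hasCsMatch(loneHigh)); // prints true
        System.out.println(hasCsMatch(loneLow));  // prints true
        System.out.println(hasCsMatch("abc"));    // prints false
    }
}
```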

    We can verify it using

    var m = Pattern.compile("\\p{Cs}").matcher(text);
    if(m.find()) {
        System.out.println("found a match at " + m.start());
    }
    

    which prints “found a match at 1” prior to JDK 16, which is wrong, as position 1 should be skipped when there’s a single codepoint at char positions 0 and 1.

    This bug has been fixed in JDK 16. So now, the string is treated as a single codepoint of the “unassigned” category. Of course, this category might change in the future if the codepoint gets assigned. But it should never be “surrogate”.
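
    If the actual goal is to detect unpaired surrogates in a way that behaves the same before and after the fix, one option is to skip the regex engine and inspect the codepoints directly. This is a sketch of that idea, not something the bug report prescribes; the class and method names are hypothetical. It relies on CharSequence.codePoints(), which combines valid surrogate pairs and passes unpaired surrogates through unchanged.

```java
public class SurrogateScan {
    // True if the text contains a surrogate char unit that is not part of a valid pair.
    static boolean hasUnpairedSurrogate(CharSequence text) {
        return text.codePoints()
                   .anyMatch(cp -> Character.getType(cp) == Character.SURROGATE);
    }

    public static void main(String[] args) {
        // Valid pair: seen as the single unassigned codepoint 399420.
        System.out.println(hasUnpairedSurrogate("\uD946\uDC3C")); // prints false
        // Reversed order: neither unit forms a pair.
        System.out.println(hasUnpairedSurrogate("\uDC3C\uD946")); // prints true
    }
}
```

    Because the parameter is a CharSequence, the StringBuilder from the question can be passed in as is.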