Search code examples
javacharutf-16println

System.out.println() behavior on surrogate pair char[]


When I execute the code:

char[] c1 = {'a','b'};
char[] c2 = Character.toChars(0x10437);
System.out.println(c1);
System.out.println(c2);

I get:

ab
𐐷

By reading the documentation of Character.toChars, I get that c2 is a char array of size 2, which is the surrogate pair that represents 0x10437 in UTF-16.

What I don't understand is that c1 and c2 are both char arrays of length 2, then how does println know that c1 represents 2 characters ('a' and 'b') while c2 represents only 1 character 𐐷? Why println doesn't recognize c2 as two characters?


Solution

  • The javadoc for Character.tochars(int) states (with my emphasis added):

    Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.

    So, as you point out, your statement char[] c2 = Character.toChars(0x10437); causes the high-surrogate of that code point to be placed in c2[0], and the low-surrogate of the code point to be placed in c2[1];

    You can display the hex values of that surrogate pair:

    System.out.printf("High: 0x%X%n", (int)c2[0]);
    System.out.printf("Low:  0x%X%n", (int)c2[1]);
    

    This is the output from that code:

    High = 0xD801
    Low = 0xDC37
    

    Now println() "knows" to treat those two array elements as a surrogate pair for rendering a single character because surrogate values are in specific ranges:

    • The range U+D800 to U+DBFF is exclusively for high surrogates.
    • The range U+DC00 to U+DFFF is exclusively for low surrogates.

    (Somewhat confusingly, high surrogate values are lower than low surrogate values.) See section 3.8 Surrogates of the Unicode Standard for the rules, but this is the crucial point: "Isolated surrogate code units have no interpretation on their own". So when println() encounters the value 0xD801 it knows that it can only be a high surrogate (and nothing else), and that it must be followed by a low surrogate in the range U+DC00 to U+DFFF. It can then construct a code point from that surrogate pair, and prints the character represented by that code point.

    A lot of confusion arises from the multiple meanings of the word "character" in Java, Unicode and plain English. Generally we think of a one-to-one mapping between some digital character value (i.e. code point) and its visual representation, so the Unicode character U+0077 represents the visual character w, and the Java character (char) 'w'. But in your example, the surrogate pair U+D801 and 0xDC37 do not represent individual characters at all, and they are only meaningful when combined together to form the code point (character) x10437, which is represented by the visual character 𐐷.