Tags: java, string, unicode, utf-8, windows-1252

Different code points for the same character on macOS and Windows


I have a small piece of code in which I am checking the code point of the character Ü.

Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));

I get a different code point value when I run this code on macOS and on Windows 10; see the output below.

Output on MacOS

en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220

Output on Windows

en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195

I checked the code page for windows-1252 at https://en.wikipedia.org/wiki/Windows-1252#Character_set; there the code point for Ü is 220. For String glyph = "Ü"; why do I get 195 as the code point on Windows? As I understand it, glyph should have been rendered properly and the code point should have been 220, since the character is defined in Windows-1252.

If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8")); then glyph is rendered correctly and the code point value is 220. Is this the correct and efficient way to standardize the behavior of String on any OS, irrespective of locale and charset?


Solution

  • 195 is 0xC3 in hex.

    In UTF-8, Ü is encoded as bytes 0xC3 0x9C.

    System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but clearly your Java file is actually encoded in UTF-8. The fact that println() is outputting glyph ?? (note 2 ?, meaning 2 chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.

    glyph should have a single char whose value is 0x00DC, not 2 chars. codePointAt(0) is returning 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being loaded as if it were encoded in Windows-1252 instead, so the 2 bytes 0xC3 0x9C get decoded as the 2 characters 0x00C3 (Ã) and 0x0153 (œ, which is what Windows-1252 maps byte 0x9C to) instead of as the single character 0x00DC.

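    You need to specify the actual file encoding, eg when launching Java:

    java -Dfile.encoding=UTF-8 ...

    If the class is compiled separately, the string literal is decoded by the compiler rather than at run time, so the source encoding has to be given to javac instead:

    javac -encoding UTF-8 ...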
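
The mis-decoding described above can be reproduced without depending on how the source file is saved. The following is only a minimal sketch: the class name MojibakeDemo and the helper decodeSourceBytes are made up for illustration, and it assumes the JVM ships the windows-1252 charset (common, but not guaranteed by the Java specification). It takes the UTF-8 bytes of Ü (written as the escape \u00dc so the file encoding does not matter), decodes them once as UTF-8 and once as Windows-1252, and then applies the re-encoding workaround from the question. Note that Windows-1252 maps byte 0x9C to U+0153 (339).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: decodes the UTF-8 bytes of U+00DC using the charset
    // that a compiler or loader is assumed to apply to the source file.
    static String decodeSourceBytes(Charset assumed) {
        byte[] utf8 = "\u00dc".getBytes(StandardCharsets.UTF_8); // bytes 0xC3 0x9C
        return new String(utf8, assumed);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        String right = decodeSourceBytes(StandardCharsets.UTF_8);
        String wrong = decodeSourceBytes(cp1252);

        // Correct decoding: one char, code point 220 (0x00DC).
        System.out.println(right.length() + " char, code point " + right.codePointAt(0));

        // Wrong decoding: two chars, code points 195 (0x00C3) and 339 (0x0153).
        System.out.println(wrong.length() + " chars, code points "
                + wrong.codePointAt(0) + " " + wrong.codePointAt(1));

        // The workaround from the question: re-encode with the wrong charset,
        // then decode as UTF-8. This recovers code point 220.
        String repaired = new String(wrong.getBytes(cp1252), StandardCharsets.UTF_8);
        System.out.println("repaired code point " + repaired.codePointAt(0));
    }
}
```

The round trip at the end only works because Windows-1252 happens to define a character for both bytes involved; it repairs the damage after the fact instead of preventing it. Writing the literal as "\u00dc", or telling the compiler the real source encoding, avoids the problem at its origin.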