I have a small piece of code in which I am checking the code point of the character Ü.
import java.nio.charset.Charset;
import java.util.Locale;

Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));
I am getting different values for the code point when I run this code on macOS and Windows 10; see the output below.
Output on macOS
en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220
Output on Windows
en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195
I checked the code page for windows-1252 at https://en.wikipedia.org/wiki/Windows-1252#Character_set; there the code point for Ü is 220.
For String glyph = "Ü";, why do I get the code point 195 on Windows? As per my understanding, glyph should have been rendered properly and the code point should have been 220, since Ü is defined in Windows-1252.
If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8"));, then glyph is rendered correctly and the code point value is 220.
Is this the correct and efficient way to standardize the behavior of String on any OS, irrespective of locale and charset?
195 is 0xC3 in hex. In UTF-8, Ü is encoded as the two bytes 0xC3 0x9C.
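This is easy to verify directly; a minimal sketch using the standard StandardCharsets API (the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        // U+00DC ("Ü") needs two bytes in UTF-8
        byte[] bytes = "\u00dc".getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // mask to an int so the byte is printed as an unsigned hex value
            sb.append(String.format("0x%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // prints: 0xC3 0x9C
    }
}
```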
System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but your Java source file is actually encoded in UTF-8. The fact that println() outputs glyph ?? (note the two ?, meaning two chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.
glyph should contain a single char whose value is 0x00DC, not two chars. codePointAt(0) returns 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being read as if it were encoded in Windows-1252, so the two bytes 0xC3 0x9C get decoded as the two characters 0x00C3 and 0x0153 ("Ãœ"; in Windows-1252 the byte 0x9C maps to œ, U+0153) instead of as the single character 0x00DC.
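You can reproduce the mis-decoding on any platform by decoding the UTF-8 bytes with the wrong charset explicitly (a minimal sketch; the class name is illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // The UTF-8 bytes of "Ü": 0xC3 0x9C
        byte[] utf8 = "\u00dc".getBytes(StandardCharsets.UTF_8);

        // Decode them as Windows-1252, mimicking a compiler whose
        // platform default encoding is Cp1252
        String mojibake = new String(utf8, Charset.forName("windows-1252"));

        System.out.println(mojibake.length());       // 2 chars, not 1
        System.out.println(mojibake.codePointAt(0)); // 195 (0x00C3, 'Ã')
        System.out.println(mojibake.codePointAt(1)); // 339 (0x0153, 'œ')
    }
}
```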
You need to specify the actual file encoding when running Java, e.g.:
java -Dfile.encoding=UTF-8 ...