I am processing client emails written in Japanese. Some have HTML bodies encoded using character set iso-2022-jp. I found a strange scenario where I am unable to decode a single Japanese kanji character using character set iso-2022-jp.
Sample code to reproduce the issue:
final String z = "髙";
final Charset charset = Charset.forName("iso-2022-jp");
final byte[] byteArr = z.getBytes(charset);
final String z2 = new String(byteArr, charset);
System.out.println(z); // prints "髙"
System.out.println(z2); // prints "?"
If I use charset "utf-8", it works fine.
To be clear, I am absolutely sure the character here is Unicode character U+9AD9. This is a common character in Japanese text, e.g., the Takashimaya department store: 髙島屋. The above code correctly encodes and decodes the last two characters: 島 and 屋.
I am 99.99% sure I am using the decode/encode API incorrectly. What am I doing wrong?
Finally, I am debugging with IntelliJ 2020 on Windows 7 using the latest JDK 11. I also deal with Japanese text on a regular basis, so I know my fonts are set up correctly.
Thank you for the very helpful comments.
<meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
Unbelievable! It works after changing iso-2022-jp to x-windows-iso2022jp. (Hat tip to Anish B!)
From Oracle's Supported Encodings documentation:
You have to use the x-windows-iso2022jp charset. It is a variant of ISO-2022-JP (MS932 based).
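Before switching charsets, it can help to confirm that the "?" comes from the encoding step rather than from decoding or the console. The following is a minimal sketch, not a definitive diagnostic: the class name CharsetProbe and the helper canEncode are hypothetical names introduced here, and it assumes the JDK in use ships the extended charsets (both ISO-2022-JP and x-windows-iso2022jp). It relies only on the standard CharsetEncoder.canEncode call. The character is written as a Unicode escape so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;

final class CharsetProbe {
    // Hypothetical helper: returns true if every character of `text`
    // has a mapping in the charset named `charsetName`.
    static boolean canEncode(String charsetName, String text) {
        // A fresh encoder is created per call, so no encoding operation is in progress.
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        final String z = "\u9AD9"; // the kanji from the question (U+9AD9)
        final String[] names = {"ISO-2022-JP", "x-windows-iso2022jp", "UTF-8"};
        for (String name : names) {
            System.out.println(name + " can encode U+9AD9: " + canEncode(name, z));
        }
    }
}
```

Given the behavior described in the question, ISO-2022-JP would be expected to report that it cannot encode the character. String.getBytes(Charset) silently substitutes the charset's replacement byte in that case, which is why the round trip yields "?".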
Try out this code:
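import java.nio.charset.Charset;

public class App {
    public static void main(String[] args) {
        final String z = "髙";
        // MS932-based variant of ISO-2022-JP, which has a mapping for U+9AD9
        final Charset charset = Charset.forName("x-windows-iso2022jp");
        // Round trip: encode to bytes, then decode with the same charset
        final byte[] byteArr = z.getBytes(charset);
        final String z2 = new String(byteArr, charset);
        System.out.println(z);
        System.out.println(z2);
    }
}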
Output using AdoptOpenJDK 11: