Tags: java, character-encoding

Unable to decode a single Japanese kanji character using character set iso-2022-jp


I am processing client emails written in Japanese. Some have HTML bodies encoded using character set iso-2022-jp. I found a strange scenario where I am unable to decode a single Japanese kanji character using character set iso-2022-jp.

Sample code to reproduce the issue:

final String z = "髙";
final Charset charset = Charset.forName("iso-2022-jp");  // java.nio.charset.Charset
final byte[] byteArr = z.getBytes(charset);              // encode
final String z2 = new String(byteArr, charset);          // decode
System.out.println(z);   // prints "髙"
System.out.println(z2);  // prints "?"

If I use charset "utf-8", it works fine.

To be clear, I am absolutely sure the character here is Unicode character U+9AD9. It is common in Japanese text, for example in the name of the Takashimaya department store: 髙島屋. The code above correctly encodes and decodes the other two characters, 島 and 屋.
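
One way to tell whether a charset can represent a character at all is to encode with a strict encoder. `String.getBytes(Charset)` silently substitutes the charset's replacement byte for unmappable characters, whereas a `CharsetEncoder` configured with `CodingErrorAction.REPORT` throws instead. The following is a minimal sketch; the class name `StrictEncodeCheck` and the helper `canEncodeStrictly` are illustrative and not part of the original post.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictEncodeCheck {

    // Hypothetical helper: returns true only if every character of text
    // can be encoded in the named charset without substitution.
    public static boolean canEncodeStrictly(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        try {
            charset.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when a character has no mapping in the charset.
            return false;
        }
    }

    public static void main(String[] args) {
        String taka = "\u9AD9";           // 髙, written as an escape to avoid source-encoding issues
        String shimaya = "\u5CF6\u5C4B";  // 島屋
        System.out.println(canEncodeStrictly(shimaya, "iso-2022-jp"));
        System.out.println(canEncodeStrictly(taka, "iso-2022-jp"));
        System.out.println(canEncodeStrictly(taka, "x-windows-iso2022jp"));
    }
}
```

Given the behavior described in this post, the check should succeed for 島屋 and fail for 髙 under iso-2022-jp, which would point at the charset's repertoire rather than at misuse of the encode/decode API.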

I am 99.99% sure I am using the decode/encode API incorrectly. What am I doing wrong?

Finally, I am debugging with IntelliJ 2020 on Windows 7 using the latest JDK 11. I also deal with Japanese text on a regular basis, so I know my fonts are set up correctly.

Update

Thank you for the very helpful comments.

  1. I did not notice this kanji is a "variant" of the more common 高 (U+9AD8). My fonts were too small to notice.
  2. The client office address is the Takashimaya building in Nihombashi, Tokyo. Thus, 髙島屋 appears in the email footer.
  3. It appears the original email was sent using a combination of Microsoft Outlook and Exchange. The HTML body has this head tag: <meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp"> Unbelievable!
  4. The workaround in Java is to override the charset declared in the MIME content type, decoding with x-windows-iso2022jp instead of iso-2022-jp. (Hat tip to Anish B!)
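
The override in item 4 can be sketched as a small lookup applied before decoding. This is only an illustration under assumptions: the class `MailCharsets` and the methods `resolve` and `decodeBody` are hypothetical names, and how the declared charset is extracted from the mail depends on the mail library in use.

```java
import java.nio.charset.Charset;

public class MailCharsets {

    // Hypothetical helper: maps the charset name declared by the mail
    // to the charset actually used for decoding.
    public static Charset resolve(String declaredName) {
        if ("iso-2022-jp".equalsIgnoreCase(declaredName)
                && Charset.isSupported("x-windows-iso2022jp")) {
            // Decode with the MS932-based variant so that vendor-extension
            // characters such as U+9AD9 survive.
            return Charset.forName("x-windows-iso2022jp");
        }
        // Throws an unchecked exception for null, illegal, or unsupported names.
        return Charset.forName(declaredName);
    }

    // Hypothetical helper: decodes a raw body using the resolved charset.
    public static String decodeBody(byte[] rawBody, String declaredName) {
        return new String(rawBody, resolve(declaredName));
    }

    public static void main(String[] args) {
        String original = "\u9AD9\u5CF6\u5C4B"; // 髙島屋
        // Simulate bytes as a Windows mail client might produce them.
        byte[] wire = original.getBytes(Charset.forName("x-windows-iso2022jp"));
        String decoded = decodeBody(wire, "iso-2022-jp");
        System.out.println(original.equals(decoded));
    }
}
```

One caveat with this design: the MS932-based variant may map a few symbols to different Unicode code points than strict iso-2022-jp does, so applying the override to every message is a trade-off rather than a guaranteed improvement.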

Solution

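  • From Oracle's Supported Encodings documentation:

    Use the x-windows-iso2022jp charset, which the documentation describes as "Variant ISO-2022-JP (MS932 based)". It can represent vendor-extension characters such as 髙 (U+9AD9) that plain iso-2022-jp cannot.

    Try this code: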

    import java.nio.charset.Charset;

    public class App {

        public static void main(String[] args) {
            final String z = "髙";
            // MS932-based variant of ISO-2022-JP
            final Charset charset = Charset.forName("x-windows-iso2022jp");
            final byte[] byteArr = z.getBytes(charset);      // encode
            final String z2 = new String(byteArr, charset);  // decode
            System.out.println(z);
            System.out.println(z2);
        }

    }
    

    Output using AdoptOpenJDK 11: [screenshot of the console output]