Tags: java, encoding, character-encoding, encode

Java charset decode issue


I'm trying to decode the character · using the GB2312 charset in Java.

As far as I can tell, this character is contained in GB2312, and its code is a1a4 (check here).

code:

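import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class DecodeTest {
    public static void main(String[] _args) throws Exception {
        String str="a1a4:· a5f6:ヶ a8c5:ㄅ";
        // getBytes() without an argument encodes with the platform default charset
        ByteBuffer bf=readToByteBuffer(new ByteArrayInputStream(str.getBytes()));
        // The resulting bytes are then decoded as GB2312
        System.out.println(Charset.forName("GB2312").decode(bf).toString());
    }
    private static final int bufferSize = 0x20000;
    // Reads the whole stream into memory and wraps the bytes in a ByteBuffer
    static ByteBuffer readToByteBuffer(InputStream inStream) throws IOException {
        byte[] buffer = new byte[bufferSize];
        ByteArrayOutputStream outStream = new ByteArrayOutputStream(bufferSize);
        int read;
        while (true) {
            read = inStream.read(buffer);
            if (read == -1)
                break;
            outStream.write(buffer, 0, read);
        }
        ByteBuffer byteData = ByteBuffer.wrap(outStream.toByteArray());
        return byteData;
    }
}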

The code above prints:

a1a4:? a5f6:ヶ a8c5:ㄅ

I don't understand why a1a4 can't be decoded.


Solution

  • In my browser, the character after a1a4: in your string is U+00B7 (MIDDLE DOT), not U+30FB (KATAKANA MIDDLE DOT). According to the same database you mentioned, U+00B7 is not available in the GB2312 character set. Likewise, you can see that neither MIDDLE DOT nor the value 0xB7 is listed as being part of GB2312.

    I think the problem here is with the characters in your input string, not with the CharsetDecoder provided by your JRE.
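
One way to check this yourself is to print the code points that actually ended up in the string and to ask a charset's encoder whether it can represent them. The following is only a minimal sketch and not part of the original code: the class name CharsetCheck and the helpers codePoints and canEncode are made up for illustration, and what it reports for GB2312 depends on the mapping tables shipped with your JRE.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class CharsetCheck {

    // Hypothetical helper: formats every code point of s as U+XXXX, space separated.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Hypothetical helper: asks the named charset's encoder whether it can encode c.
    public static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        // Unicode escapes avoid any dependence on the source file's encoding.
        char middleDot = '\u00B7';          // MIDDLE DOT
        char katakanaMiddleDot = '\u30FB';  // KATAKANA MIDDLE DOT

        System.out.println("Code points: " + codePoints("a1a4:" + middleDot));

        // GB2312 is an extended charset and may be missing from a trimmed-down runtime.
        if (!Charset.isSupported("GB2312")) {
            System.out.println("GB2312 is not available in this runtime");
            return;
        }

        System.out.println("GB2312 can encode U+00B7: " + canEncode("GB2312", middleDot));
        System.out.println("GB2312 can encode U+30FB: " + canEncode("GB2312", katakanaMiddleDot));

        // Decode the raw bytes A1 A4 directly to see which character the JRE maps them to.
        byte[] raw = {(byte) 0xA1, (byte) 0xA4};
        String decoded = Charset.forName("GB2312").decode(ByteBuffer.wrap(raw)).toString();
        System.out.println("Bytes A1 A4 decode to: " + codePoints(decoded));
    }
}
```

If the encoder reports that it cannot encode the character that is actually in your string, then the byte pair a1a4 was never produced for it in the first place, which would be consistent with the explanation above.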