python java unicode character-encoding

Why do Python and Java behave differently when decoding GBK?


This question is related to "How to decode bytes in GB18030 correctly".

I would like to decode a byte array that is encoded in GBK, but I found that Python and Java sometimes behave differently.

ch = b'\xA6\xDA'
print(ch.decode('gbk'))

In Python, this raises an error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 0: illegal multibyte sequence

Java, however, is able to decode it:

import java.nio.charset.Charset;

// The same two bytes that Python's 'gbk' codec rejects
byte[] data = {(byte) 0xA6, (byte) 0xDA};
String s = new String(data, Charset.forName("GBK"));
System.out.println(s);

It seems that Python and Java adopt different implementations for GBK, right?


Solution

  • It seems that Python and Java adopt different implementations for GBK, right?

    Yes. GBK is, to some extent, an ambiguous encoding, so different platforms may implement it differently.

    For Python (CPython here), the mapping from GBK to Unicode is defined in mappings_cn.h, a strict implementation of CP936 in which some code points in the non-Chinese regions (such as 0xA6DA) are not defined.

    In contrast, GBK in Java (OpenJDK 17 here) is in fact an extended CP936 that includes some extra non-Chinese characters in order to follow GB18030/MS936. For example, 0xA6DA is mapped to U+E78E, a code point in the Unicode Private Use Area.
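To see how a particular Python build treats a byte sequence, a small probe like the one below may help. It is only a sketch: `probe` is a hypothetical helper name, and trying 'gb18030' alongside 'gbk' is an assumption made for comparison, not something the answer above prescribes. The script reports either the decoded code points or the decoder's error reason, so the output shows what your interpreter actually does rather than what any table says it should do.

```python
def probe(data: bytes, codecs=("gbk", "gb18030")):
    """Try to decode `data` with each codec name (hypothetical helper).

    Returns a dict mapping codec name to either a string of code points
    such as "U+4E2D" or an "error: ..." message from the decoder.
    """
    results = {}
    for name in codecs:
        try:
            text = data.decode(name)
        except UnicodeDecodeError as exc:
            # exc.reason carries the decoder's explanation,
            # e.g. "illegal multibyte sequence"
            results[name] = f"error: {exc.reason}"
        else:
            results[name] = " ".join(f"U+{ord(c):04X}" for c in text)
    return results


# Probe the byte pair from the question with each codec in turn.
for codec, outcome in probe(b"\xA6\xDA").items():
    print(f"{codec}: {outcome}")
```

If 'gbk' reports an error while another codec succeeds, decoding with that other codec is one possible workaround, provided its mapping matches what the data's producer intended.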