Why do Python and Java behave differently when decoding GBK?

This question is related to "How to decode bytes in GB18030 correctly".

I would like to decode an array of bytes that is encoded in GBK, but I found that Python and Java sometimes behave differently. For example, in Python:

ch = b'\xA6\xDA'
print(ch.decode('gbk'))

It raises an error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 0: illegal multibyte sequence

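Java, however, decodes the same bytes without an error:

import java.nio.charset.Charset;

// The same two bytes that Python's 'gbk' codec rejects
byte[] data = {(byte) 0xA6, (byte) 0xDA};
String s = new String(data, Charset.forName("GBK"));
System.out.println(s);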

It seems that Python and Java adopt different implementations for GBK, right?

Solution

It seems that Python and Java adopt different implementations for GBK, right?

Yes. GBK is, to some extent, an ambiguous encoding, so different platforms may implement it differently.

In Python (CPython here), the mapping from GBK to Unicode is defined in mappings_cn.h, which is a strict implementation of CP936. Some code points in the non-Chinese regions (such as 0xA6DA) are not defined there, which is why decoding fails.

In contrast, GBK in Java (OpenJDK 17 here) is in fact an extended CP936 that includes some extra non-Chinese characters in order to follow GB18030/MS936. For example, 0xA6DA is mapped to U+E78E (although that code point lies in the Private Use Area).
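
To see which of Python's codecs accept these bytes, a small probe like the one below may help. It is only a sketch: try_decode is a hypothetical helper name, not part of any library, and the list of codecs to probe is my own choice. Whether gb18030 yields the same Private Use Area code point as Java depends on the CPython version, so the script prints what it finds rather than assuming a result.

```python
data = b"\xA6\xDA"


def try_decode(raw: bytes, encoding: str):
    """Hypothetical helper: return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# 'cp936' is an alias of 'gbk' in CPython, so both should behave the same way.
for name in ("gbk", "cp936", "gb18030"):
    text = try_decode(data, name)
    if text is None:
        print(f"{name}: cannot decode {data!r}")
    else:
        code_points = " ".join(f"U+{ord(c):04X}" for c in text)
        print(f"{name}: {code_points}")
```

If the bytes have to be read with the gbk codec regardless, passing errors="replace" to decode() substitutes U+FFFD for the undecodable bytes instead of raising.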