This question is related to "How to decode bytes in GB18030 correctly".
I want to decode a byte array that is encoded in GBK, but I found that Python and Java sometimes behave differently.
ch = b'\xA6\xDA'
print(ch.decode('gbk'))
It raises an error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 0: illegal multibyte sequence
Java, however, decodes the same bytes without error:
import java.nio.charset.Charset;

// The same two bytes as in the Python example above
byte[] data = {(byte) 0xA6, (byte) 0xDA};
String s = new String(data, Charset.forName("GBK"));
System.out.println(s);
It seems that Python and Java adopt different implementations for GBK, right?
It seems that Python and Java adopt different implementations for GBK, right?
Yes. GBK is an ambiguous encoding to some extent, so different platforms may implement it differently.
As for Python (CPython here), the mapping from GBK to Unicode is defined in mappings_cn.h, which is a strict implementation of CP936; some code points in the non-Chinese regions (such as 0xA6DA) are not defined there.
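To see which two-byte sequences under a given lead byte are rejected, one can simply probe the codec. This is a minimal sketch, assuming CPython's built-in 'gbk' codec behaves as described above; the helper name undecodable_gbk_pairs is hypothetical, and the trail-byte range (0x40 to 0xFE, skipping 0x7F) follows the usual GBK two-byte layout.

```python
def undecodable_gbk_pairs(lead: int) -> list[bytes]:
    """Return the two-byte sequences starting with `lead` that the
    'gbk' codec refuses to decode (hypothetical helper for illustration)."""
    missing = []
    for trail in range(0x40, 0xFF):
        if trail == 0x7F:  # 0x7F is not a valid GBK trail byte
            continue
        pair = bytes([lead, trail])
        try:
            pair.decode("gbk")
        except UnicodeDecodeError:
            missing.append(pair)
    return missing


# List the gaps under lead byte 0xA6, the row that contains 0xA6DA.
gaps = undecodable_gbk_pairs(0xA6)
print(len(gaps), [p.hex().upper() for p in gaps])
```

The exact set of gaps may differ between Python versions, so the output is best treated as a description of the interpreter at hand rather than of GBK itself.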
In contrast, GBK in Java (OpenJDK 17 here) is in fact an extended CP936 that includes some extra non-Chinese characters in order to follow GB18030/MS936. For example, 0xA6DA is mapped to U+E78E (although that code point lies in the Private Use Area).
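If the goal is simply to avoid the exception on the Python side, one possible workaround is to fall back to the 'gb18030' codec when strict 'gbk' decoding fails. The sketch below assumes that falling back is acceptable for your data; decode_gbk_lenient is a hypothetical helper, not part of any library, and the result is not guaranteed to match Java's output for every byte sequence.

```python
def decode_gbk_lenient(data: bytes) -> str:
    """Try strict GBK first, then GB18030, then GBK with replacement
    characters (hypothetical helper for illustration)."""
    for codec in ("gbk", "gb18030"):
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            continue
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return data.decode("gbk", errors="replace")


text = decode_gbk_lenient(b"\xA6\xDA")
# Show the resulting code points so they can be compared with Java's result.
print([f"U+{ord(c):04X}" for c in text])
```

Note that the fallback applies to the whole byte string, so input mixing valid GBK with other data may decode differently than expected; comparing the printed code points against the Java result is a reasonable sanity check.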