python encoding utf-8 endianness ucs2

Unicode endianness puzzled me


I created three files with gedit, named ok1, ok2, and ok3. Each contains the same character "你" ("you" in English), saved in a different encoding: GBK, UTF-8, and UCS-2 respectively.

>>> f1 = open('ok1', 'rb').read()
>>> f2 = open('ok2', 'rb').read()
>>> f3 = open('ok3', 'rb').read()
>>> f1
'\xc4\xe3\n'
>>> f2
'\xe4\xbd\xa0\n'
>>> f3
'`O\n\x00'
>>> hex(ord("`"))
'0x60'
>>> hex(ord("O")) 
'0x4f'

In fact, f3 starts with the bytes '\x60\x4f', but the following output confused me:

>>> '\xe4\xbd\xa0'.decode("utf-8")
u'\u4f60'
>>> '\xc4\xe3'.decode("gbk")
u'\u4f60'
>>> 

Why is there an endianness problem only in UCS-2 (or Unicode, so to speak), and not in UTF-8 or GBK?


Solution

  • UTF-8 and GBK store data as a sequence of single bytes. The encoding itself defines exactly which byte value comes after which, so this byte order does not change with the architecture used for encoding, transmission, or decoding.

    UCS-2 and its successor UTF-16, on the other hand, store data as a sequence of 2-byte tokens. The order of the individual bytes within each token is the endianness, and unless it is stated explicitly it typically follows the architecture of the machine that wrote the data. Systems must therefore agree on how to identify the endianness of the tokens (for example, with a byte order mark) before exchanging data encoded in UCS-2.

    In your case, the Unicode code point U+4F60 is encoded in UCS-2 as a single 2-byte token, 0x4F60. Since your machine is little-endian, it stores the least significant byte before the most significant one, so the byte sequence (0x60, 0x4F) was written to the file. Reading the file therefore yields the bytes in that order.

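    Python can still decode this data correctly: when the data carries no byte order mark, the 'utf-16' codec assumes the machine's native byte order (little-endian here) and pairs the bytes in that order when forming each 2-byte token: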

    >>> '`O\n\x00'.decode('utf-16')
    u'\u4f60\n'
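
The points above can be checked with a short script. This is a minimal sketch written for Python 3 (the session in the question uses Python 2 byte strings), and the variable names are only illustrative. It encodes U+4F60 with each encoding, shows that only the 16-bit form has two possible byte orders, and shows how a byte order mark or an explicit codec name removes the ambiguity.

```python
import sys

ch = "\u4f60"  # the character from the question

# Byte-oriented encodings: exactly one possible byte sequence each.
utf8 = ch.encode("utf-8")   # same bytes as f2, minus the newline
gbk = ch.encode("gbk")      # same bytes as f1, minus the newline

# 16-bit encodings: the same token 0x4F60 in two byte orders.
utf16_le = ch.encode("utf-16-le")  # 0x60 0x4F, as found in ok3
utf16_be = ch.encode("utf-16-be")  # 0x4F 0x60

# Decoding with the wrong byte order yields a different code point.
swapped = utf16_le.decode("utf-16-be")

# A byte order mark (here 0xFE 0xFF, big-endian) tells the plain
# 'utf-16' codec which order to use; the BOM itself is dropped.
with_bom = b"\xfe\xff" + utf16_be
decoded_with_bom = with_bom.decode("utf-16")

# Without a BOM, the plain 'utf-16' codec falls back to native order.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
decoded_native = ch.encode(native).decode("utf-16")

print(utf8, gbk, utf16_le, utf16_be)
print(hex(ord(swapped)), decoded_with_bom == ch, decoded_native == ch)
```

When the byte order of a file is known in advance, naming it explicitly ('utf-16-le' or 'utf-16-be') is usually safer than relying on the native-order fallback, since the result then does not depend on the machine doing the decoding.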