python encoding endianness utf-16

What are these hexadecimal bytes in the UTF-16 output?


print(bytes('ba', 'utf-16'))

Result :

b'\xff\xfeb\x00a\x00'

I understand that UTF-16 means every character takes 16 bits, i.e. 00000000 00000000 in binary. I can see 16 bits in a\x00: \x00 = 00000000 and a = 01100001, so together they give a\x00. That part is clear to me, but here is the confusion:

\xff\xfeb

1 - What is this?

2 - Why fe? It should be x00.

I have read a lot of Wikipedia articles, but it is still not clear.


Solution

  • You have:

    b'\xff\xfeb\x00a\x00'
    

    This is what you asked for. It encodes three characters, two bytes each:

    b'\xff\xfe' # 0xff 0xfe
    b'b\x00'    # 0x62 0x00
    b'a\x00'    # 0x61 0x00
    

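    The first is U+FEFF (the byte order mark, or BOM), the second is U+0062 (b), and the third is U+0061 (a). Each character is stored low byte first (little-endian), which is why U+FEFF appears as ff fe rather than fe ff, and why the zero byte comes after b and a rather than before them. The BOM is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.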

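    It is confusing to read only because Python prints bytes in the printable ASCII range as characters. The b and a in the output are the bytes 0x62 and 0x61, not hexadecimal digits, so \xfeb is the byte 0xfe followed by the byte 0x62.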

    If you don't want the BOM, you can use utf-16le or utf-16be.

    >>> bytes('ba', 'utf-16le')
    b'b\x00a\x00'
    >>> bytes('ba', 'utf-16be')
    b'\x00b\x00a'
    

    The catch is that without a BOM nothing in the data records the byte order, so you get garbage if you decode with the wrong one. If you use UTF-16 with a BOM, the decoder can detect the byte order, so you're more likely to get the right result when decoding.
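
The points above can be checked with a minimal sketch, assuming Python 3. It rebuilds the question's bytes with codecs.BOM_UTF16_LE so the result does not depend on the machine's native byte order. The variable names are illustrative only.

```python
import codecs

# The bytes from the question: a little-endian BOM followed by 'b' and 'a'.
data = codecs.BOM_UTF16_LE + 'ba'.encode('utf-16-le')

# Split into 2-byte code units and read each one as a little-endian integer.
units = [int.from_bytes(data[i:i + 2], 'little') for i in range(0, len(data), 2)]
unit_hex = [f'{u:04X}' for u in units]   # ['FEFF', '0062', '0061']

# The 'utf-16' codec reads the BOM to pick the byte order, then drops it.
with_bom = data.decode('utf-16')         # 'ba'

# Decoding the BOM-less payload with the wrong byte order swaps each byte pair.
payload = data[2:]                       # b'b\x00a\x00'
right = payload.decode('utf-16-le')      # 'ba'
wrong = payload.decode('utf-16-be')      # U+6200 U+6100, two unrelated CJK characters

print(unit_hex, with_bom, right, ascii(wrong))
```

The wrong decode raises no error here. It silently produces different characters, which is why keeping the BOM is the safer default when the reader may not know the byte order.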