print(bytes('ba', 'utf-16'))
Result :
b'\xff\xfeb\x00a\x00'
I understand utf-16 means every character will take 16 bits means 00000000 00000000
in binary and i understand there are 16 bits here x00a
means x00 = 00000000
and a = 01000001
so both gives x00a
it is clear to my mind like this but here is the confusion:
\xff\xfeb
1 - What is this ?????????
2 - Why fe
??? it should be x00
i have read a lot of wikipedia articles but it is still not clear
You have,
b'\xff\xfeb\x00a\x00'
This is what you asked for, it has three characters.
b'\xff\xfe' # 0xff 0xfe
b'b\x00' # 0x62 0x00
b'a\x00' # 0x61 0x00
The first is U+FEFF (byte order mark), the second is U+0062 (b), and the third is U+0061 (a). The byte order mark is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.
It is just confusing to read because the 'b'
and 'a'
look like they're hexadecimal digits, but they're not.
If you don't want the BOM, you can use utf-16le
or utf-16be
.
>>> bytes('ba', 'utf-16le')
b'b\x00a\x00'
>>> bytes('ba', 'utf-16be')
b'\x00b\x00a'
The problem is that you can get some garbage if you decode as the wrong one. If you use UTF-16 with BOM, you're more likely to get the right result when decoding.