I am working with a vendor's device that requires "unicode encoding" of strings, where each character is represented in two bytes. My strings will always be ASCII-based, so I thought this would be the way to translate my string into the vendor's string:
>>> b1 = 'abc'.encode('utf-16')
But examining the result, I see that there's a leading [0xff, 0xfe] on the resulting bytes:
>>> [hex(b) for b in b1]
['0xff', '0xfe', '0x61', '0x0', '0x62', '0x0', '0x63', '0x0']
Since the vendor's device is not expecting the [0xff, 0xfe], I can strip it off...
>>> b2 = 'abc'.encode('utf-16')[2:]
>>> [hex(b) for b in b2]
['0x61', '0x0', '0x62', '0x0', '0x63', '0x0']
... which is what I want.
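As far as I can tell, the endianness-specific codec gives the same bytes directly, with no prefix to strip:
>>> [hex(b) for b in 'abc'.encode('utf-16-le')]
['0x61', '0x0', '0x62', '0x0', '0x63', '0x0']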
But what really surprises me is that I can decode b1 and b2, and they both reconstitute to the original string:
>>> b1.decode('utf-16') == b2.decode('utf-16')
True
So my two intertwined questions:

1. What is the leading [0xff, 0xfe], and is it safe to strip it off before sending the string to the vendor's device?
2. Why do b1 and b2 both decode back to the original string, even though one has the prefix and the other doesn't?
This is the byte order mark (BOM). It's a prefix to a UTF-encoded document that indicates what endianness the document uses. It does this by encoding the code point 0xFEFF in the document's byte order - in this case, little-endian (least significant byte first). Anything trying to read it the other way around, as big-endian (most significant byte first), will read the first character as 0xFFFE, which is a code point that is specifically guaranteed never to be a valid character, informing the reader that it needs to error out or switch endianness for the rest of the document.
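You can see this in the REPL (a minimal check; the last result assumes a little-endian machine, since the generic codec uses native byte order):
>>> import codecs
>>> codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex()
('fffe', 'feff')
>>> 'abc'.encode('utf-16-le').hex()   # explicit byte order, no BOM
'610062006300'
>>> 'abc'.encode('utf-16-be').hex()   # explicit byte order, no BOM
'006100620063'
>>> 'abc'.encode('utf-16').hex()      # generic codec prepends a BOM in native order
'fffe610062006300'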