python unicode utf-8 utf-16 byte-order-mark

Length of a string in Python 3.5 with different encodings

I tried this in python to get the length of a string in bytes.

>>> s = 'a'
>>> s.encode('utf-8')
b'a'
>>> s.encode('utf-16')
b'\xff\xfea\x00'
>>> s.encode('utf-32')
b'\xff\xfe\x00\x00a\x00\x00\x00'
>>> len(s.encode('utf-8'))
1
>>> len(s.encode('utf-16'))
4
>>> len(s.encode('utf-32'))
8

UTF-8 uses one byte to store an ASCII character, as expected, but why does UTF-16 use 4 bytes? What exactly is len() measuring?

Solution

TL;DR:

UTF-8 : 1 byte 'a'
UTF-16: 2 bytes 'a' + 2 bytes BOM
UTF-32: 4 bytes 'a' + 4 bytes BOM

UTF-8 is a variable-length encoding: each code point is encoded in 1 to 4 bytes. It was designed to match ASCII for the first 128 code points, so an 'a' is a single byte wide.
UTF-16 is also a variable-length encoding: each code point is encoded with one or two 16-bit code units (i.e. 2 or 4 bytes), so an 'a' is 2 bytes wide.
UTF-32 is fixed-width: exactly 32 bits per code point, so every character, including an 'a', is 4 bytes wide.
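
To make these widths concrete, here is a small sketch that measures a few sample characters. It uses the 'utf-16-le' and 'utf-32-le' codecs, which fix the byte order and so emit no BOM. The sample characters and the names ENCODINGS and sizes are only illustrative choices.

```python
# Bytes needed per character in each encoding (no BOM with the -le codecs).
ENCODINGS = ('utf-8', 'utf-16-le', 'utf-32-le')
samples = ['a', '\xe9', '\u20ac', '\U0001F600']  # a, e-acute, euro sign, an emoji

sizes = {}
for ch in samples:
    sizes[ch] = tuple(len(ch.encode(enc)) for enc in ENCODINGS)
    print('U+%04X' % ord(ch), sizes[ch])

# U+0061 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1F600 (4, 4, 4)
```

Only the emoji, which lies outside the Basic Multilingual Plane, needs two 16-bit code units in UTF-16.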

For an 'a' encoded in UTF-8, UTF-16 and UTF-32, you might expect lengths of 1, 2 and 4 respectively. The actual results of 1, 4 and 8 are inflated because in the last two cases the output includes the BOM: that \xff\xfe prefix is the byte order mark, used to indicate the endianness of the data (little-endian here).
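
If you want the lengths without the BOM, one option is to name the byte order explicitly. This is a minimal sketch using Python's standard 'utf-16-le', 'utf-16-be', 'utf-32-le' and 'utf-32-be' codecs, which do not prepend a BOM; the variable names are just for illustration.

```python
s = 'a'

# Plain 'utf-16' / 'utf-32' prepend a BOM; the -le / -be variants do not.
codec_names = ['utf-16', 'utf-16-le', 'utf-16-be',
               'utf-32', 'utf-32-le', 'utf-32-be']
lengths = [(name, len(s.encode(name))) for name in codec_names]

for name, n in lengths:
    print('%-10s %d' % (name, n))

# utf-16     4
# utf-16-le  2
# utf-16-be  2
# utf-32     8
# utf-32-le  4
# utf-32-be  4
```

Note that plain 'utf-16' and 'utf-32' use the machine's native byte order when encoding, so on a big-endian machine the BOM bytes would appear as \xfe\xff instead of \xff\xfe. The lengths are the same either way.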

The Unicode standard permits the BOM in UTF-8, but neither requires nor recommends its use (byte order has no meaning for a byte-oriented encoding), which is why you don't see any BOM in the first example. The UTF-16 BOM is 2 bytes wide and the UTF-32 BOM is 4 bytes wide (in the little-endian output above it's just the same as the UTF-16 BOM, plus two padding nulls).
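
The BOM byte sequences are available as constants in the standard codecs module, and Python's 'utf-8-sig' codec is the one that does write a BOM for UTF-8. A short sketch to check the claims above (the variable names are illustrative):

```python
import codecs

print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF32_LE)  # b'\xff\xfe\x00\x00'

# Plain 'utf-8' writes no BOM; 'utf-8-sig' prepends the 3-byte UTF-8 BOM.
plain = 'a'.encode('utf-8')
signed = 'a'.encode('utf-8-sig')
print(len(plain), len(signed))  # 1 4

# The little-endian UTF-32 BOM is the UTF-16 one plus two null bytes.
same_prefix = codecs.BOM_UTF32_LE == codecs.BOM_UTF16_LE + b'\x00\x00'
print(same_prefix)  # True
```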

>>> 'a'.encode('utf-16')  # length 4: 2 bytes BOM + 2 bytes a
b'\xff\xfea\x00'
  BOM.....a....
>>> 'aaa'.encode('utf-16')  # length 8: 2 bytes BOM + 3*2 bytes of a
b'\xff\xfea\x00a\x00a\x00'
  BOM.....a....a....a....
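
The same arithmetic can be checked for other lengths. This sketch assumes strings made only of 'a' (or any other characters that fit in one 16-bit code unit), so the expected size is 2 bytes of BOM plus 2 bytes per character. It also shows that decoding with 'utf-16' consumes the BOM again.

```python
results = []
for n in (1, 3, 10):
    encoded = ('a' * n).encode('utf-16')
    # 2 bytes of BOM + 2 bytes per 'a'
    results.append((n, len(encoded), 2 + 2 * n))
    print(n, len(encoded))

# 1 4
# 3 8
# 10 22

# Decoding with the same codec strips the BOM, so the round trip is lossless.
round_trip = 'aaa'.encode('utf-16').decode('utf-16')
print(round_trip)  # aaa
```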

Seeing the BOM in the data might be clearer if you look at raw bits using the bitstring module:

>>> # pip install bitstring
>>> from bitstring import Bits
>>> Bits(bytes='a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> Bits(bytes='aaa'.encode('utf-32')).bin
'11111111111111100000000000000000011000010000000000000000000000000110000100000000000000000000000001100001000000000000000000000000'
 BOM.............................a...............................a...............................a...............................
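
If you would rather not install a third-party package, a similar bit string can be produced with the standard library alone. This is only a sketch: to_bits is a hypothetical helper name, and the example builds the little-endian bytes explicitly so the result does not depend on the machine's byte order.

```python
import codecs

def to_bits(data):
    """Return the bytes in data as a string of 0s and 1s, 8 bits per byte."""
    return ''.join(format(byte, '08b') for byte in data)

# Same bytes as 'a'.encode('utf-32') on a little-endian machine.
data = codecs.BOM_UTF32_LE + 'a'.encode('utf-32-le')
bits = to_bits(data)

print(bits[:32])  # the 4-byte BOM: ff fe 00 00
print(bits[32:])  # the 4-byte 'a': 61 00 00 00
```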