python unicode utf-8 utf-16 byte-order-mark

Length of a string in Python 3.5 with different encodings


I tried this in Python to get the length of a string in bytes:

>>> s = 'a'
>>> s.encode('utf-8')
b'a'
>>> s.encode('utf-16')
b'\xff\xfea\x00'
>>> s.encode('utf-32')
b'\xff\xfe\x00\x00a\x00\x00\x00'
>>> len(s.encode('utf-8'))
1
>>> len(s.encode('utf-16'))
4
>>> len(s.encode('utf-32'))
8

UTF-8 uses one byte to store an ASCII character, as expected, but why does UTF-16 use 4 bytes? What exactly is len() measuring?


Solution

  • TL;DR:

    UTF-8 : 1 byte 'a'
    UTF-16: 2 bytes 'a' + 2 bytes BOM
    UTF-32: 4 bytes 'a' + 4 bytes BOM
    
    • UTF-8 is a variable-length encoding: each code point is encoded in 1 to 4 bytes. It was designed to match ASCII for the first 128 code points, so an 'a' is a single byte wide.

    • UTF-16 is also a variable-length encoding: each code point is encoded as one or two 16-bit code units (i.e. 2 or 4 bytes). An 'a' is 2 bytes wide.

    • UTF-32 is fixed-width: exactly 32 bits per code point, so every character, including an 'a', is 4 bytes wide.
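To make the widths in the list above concrete, here is a minimal sketch that measures a few sample characters. The helper name `encoded_sizes` and the choice of sample characters are illustrative assumptions, not part of the original question. It uses the explicit little-endian codecs (`utf-16-le`, `utf-32-le`), which write no BOM, so only the character itself is counted.

```python
# Sample characters chosen to hit each UTF-8 width (1, 2, 3 and 4 bytes).
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_sizes(ch):
    """Return (utf-8, utf-16, utf-32) byte counts for one character, BOM excluded."""
    # The "-le" variants fix the byte order, so no BOM is prepended.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for ch in samples:
    print("U+{:04X}".format(ord(ch)), encoded_sizes(ch))
```

The 'a' comes out as 1, 2 and 4 bytes, while U+1F600 needs 4 bytes in all three encodings: in UTF-16 it is stored as two 16-bit code units (a surrogate pair).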

    For the lengths of an "a" encoded in UTF-8, UTF-16 and UTF-32, you might expect to see 1, 2 and 4 respectively. The actual results of 1, 4 and 8 are inflated because in the last two cases the output includes the BOM: that leading \xff\xfe is the byte order mark, used to indicate the endianness of the data.

    The Unicode Standard permits a BOM in UTF-8 but neither requires nor recommends it (byte order has no meaning for a byte-oriented encoding), which is why you don't see any BOM in the first example. The UTF-16 BOM is 2 bytes wide and the UTF-32 BOM is 4 bytes wide (in the little-endian form shown here, it is just the UTF-16 BOM followed by two padding nulls).
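If you want the byte count without the BOM, one option is to name the byte order explicitly. The sketch below uses only the standard library (`codecs` BOM constants and the `utf-16-le`, `utf-16-be` and `utf-8-sig` codecs). The variable names are illustrative. It assumes, as CPython does, that the plain `utf-16` codec writes its BOM in the machine's native byte order.

```python
import codecs
import sys

s = "a"

# Explicit byte order: the encoder writes no BOM.
le = s.encode("utf-16-le")   # b'a\x00'
be = s.encode("utf-16-be")   # b'\x00a'

# The plain "utf-16" codec prepends a BOM in the native byte order,
# which is why the question saw \xff\xfe on a little-endian machine.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE   # b'\xff\xfe'
else:
    native_bom = codecs.BOM_UTF16_BE   # b'\xfe\xff'

with_bom = s.encode("utf-16")
payload_len = len(with_bom) - len(native_bom)   # bytes used by 'a' alone

# UTF-8 gets a BOM (a 3-byte signature) only when requested via "utf-8-sig".
sig = s.encode("utf-8-sig")  # b'\xef\xbb\xbfa'

print(len(le), len(be), len(with_bom), payload_len, len(sig))
```

Decoding with the plain `utf-16` codec consumes the BOM again, so `with_bom.decode("utf-16")` round-trips to `'a'`.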

    >>> 'a'.encode('utf-16')  # length 4: 2 bytes BOM + 2 bytes a
    b'\xff\xfea\x00'
      BOM.....a....
    >>> 'aaa'.encode('utf-16')  # length 8: 2 bytes BOM + 3*2 bytes of a
    b'\xff\xfea\x00a\x00a\x00'
      BOM.....a....a....a....
    

    Seeing the BOM in the data might be clearer if you look at raw bits using the bitstring module:

    >>> # pip install bitstring
    >>> from bitstring import Bits
    >>> Bits(bytes='a'.encode('utf-32')).bin
    '1111111111111110000000000000000001100001000000000000000000000000'
    >>> Bits(bytes='aaa'.encode('utf-32')).bin
    '11111111111111100000000000000000011000010000000000000000000000000110000100000000000000000000000001100001000000000000000000000000'
     BOM.............................a...............................a...............................a...............................
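If installing bitstring is not an option, a similar bit dump can be sketched with the standard library alone. The helper name `to_bits` is hypothetical. The example builds the bytes from `codecs.BOM_UTF32_LE` plus the `utf-32-le` encoding so that the result does not depend on the machine's byte order.

```python
import codecs


def to_bits(data):
    """Render a bytes object as a string of 0s and 1s, 8 bits per byte."""
    return "".join(format(byte, "08b") for byte in data)


# Little-endian BOM followed by a little-endian 'a': 8 bytes in total.
data = codecs.BOM_UTF32_LE + "a".encode("utf-32-le")
bits = to_bits(data)

print(bits[:32])   # the 4-byte BOM
print(bits[32:])   # the 4-byte 'a'
```

Slicing at bit 32 separates the BOM from the character, matching the annotation under the bitstring output above.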