python, utf-8, byte

When creating bytes with the "b" prefix before a string, what encoding does Python use?


From the Python docs:

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

I know that I can create a bytes object with a b-prefixed expression like b'cool'; this will convert the Unicode string 'cool' into bytes. I'm also aware that a bytes instance can be created with the bytes() function, but then you need to specify the encoding argument: bytes('cool', 'utf-8').

From my understanding, I need to use one of the encoding rules if I want to translate a string into a sequence of bytes. I have done some experiments, and it seems the b prefix converts a string into bytes using UTF-8 encoding:

>>> a = bytes('a', 'utf-8')
>>> b'a' == a
True
>>> b = bytes('a', 'utf-16')
>>> b'a' == b
False

My question is: when creating a bytes object through the b prefix, what encoding does Python use? Is there any documentation that specifies this? Does it use UTF-8 or ASCII by default?


Solution

  • The bytes type can hold arbitrary data. For example, (the beginning of) a JPEG image:

    >>> with open('Bilder/19/01/IMG_3388.JPG', 'rb') as f:
    ...     head = f.read(10)
    

    You should think of it as a sequence of integers. That's also how the type behaves in many aspects:

    >>> list(head)
    [255, 216, 255, 225, 111, 254, 69, 120, 105, 102]
    >>> head[0]
    255
    >>> sum(head)
    1712
    

    For convenience (and for historical reasons, I guess), the standard representation of a bytes object and its literal syntax resemble those of strings:

    >>> head
    b'\xff\xd8\xff\xe1o\xfeExif'
    

    It uses ASCII printable characters where applicable, \xNN escapes otherwise. This is convenient if the bytes object represents text:

    >>> 'Zoë'.encode('utf8')
    b'Zo\xc3\xab'
    >>> 'Zoë'.encode('utf16')
    b'\xff\xfeZ\x00o\x00\xeb\x00'
    >>> 'Zoë'.encode('latin1')
    b'Zo\xeb'
    

    When you type a bytes literal, Python maps each character to its ASCII code; characters outside the ASCII range are not allowed and have to be written as escapes. Characters in the ASCII range are encoded the same way in UTF-8, which is why you observed that b'a' == bytes('a', 'utf8'). A less misleading comparison would be b'a' == bytes('a', 'ascii').
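
To make the ASCII point concrete, here is a minimal sketch (all variable names are made up for illustration). It checks that a bytes literal matches both the ASCII and the UTF-8 encoding of the same text, that a non-ASCII character inside a bytes literal is rejected as a syntax error, and that escaped bytes carry no encoding of their own:

```python
# Hypothetical names, for illustration only.
literal = b'cool'

# ASCII and UTF-8 agree on code points 0-127, so both reproduce the literal.
assert literal == 'cool'.encode('ascii') == 'cool'.encode('utf-8')

# A bytes object is a sequence of integers; these are the ASCII codes of c, o, o, l.
from_ints = bytes([99, 111, 111, 108])

# A non-ASCII character in a bytes literal is rejected when the source is compiled.
try:
    compile("b'Zoë'", '<example>', 'eval')
    non_ascii_allowed = True
except SyntaxError:
    non_ascii_allowed = False

# Values of 128 or greater must be written as escapes. The result is raw bytes;
# how they read as text depends on the codec you pick when decoding.
escaped = b'Zo\xc3\xab'
as_utf8 = escaped.decode('utf-8')      # 'Zoë'
as_latin1 = escaped.decode('latin-1')  # 'ZoÃ«'
```

So the b prefix does not really "use UTF-8"; it only accepts ASCII characters, and the match with UTF-8 is a consequence of UTF-8 being a superset of ASCII.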