From the Python docs:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
I know that I can create a bytes object with a b-prefixed expression like b'cool', which converts the unicode string 'cool' into bytes. I'm also aware that a bytes instance can be created with the bytes() function, but then you need to specify the encoding argument: bytes('cool', 'utf-8').
From my understanding, I need to use one of the encoding rules if I want to translate a string into a sequence of bytes. I have done some experiments, and it seems the b prefix converts a string into bytes using utf-8 encoding:
>>> a = bytes('a', 'utf-8')
>>> b'a' == a
True
>>> b = bytes('a', 'utf-16')
>>> b'a' == b
False
My question is: when creating a bytes object with the b prefix, what encoding does Python use? Is there any documentation that specifies this? Does it use utf-8 or ascii by default?
The bytes type can hold arbitrary data.
For example, (the beginning of) a JPEG image:
>>> with open('Bilder/19/01/IMG_3388.JPG', 'rb') as f:
...     head = f.read(10)
...
You should think of it as a sequence of integers. That's also how the type behaves in many aspects:
>>> list(head)
[255, 216, 255, 225, 111, 254, 69, 120, 105, 102]
>>> head[0]
255
>>> sum(head)
1712
For convenience (and, I guess, for historical reasons), the standard repr of a bytes object, like its literal syntax, looks similar to a string:
>>> head
b'\xff\xd8\xff\xe1o\xfeExif'
It uses printable ASCII characters where applicable, and \xNN escapes otherwise.
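To make the repr rule concrete, here is a minimal sketch that builds a bytes object from raw integers and inspects its repr. The byte values are chosen for illustration only and are not read from any real file.

```python
# Build a bytes object from raw integer values (illustrative values only).
data = bytes([0xFF, 0xD8, 0x45, 0x78, 0x69, 0x66])

# Printable ASCII values appear as characters, the rest as \xNN escapes.
text = repr(data)
print(text)  # b'\xff\xd8Exif'

# The repr round-trips: typing it back as a literal gives the same bytes.
same = (b'\xff\xd8Exif' == data)
print(same)  # True
```

The values 0x45, 0x78, 0x69 and 0x66 happen to be the ASCII codes of 'E', 'x', 'i' and 'f', which is why they are shown as letters.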
This is convenient if the bytes object represents text:
>>> 'Zoë'.encode('utf8')
b'Zo\xc3\xab'
>>> 'Zoë'.encode('utf16')
b'\xff\xfeZ\x00o\x00\xeb\x00'
>>> 'Zoë'.encode('latin1')
b'Zo\xeb'
When you type a bytes literal, Python interprets its characters as ASCII: each character stands for the byte with its ASCII code.
Characters in the ASCII range are encoded the same way in UTF-8, which is why you observed that b'a' == bytes('a', 'utf8') is true.
A slightly less misleading comparison would be b'a' == bytes('a', 'ascii').
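The points above can be checked with a short script. This is only a sketch: the helper name literal_is_valid is made up for this example, and it relies on compile() rejecting a bytes literal that contains a raw non-ASCII character.

```python
# An ASCII-only literal equals the ASCII encoding of the same text.
ascii_match = (b'cool' == 'cool'.encode('ascii'))

# UTF-8 agrees with ASCII for code points below 128, so this matches too.
utf8_match = (b'cool' == 'cool'.encode('utf-8'))

# Byte values of 128 or greater must be written as escapes in a literal.
escaped_match = (b'Zo\xc3\xab' == 'Zoë'.encode('utf-8'))


def literal_is_valid(source):
    """Return True if `source` compiles as a Python expression.

    Hypothetical helper, used here only to probe bytes literal syntax.
    """
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False


# A raw non-ASCII character inside a bytes literal is a syntax error.
ascii_ok = literal_is_valid("b'Zoe'")
non_ascii_ok = literal_is_valid("b'Zoë'")
print(ascii_match, utf8_match, escaped_match, ascii_ok, non_ascii_ok)
```

So no encoding needs to be picked for a bytes literal: only ASCII characters and escapes are allowed in it, and anything else is rejected before the code runs.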