
Difference between bytes() and b''


I have the following str:
"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"

This comes from a filename: Расшифровка_RootKit.com_63k.txt

My problem is that I cannot convert the first str back into the second one. I have tried a few things, using encode()/decode(), bytes(), etc., but did not manage it.

One thing I noticed is that b'' and bytes() produce different output:

path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
bpath = bytes(path, "UTF-8")
print(bpath.decode("UTF-8"))
print(b"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt".decode('utf8'))

Results:

РаÑÑиÑ
         Ñовка_RootKit.com_63k.txt
Расшифровка_RootKit.com_63k.txt

So I wonder what the difference between b'' and bytes() is; maybe it will help me solve my problem!


Solution

  • You may want to use the latin1 solution instead; look at that answer first. This answer applies if you accidentally copied the contents of a bytes object and pasted them as a string.

    To convert such a string back to bytes and decode it, use the following code:

    In [22]: path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
    
    In [23]: bytes(map(ord, path)).decode('utf-8')
    Out[23]: 'Расшифровка_RootKit.com_63k.txt'
    

    The explanation is quite simple. Let's use the first character of the string:

    In [40]: '\xd0'
    Out[40]: 'Ð'
    
    In [41]: b'\xd0'
    Out[41]: b'\xd0'
    

    As you can see, a str literal interprets \xd0 as the Unicode character with code point 0xd0 (U+00D0), while a bytes literal interprets it as a single byte with value 0xd0.
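The difference can be checked directly. The sketch below uses only the first two escapes from the question (the variable names are illustrative): bytes(s, "utf-8") encodes each code point of the str, so decoding the result simply returns the original str, whereas the b'' literal already holds the raw bytes.

```python
# The same escape sequence means different things in str and bytes literals.
s = "\xd0\xa0"    # two code points: U+00D0 and U+00A0
b = b"\xd0\xa0"   # two raw bytes: 0xD0 0xA0

# bytes(s, "utf-8") encodes each code point; here each one becomes two bytes.
encoded = bytes(s, "utf-8")
print(encoded)                # b'\xc3\x90\xc2\xa0'
print(len(encoded), len(b))   # 4 2

# Decoding what was just encoded is a plain round trip: the str is unchanged.
print(encoded.decode("utf-8") == s)   # True

# The bytes literal, decoded as UTF-8, gives the intended character.
print(b.decode("utf-8"))      # Р
```

This is why the first print in the question shows the same mis-decoded text, while the second one shows the real filename.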

    UTF-8 encodes every character between U+0080 and U+07FF in two bytes, using the pattern 110xxxxx for the first byte and 10xxxxxx for the second. This is exactly what you get when you encode that string to bytes directly:

    In [43]: [bin(x) for x in '\xd0'.encode('utf-8')]
    Out[43]: ['0b11000011', '0b10010000']
    

    The actual code point is the payload bits 00011 and 010000 joined together (concatenation, not addition), which gives 0xd0:

    In [44]: hex(int('00011010000', 2))
    Out[44]: '0xd0'
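
The same bit manipulation can be written out by hand. This is a minimal sketch that assumes a well-formed two-byte sequence and does not reject overlong encodings; decode_two_byte is a hypothetical helper name, not a standard-library function.

```python
def decode_two_byte(first: int, second: int) -> int:
    """Return the code point encoded by a two-byte UTF-8 sequence."""
    # The lead byte must match 110xxxxx and the continuation byte 10xxxxxx.
    if first & 0b11100000 != 0b11000000 or second & 0b11000000 != 0b10000000:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 payload bits of the
    # continuation byte, then concatenate them by shifting.
    return ((first & 0b00011111) << 6) | (second & 0b00111111)


first, second = "\xd0".encode("utf-8")       # 0xC3, 0x90
print(hex(decode_two_byte(first, second)))   # 0xd0
```

Applied to the raw bytes 0xD0 and 0xA0 from the question, the same function returns 0x420, the code point of the Cyrillic letter Р.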
    

    To get this number from a character we can use ord:

    In [45]: hex(ord('\xd0'))
    Out[45]: '0xd0'
    

    Then apply ord to every character of the string, build a bytes object from the resulting numbers, and decode it as UTF-8:

    In [46]: bytes(map(ord, path)).decode('utf-8')
    Out[46]: 'Расшифровка_RootKit.com_63k.txt'
    

    Note that if any character in your string has a code point above 255 and therefore does not fit in a single byte, the code above will raise an error:

    In [47]: bytes([ord(chr(256))])
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-49-5555e18dbece> in <module>
    ----> 1 bytes([ord(chr(256))])
    
    ValueError: bytes must be in range(0, 256)
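
When every character is at most U+00FF, encoding the string as Latin-1 gives the same bytes as bytes(map(ord, ...)), because Latin-1 maps code points 0 to 255 onto the identical byte values. The sketch below wraps the recovery in a function. The name recover_utf8 is hypothetical, and returning the input unchanged on failure is an assumption about the desired behavior, not something the original answer specifies.

```python
def recover_utf8(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly read as one code point each."""
    try:
        # Latin-1 maps code points 0..255 to the identical byte values,
        # so this matches bytes(map(ord, text)) for such strings.
        raw = text.encode("latin-1")
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character is above U+00FF or the bytes are not valid
        # UTF-8; assume the text was not mangled and return it unchanged.
        return text


path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
print(recover_utf8(path))            # Расшифровка_RootKit.com_63k.txt
print(recover_utf8("Расшифровка"))   # already correct, returned unchanged
```

Catching the encode error here replaces the ValueError shown above with a fallback, which may or may not be what you want; drop the try/except if mangled input should fail loudly.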