
How to decode a bytes object that contains invalid bytes in Python 3


In Python 2, I can easily produce these bytes as a string, '\x00\xaa\xff':

>>> '00'.decode('hex') + 'aa'.decode('hex') + 'ff'.decode('hex')
'\x00\xaa\xff'

Similarly, I can do this in Python 3:

>>> bytes.fromhex('00') + bytes.fromhex('aa') + bytes.fromhex('ff')
b'\x00\xaa\xff'

According to the "What's New In Python 3.0" notes on the Python 2 to Python 3 changes:

Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data.

So the Python 2 version produces a string, while the Python 3 version produces binary data of type bytes.

But I really need a string version!

According to the aforementioned doc:

As the str and bytes types cannot be mixed, you must always explicitly convert between them. Use str.encode() to go from str to bytes, and bytes.decode() to go from bytes to str. You can also use bytes(s, encoding=...) and str(b, encoding=...), respectively.
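As a minimal sketch of the conversions the quoted passage describes (the sample text and variable names are only for illustration), both spellings of each direction give the same result when the bytes really are valid for the chosen encoding:

```python
# Illustrative sample text; any str works here.
text = "h\u00e9llo"  # 'héllo'

# str -> bytes, two equivalent spellings
encoded = text.encode("utf-8")
encoded_alt = bytes(text, encoding="utf-8")

# bytes -> str, two equivalent spellings
decoded = encoded.decode("utf-8")
decoded_alt = str(encoded, encoding="utf-8")

print(encoded)  # b'h\xc3\xa9llo'
print(decoded == text, decoded_alt == text, encoded == encoded_alt)
```

The decode step only succeeds because these bytes are well-formed UTF-8; arbitrary binary data carries no such guarantee.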

Ok, so now I have to decode this binary data of type bytes…

>>> b'\x00\xaa\xff'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 1: invalid start byte

Oops! I don’t care about UTF-8 encodings here.

Can I just get a dummy pass-through codec?

PS

Why do I need '\x00\xaa\xff' instead of b'\x00\xaa\xff'?

Because I am passing this string into a CRC function written in pure Python:

crc16pure.crc16xmodem('\x00\xaa\xff')

This function expects to iterate over a string and treat each character as a byte. If I give it b'\x00\xaa\xff' instead, iterating yields integers rather than one-character strings, which the function cannot handle.
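To illustrate the difference, here is a rough sketch of a bitwise CRC-16/XMODEM routine (polynomial 0x1021, initial value 0x0000). It is a hypothetical stand-in written for this example, not the code of crc16pure.crc16xmodem, and the name crc16_xmodem_sketch is an assumption. It shows what iterating over str and bytes yields in Python 3, and one way a pure-Python loop could accept both:

```python
def crc16_xmodem_sketch(data, crc=0x0000):
    """Hypothetical CRC-16/XMODEM; accepts bytes or a str of chars <= 0xFF."""
    for item in data:
        # Iterating bytes yields ints; iterating str yields 1-char strings.
        byte = item if isinstance(item, int) else ord(item)
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# What each type yields when iterated in Python 3
bytes_items = list(b"\x00\xaa\xff")   # [0, 170, 255]
str_items = list("\x00\xaa\xff")      # ['\x00', 'ª', 'ÿ']

crc_from_bytes = crc16_xmodem_sketch(b"\x00\xaa\xff")
crc_from_str = crc16_xmodem_sketch("\x00\xaa\xff")
print(crc_from_bytes == crc_from_str)  # True
```

A function that calls ord() on every item unconditionally would raise TypeError on the integers coming from a bytes object, which is the kind of failure described above.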


Solution

  • The question: Can I just get a dummy pass-through codec?

    The answer: Yes, use iso-8859-1

    In Python 3, the following doesn't work:

    b'\x00\xaa\xff'.decode()
    

    The default codec, 'utf-8', can't decode byte 0xaa.

    As long as you don't care about the character set (that is, which character you see when you print()) and just want a string of 8-bit characters like the one you would get in Python 2, use the 8-bit codec iso-8859-1, which maps each byte value 0-255 to the Unicode code point with the same value:

    b'\x00\xaa\xff'.decode('iso-8859-1')
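As a quick check of the pass-through behaviour, the following sketch (variable names are illustrative) decodes every possible byte value with iso-8859-1 and confirms that each resulting character has the same code point as the original byte, and that encoding with the same codec restores the bytes unchanged:

```python
# All 256 possible byte values
raw = bytes(range(256))

# Decoding never fails: every byte maps to exactly one character
text = raw.decode("iso-8859-1")
code_points = [ord(ch) for ch in text]

# Encoding with the same codec restores the original bytes
round_tripped = text.encode("iso-8859-1")

print(code_points == list(range(256)))  # True
print(round_tripped == raw)             # True

# The example from the question
s = b"\x00\xaa\xff".decode("iso-8859-1")
print(s == "\x00\xaa\xff")              # True
```

The round trip only holds if the string is encoded back with iso-8859-1; encoding it as UTF-8 would turn each character above 0x7F into two bytes.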