Tags: linux, python-2.7, raspberry-pi, latin1, utf8-decode

Encoding / decoding special characters in Python 2.7


I am trying to do some reverse engineering on a Raspberry Pi. I am piping the output of a CAN analyzer to a Python script. My main problem is that the 'extended' ASCII characters are not displayed correctly in the end.

I am running the script as follows:

./candump blablabla | python test.py

The output of ./candump is a 'hex' string, e.g. "3631B043", which should be translated to "61°C" in this case. Since I'm reverse engineering, I don't know the encoding used; I only know that the degree symbol takes a single byte ("B0"). The same is true for the "ü" symbol ("FC").

After a lot of googling and experimenting in the Python interpreter on the Pi, I finally got the correct output. However, I have no clue why it works, and it stops working when I do the same thing in my Python script. Here is the attempt:

pi@raspberrypi /test/cant/can-test $ python
Python 2.7.3 (default, Mar 18 2014, 05:13:23)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> input = "3631B043"
>>> hex = input.decode("hex")
>>> len(hex)
4
>>> print hex
61▒C
>>> print hex.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 2: ordinal not in range(128)
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> print hex.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 2: ordinal not in range(128)
>>> sys.setdefaultencoding('latin1')
>>> print hex.encode('utf8')
61°C
>>>

Can someone explain the reasoning behind this, and why this approach no longer works when piping is used? Thanks.


Solution

  • >>> print '3631B043'.decode('hex').decode('iso-8859-1')
    61°C
    

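    The first decode converts the hex string to raw bytes. The second decode interprets those bytes as Latin-1 (also known as ISO-8859-1) and produces a Unicode string. At this point, you have a proper Unicode string, which can be encoded into any other encoding you need. This also explains the traceback in the question: in Python 2, calling encode('utf8') on a byte string first decodes it implicitly with the default ASCII codec, which fails on byte 0xB0.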
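
    The two decode steps above could be wrapped up for use in a script roughly as follows. This is only a sketch: the names hex_to_text and convert_lines are hypothetical, Latin-1 is assumed because it matches the two bytes shown in the question (the device may use a different single-byte encoding), and each input line is assumed to contain nothing but hex digits. binascii.unhexlify is used instead of decode('hex') so the same code runs on Python 2.7 and Python 3.

```python
import binascii


def hex_to_text(hex_string, encoding="latin-1"):
    """Turn a hex string such as "3631B043" into a Unicode string.

    The Latin-1 default is an assumption based on the question's data
    (0xB0 -> degree sign, 0xFC -> u-umlaut); the device may use another codec.
    """
    raw = binascii.unhexlify(hex_string.strip())  # hex digits -> raw bytes
    return raw.decode(encoding)                   # raw bytes -> Unicode


def convert_lines(lines, out, encoding="latin-1"):
    """Decode each hex line and write it to `out` as UTF-8 bytes.

    `out` must accept bytes (a file opened in binary mode, io.BytesIO, ...).
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        out.write(hex_to_text(line, encoding).encode("utf-8") + b"\n")
```

    In a script fed by a pipe, this might be called as convert_lines(sys.stdin, getattr(sys.stdout, "buffer", sys.stdout)) after importing sys. Writing explicit UTF-8 bytes keeps the result independent of the interpreter's default encoding, so sys.setdefaultencoding is not needed. Real candump output may carry extra fields that would have to be split off before decoding.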