Fix badly encoded character


I visited a website whose text contained the character ø, which has Unicode code point 248 (0xf8 in hexadecimal). The Python console confirms this:

>>> chr(248)
'ø'

But since I understand the text, I know that the character was decoded using the wrong encoding. It should be ř instead. Indeed, the Windows-1250 code page table confirms that the byte value 0xf8 maps to the character ř.

What conversions should I apply to fix the text encoding, that is, to transform ø into ř?

I cannot figure out the correct sequence of functions. I, quite brainlessly, tried both:

>>> chr(248).encode().decode('windows-1250')
'Ă¸'

and

>>> chr(248).encode('windows-1250').decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/encodings/cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 0: character maps to <undefined>

but, as you can see, neither of them worked.


Solution

  • If you want to fix badly decoded text, you have to find out which (bad) encoding was used to decode the text and which (good) encoding should have been used instead. Then undo the bad step by encoding the text back to bytes with the bad encoding, and decode those bytes with the good one. In code:

    'bad string'.encode('bad encoding').decode('good encoding')
    

    In this case, the bad encoding is ISO-8859-1 (also known by the alias Latin-1), so the correct fix is:

    >>> chr(248).encode('latin_1').decode('windows-1250')
    'ř'