python unicode utf-8 mojibake

Unicode characters Ú and É are displayed incorrectly as Ãš and Ã‰


I have a UTF-8 file with Spanish text, and some words with accent marks are displayed incorrectly in some software.

I believe my file is correct. For example, the name 'JESÚS' is encoded as 4A 45 53 C3 9A 53.

>>> b'\x4A\x45\x53\xC3\x9A\x53'.decode('utf-8')
'JESÚS'

C3 9A is the correct UTF-8 encoding of U+00DA (Ú), according to http://www.fileformat.info/info/unicode/char/00da/index.htm.

So why does some software render it incorrectly?


Solution

  • This is the result of the software decoding the file as Latin-1 instead of UTF-8. Each two-byte UTF-8 sequence is incorrectly decoded as two separate characters.

    >>> 'Ú'.encode('utf-8').decode('latin-1')
    'Ã\x9a'
    >>> 'É'.encode('utf-8').decode('latin-1')
    'Ã\x89'
    

    http://www.fileformat.info/info/unicode/char/9a/index.htm http://www.fileformat.info/info/unicode/char/89/index.htm

    Both U+009A and U+0089 are C1 control characters, so whether anything is displayed for them depends on the software.

    Moreover, repeating the incorrect encode-decode round trip corrupts the text even further:

    >>> 'Ú'.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
    'Ã\x83Â\x9a'
    

    UPDATE: If you are seeing actual š and ‰ (and not invisible control characters), the wrong encoding is Windows-1252.

    Windows-1252 matches ISO 8859-1 except in the range 0x80-0x9F, where it assigns printable characters to most code points instead of control characters.

    In Windows-1252, the bytes 0x9A and 0x89 correspond to the characters š (U+0161) and ‰ (U+2030): http://www.fileformat.info/info/unicode/char/0161/index.htm http://www.fileformat.info/info/unicode/char/2030/index.htm

    >>> 'Ú'.encode('utf-8').decode('Windows-1252')
    'Ãš'
    >>> 'É'.encode('utf-8').decode('Windows-1252')
    'Ã‰'
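
If the text has already been damaged this way and only the garbled string is available, the mis-decoding can usually be reversed: encode the garbled text back with the wrong codec, then decode the resulting bytes as UTF-8. The following is a minimal sketch, not a definitive fix. `repair_mojibake`, its parameters, and the `max_rounds` limit are hypothetical names chosen for illustration, and the sketch assumes the same wrong codec was applied at every round of corruption.

```python
def repair_mojibake(text, wrong_codec="cp1252", max_rounds=5):
    """Undo 'UTF-8 bytes decoded as wrong_codec' damage, possibly repeated.

    Hypothetical helper for illustration; assumes every round of damage
    used the same wrong codec.
    """
    for _ in range(max_rounds):
        try:
            # Recover the original bytes, then decode them correctly.
            candidate = text.encode(wrong_codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text is not (or is no longer) mojibake for this codec.
            break
        if candidate == text:
            # Nothing changed (e.g. pure ASCII), so stop.
            break
        text = candidate
    return text


# Single round of damage through Windows-1252.
damaged = "JESÚS".encode("utf-8").decode("cp1252")
print(repair_mojibake(damaged))  # JESÚS

# Two rounds of damage through Latin-1, as in the example above.
twice = "Ú".encode("utf-8").decode("latin-1").encode("utf-8").decode("latin-1")
print(repair_mojibake(twice, wrong_codec="latin-1"))  # Ú
```

Two caveats apply. Text that legitimately contains a sequence such as Ã© would also be "repaired", so this is best applied only to strings known to be damaged. Also, Python's `cp1252` codec leaves the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so a character such as Á (UTF-8 bytes C3 81) may not survive the round trip with that codec; in that case the function returns the text unchanged.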