Unicode characters Ú and É are displayed incorrectly as Ãš and Ã‰

I have a UTF-8 file with Spanish text, and some words with accent marks are displayed incorrectly in some software.

I believe my file is correct. For example, the name 'JESÚS' is encoded as 4A 45 53 C3 9A 53.

>>> b'\x4A\x45\x53\xC3\x9A\x53'.decode('utf-8')
'JESÚS'

C3 9A is the correct UTF-8 encoding for U+00DA (Ú), according to http://www.fileformat.info/info/unicode/char/00da/index.htm.

So why does some software render it incorrectly?

Solution

This is the result of the software decoding the file as Latin-1 (ISO 8859-1) instead of UTF-8. Each two-byte UTF-8 sequence is incorrectly decoded as two separate characters.

>>> 'Ú'.encode('utf-8').decode('latin-1')
'Ã\x9a'
>>> 'É'.encode('utf-8').decode('latin-1')
'Ã\x89'

In each pair, the second character (U+009A and U+0089) is a C1 control character, so it may or may not be displayed, depending on the software.
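
To check which of the resulting characters are control characters, here is a small sketch using the standard unicodedata module. The helper name describe_garbled is made up for this example; 'Cc' is the Unicode general category for control characters and 'Lu' is an uppercase letter.

```python
import unicodedata

def describe_garbled(original: str) -> list:
    """Return (code point, general category) for each character that
    results from decoding the UTF-8 bytes of `original` as Latin-1."""
    garbled = original.encode('utf-8').decode('latin-1')
    return [(f'U+{ord(ch):04X}', unicodedata.category(ch)) for ch in garbled]

for original in 'ÚÉ':
    print(original, describe_garbled(original))
# Ú [('U+00C3', 'Lu'), ('U+009A', 'Cc')]
# É [('U+00C3', 'Lu'), ('U+0089', 'Cc')]
```

So only the second character of each pair is invisible; the Ã is an ordinary letter and shows up everywhere.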

Moreover, repeating the incorrect encode-decode round trip corrupts the text even further:

>>> 'Ú'.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
'Ã\x83Â\x9a'
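
Because Latin-1 maps every byte to exactly one character, this kind of corruption can usually be reversed by running the round trip backwards. A minimal sketch, assuming the text was damaged only by "UTF-8 bytes decoded as Latin-1" (the function name undo_latin1_mojibake and the max_rounds limit are made up for this example):

```python
def undo_latin1_mojibake(text: str, max_rounds: int = 5) -> str:
    """Reverse repeated 'UTF-8 decoded as Latin-1' corruption.

    Each round re-encodes the text as Latin-1 to recover the bytes and
    decodes them as UTF-8. The loop stops as soon as that is no longer
    possible, which is taken as a sign that the text is already clean.
    """
    for _ in range(max_rounds):
        try:
            repaired = text.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not valid Latin-1 / UTF-8 any more: stop here
        if repaired == text:
            break  # pure ASCII, nothing left to undo
        text = repaired
    return text

print(undo_latin1_mojibake('Ã\x83Â\x9a'))  # Ú
print(undo_latin1_mojibake('JESÃ\x9aS'))   # JESÚS
```

This is a heuristic: text that legitimately contains a sequence such as 'Ã' followed by a control character would be "repaired" too, so it is safer to fix the decoding at the source when you can.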

UPDATE: If you are seeing actual š and ‰ (and not invisible control characters), the wrong encoding is Windows-1252.

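Windows-1252 agrees with ISO 8859-1 everywhere except the range 0x80-0x9F, where it assigns printable characters (such as š at 0x9A and ‰ at 0x89) to most bytes instead of control characters.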

>>> 'Ú'.encode('utf-8').decode('Windows-1252')
'Ãš'
>>> 'É'.encode('utf-8').decode('Windows-1252')
'Ã‰'
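
If you know what a word should look like and what you actually see, you can test which wrong decoder reproduces the garbling, and then reverse it. A rough sketch; the helper name find_wrong_decoders and the candidate list are assumptions for this example, and 'cp1252' is Python's codec name for Windows-1252:

```python
def find_wrong_decoders(expected: str, observed: str,
                        candidates=('latin-1', 'cp1252')) -> list:
    """Return the candidate encodings that turn the UTF-8 bytes of
    `expected` into the garbled string `observed`."""
    matches = []
    data = expected.encode('utf-8')
    for name in candidates:
        try:
            if data.decode(name) == observed:
                matches.append(name)
        except UnicodeDecodeError:
            # Some bytes (e.g. 0x8D) are undefined in Python's cp1252 codec.
            continue
    return matches

print(find_wrong_decoders('Ú', 'Ãš'))      # ['cp1252']
print(find_wrong_decoders('Ú', 'Ã\x9a'))   # ['latin-1']

# Once the wrong decoder is known, reverse it to recover the text.
print('JESÃšS'.encode('cp1252').decode('utf-8'))  # JESÚS
```

The lasting fix is still to make the software read the file as UTF-8; the reversal above only helps with text that has already been garbled.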