I have a UTF-8 file with Spanish text, and some words with accent marks are displayed incorrectly in some of the software.
I believe my file is correct. For example, the name 'JESÚS' is encoded as 4A 45 53 C3 9A 53:
>>> b'\x4A\x45\x53\xC3\x9A\x53'.decode('utf-8')
'JESÚS'
C3 9A is the correct UTF-8 encoding of U+00DA, according to http://www.fileformat.info/info/unicode/char/00da/index.htm.
So why does some software render it incorrectly?
This is the result of decoding with Latin-1 instead of UTF-8: the two-byte UTF-8 sequence is incorrectly decoded as two separate one-byte characters.
>>> 'Ú'.encode('utf-8').decode('latin-1')
'Ã\x9a'
>>> 'É'.encode('utf-8').decode('latin-1')
'Ã\x89'
http://www.fileformat.info/info/unicode/char/9a/index.htm
http://www.fileformat.info/info/unicode/char/89/index.htm
Both of these are control characters, so they may or may not be displayed, depending on the software.
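Since Latin-1 maps every byte value 0x00–0xFF to a code point, this kind of mojibake is lossless and can be reversed by round-tripping (a sketch, assuming the text was garbled exactly once):

```python
# UTF-8 bytes that were wrongly decoded as Latin-1 can be repaired:
# re-encoding as Latin-1 recovers the original bytes byte-for-byte,
# which can then be decoded as UTF-8 as originally intended.
garbled = 'JESÚS'.encode('utf-8').decode('latin-1')   # 'JESÃ\x9aS'
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)  # JESÚS
```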
Moreover, repeating incorrect encoding-decoding corrupts the text even further:
>>> 'Ú'.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
'Ã\x83Â\x9a'
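Each round of mis-decoding needs its own round of repair: undoing the doubly corrupted text above takes two encode/decode passes (a sketch):

```python
# Text corrupted twice (UTF-8 bytes decoded as Latin-1, re-encoded
# as UTF-8, decoded as Latin-1 again) needs two repair passes.
double = 'Ú'.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
once = double.encode('latin-1').decode('utf-8')    # 'Ã\x9a' -- still garbled
twice = once.encode('latin-1').decode('utf-8')     # 'Ú'
print(twice)  # Ú
```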
UPDATE: If you are seeing actual š and ‰ (and not invisible control characters), the wrong encoding is Windows-1252.
Windows-1252 is a superset of ISO 8859-1, with printable characters for 0x80-0x9f.
In Windows-1252, code points 0x9a and 0x89 correspond to the characters š and ‰:
http://www.fileformat.info/info/unicode/char/0161/index.htm
http://www.fileformat.info/info/unicode/char/2030/index.htm
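The difference shows up when decoding those trailing bytes on their own (Python accepts 'windows-1252' as an alias for its cp1252 codec):

```python
# Latin-1 maps 0x9a and 0x89 to invisible control characters;
# Windows-1252 maps them to the printable 'š' and '‰'.
assert b'\x9a'.decode('latin-1') == '\u009a'        # control character
assert b'\x9a'.decode('windows-1252') == 'š'        # U+0161
assert b'\x89'.decode('windows-1252') == '‰'        # U+2030
```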
>>> 'Ú'.encode('utf-8').decode('Windows-1252')
'Ú'
>>> 'É'.encode('utf-8').decode('Windows-1252')
'É'
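The same round-trip repair works for Windows-1252 mojibake, with one caveat: Python's cp1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the reverse trip can fail on some inputs (a sketch):

```python
# Repair text whose UTF-8 bytes were wrongly decoded as Windows-1252.
# Note: unlike Latin-1, cp1252 has undefined byte values, so
# encode('windows-1252') can raise UnicodeEncodeError for some inputs.
garbled = 'JESÚS'.encode('utf-8').decode('windows-1252')  # 'JESÃšS'
repaired = garbled.encode('windows-1252').decode('utf-8')
print(repaired)  # JESÚS
```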