I have a file that stores string data for i18n. Here is the raw data in hex:
70617573652E70726F636573732E6D73
673D20496D706F737369626C65206427
696E746572726F6D707265206C652070
726F6365737375732E204C6520737461
74757420646F697420EFBFBD74726520
27456E20636F7572732064276578EFBF
BD637574696F6E272E0D0A
In Base64:
cGF1c2UucHJvY2Vzcy5tc2c9IEltcG9zc2libGUgZCdpbnRlcnJvbXByZSBsZSBwcm9jZXNzdXMuIExlIHN0YXR1dCBkb2l0IO+/vXRyZSAnRW4gY291cnMgZCdleO+/vWN1dGlvbicuDQo=
If I try to decode it as UTF-8 using this tool, I get:
pause.process.msg= Impossible d'interrompre le processus. Le statut doit �tre 'En cours d'ex�cution'.
Note that 0D0A is simply ASCII \r\n.
I am expecting:
pause.process.msg= Impossible d'interrompre le processus. Le statut doit être 'En cours d'exécution'.
I am French, and I can confirm that both é and ê are encoded as EFBFBD in this data.
So my 'simple' question is: what is this encoding?
Note that UTF-8 encodes é as C3A9, and Latin-1 encodes it as E9.
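For reference, those byte values can be checked with a few lines of Python. This is only a quick sketch; the `encodings` dictionary is an illustrative name, not part of the file or of any tool mentioned here.

```python
# Encode é and ê in UTF-8 and Latin-1 and show the resulting bytes as hex.
encodings = {}
for ch in 'éê':
    utf8_hex = ch.encode('utf-8').hex().upper()
    latin1_hex = ch.encode('latin-1').hex().upper()
    encodings[ch] = (utf8_hex, latin1_hex)
    print(f'{ch}: UTF-8 {utf8_hex}, Latin-1 {latin1_hex}')
```

Neither encoding produces EFBFBD for either letter.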
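The text was decoded incorrectly at some point in the past, then encoded as UTF-8 again. EFBFBD is the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER), which a decoder substitutes for bytes it cannot decode, so the original é and ê are no longer present in the data.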
import unicodedata as ud
s = bytes.fromhex('''\
70617573652E70726F636573732E6D73
673D20496D706F737369626C65206427
696E746572726F6D707265206C652070
726F6365737375732E204C6520737461
74757420646F697420EFBFBD74726520
27456E20636F7572732064276578EFBF
BD637574696F6E272E0D0A''').decode('utf8')
print('Original: ', s)
print('Escapes: ', ascii(s))
print('character name:', ud.name('\ufffd'))
print()
# How this could happen
s = "pause.process.msg= Impossible d'interrompre le processus. Le statut doit être 'En cours d'exécution'.\r\n"
print('Correct: ', s)
s2 = s.encode('latin1').decode('utf8', errors='replace') # encoded, then decoded incorrectly
print('Bad decode:', s2)
print('Escapes: ', ascii(s2))
Output:
Original: pause.process.msg= Impossible d'interrompre le processus. Le statut doit �tre 'En cours d'ex�cution'.
Escapes: "pause.process.msg= Impossible d'interrompre le processus. Le statut doit \ufffdtre 'En cours d'ex\ufffdcution'.\r\n"
character name: REPLACEMENT CHARACTER
Correct: pause.process.msg= Impossible d'interrompre le processus. Le statut doit être 'En cours d'exécution'.
Bad decode: pause.process.msg= Impossible d'interrompre le processus. Le statut doit �tre 'En cours d'ex�cution'.
Escapes: "pause.process.msg= Impossible d'interrompre le processus. Le statut doit \ufffdtre 'En cours d'ex\ufffdcution'.\r\n"
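Because é and ê both collapsed to the same U+FFFD, the original letters cannot be recovered from the bytes alone; they have to be restored by hand or from an undamaged copy. As a rough sketch of how the damaged spots could be located, assuming the data is valid UTF-8 (`find_replacement_chars` is a hypothetical helper name, not a library function):

```python
def find_replacement_chars(data: bytes, context: int = 10):
    """Return (line_number, column, snippet) for each U+FFFD in UTF-8 data."""
    text = data.decode('utf8')  # assumption: the bytes are valid UTF-8
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line):
            if ch == '\ufffd':
                start = max(0, col - context)
                hits.append((lineno, col, line[start:col + context + 1]))
    return hits

data = bytes.fromhex(
    '70617573652E70726F636573732E6D73'
    '673D20496D706F737369626C65206427'
    '696E746572726F6D707265206C652070'
    '726F6365737375732E204C6520737461'
    '74757420646F697420EFBFBD74726520'
    '27456E20636F7572732064276578EFBF'
    'BD637574696F6E272E0D0A'
)

hits = find_replacement_chars(data)
for lineno, col, snippet in hits:
    print(f'line {lineno}, column {col}: {snippet!r}')

# Both accented letters decode to the same character, so the damage is lossy.
bad_e_acute = 'é'.encode('latin1').decode('utf8', errors='replace')
bad_e_circ = 'ê'.encode('latin1').decode('utf8', errors='replace')
assert bad_e_acute == bad_e_circ == '\ufffd'
```

The columns are zero-based character offsets within each line, and the snippet gives enough surrounding text to guess the intended word.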