
What is the encoding of this data? Is it corrupted?


I have a file that stores string data for i18n. Here is the raw data in hex:

70617573652E70726F636573732E6D73
673D20496D706F737369626C65206427
696E746572726F6D707265206C652070
726F6365737375732E204C6520737461
74757420646F697420EFBFBD74726520
27456E20636F7572732064276578EFBF
BD637574696F6E272E0D0A

The same data in Base64:

cGF1c2UucHJvY2Vzcy5tc2c9IEltcG9zc2libGUgZCdpbnRlcnJvbXByZSBsZSBwcm9jZXNzdXMuIExlIHN0YXR1dCBkb2l0IO+/vXRyZSAnRW4gY291cnMgZCdleO+/vWN1dGlvbicuDQo=

If I decode it as UTF-8 using this tool, I get:

pause.process.msg= Impossible d'interrompre le processus. Le statut doit �tre 'En cours d'ex�cution'.

Note that the trailing 0D0A is simply ASCII \r\n.

I expected:

pause.process.msg= Impossible d'interrompre le processus. Le statut doit être 'En cours d'exécution'.

I am French, and I can state that both the é and ê characters are encoded as EFBFBD in this data.

So my 'simple' question is: what is this encoding?

Note that UTF-8 encodes é as C3A9, and Latin-1 encodes it as E9.
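
These byte values can be double-checked with a few lines of Python. This is only a minimal sketch; the `encodings` dictionary is an illustrative name, not part of the original data or tooling:

```python
# Encode the two accented letters in the encodings mentioned above
# and collect the resulting bytes as upper-case hex strings.
encodings = {
    ch: (ch.encode("utf-8").hex().upper(), ch.encode("latin-1").hex().upper())
    for ch in "éê"
}

for ch, (utf8_hex, latin1_hex) in encodings.items():
    print(ch, "UTF-8:", utf8_hex, "Latin-1:", latin1_hex)
```

Neither encoding produces EFBFBD for either letter.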


Solution

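  • The bytes EF BF BD are the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER, not of é or ê. The text was decoded incorrectly at some point in the past (invalid bytes were replaced with U+FFFD), then encoded as UTF-8 again. So yes, the data is corrupted: the original accented characters are no longer in it.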

    import unicodedata as ud
    
    s = bytes.fromhex('''\
    70617573652E70726F636573732E6D73
    673D20496D706F737369626C65206427
    696E746572726F6D707265206C652070
    726F6365737375732E204C6520737461
    74757420646F697420EFBFBD74726520
    27456E20636F7572732064276578EFBF
    BD637574696F6E272E0D0A''').decode('utf8')
    
    print('Original:      ', s)
    print('Escapes:      ', ascii(s))
    print('character name:', ud.name('\ufffd'))
    print()
    
    # How this could happen: Latin-1 bytes are decoded as UTF-8,
    # and each invalid byte is replaced with U+FFFD
    s = "pause.process.msg= Impossible d'interrompre le processus. Le statut doit être 'En cours d'exécution'.\r\n"
    print('Correct:   ', s)
    s2 = s.encode('latin1').decode('utf8', errors='replace') # encoded, then decoded incorrectly
    print('Bad decode:', s2)
    print('Escapes:  ', ascii(s2))
    

    Output:

    Original:       pause.process.msg= Impossible d'interrompre le processus. Le statut doit �tre 'En cours d'ex�cution'.
    
    Escapes:       "pause.process.msg= Impossible d'interrompre le processus. Le statut doit \ufffdtre 'En cours d'ex\ufffdcution'.\r\n"
    character name: REPLACEMENT CHARACTER
    
    Correct:    pause.process.msg= Impossible d'interrompre le processus. Le statut doit être 'En cours d'exécution'.
    
    Bad decode: pause.process.msg= Impossible d'interrompre le processus. Le statut doit �tre 'En cours d'ex�cution'.
    
    Escapes:   "pause.process.msg= Impossible d'interrompre le processus. Le statut doit \ufffdtre 'En cours d'ex\ufffdcution'.\r\n"
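
Since both é and ê end up as the same U+FFFD, the damage can't be undone from this data alone. It has to be fixed at the source, by decoding the original bytes with the encoding they were actually written in. A minimal sketch of that, assuming the original file was Latin-1 (the question does not confirm this); `find_replacement_chars` is a hypothetical helper name, not a library function:

```python
# Both accented letters collapse to the same replacement character,
# so the damaged text no longer records which letter was there.
for ch in "éê":
    damaged = ch.encode("latin-1").decode("utf-8", errors="replace")
    print(repr(ch), "->", ascii(damaged), damaged.encode("utf-8").hex().upper())


def find_replacement_chars(data: bytes) -> list[int]:
    """Hypothetical helper: character offsets of U+FFFD in UTF-8 data."""
    text = data.decode("utf-8")
    return [i for i, c in enumerate(text) if c == "\ufffd"]


# Locate the damage in an already-corrupted line.
damaged_line = "Le statut doit \ufffdtre".encode("utf-8")
print(find_replacement_chars(damaged_line))

# Assumption: the source bytes were Latin-1. Decoding them with that
# encoding (instead of UTF-8) keeps the accents intact.
original = "être, exécution".encode("latin-1")
print(original.decode("latin-1"))
```

A check like `find_replacement_chars` is useful for flagging lines that need to be restored from an undamaged copy; it can't tell you which character was lost.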