Mojibake when reading JSON containing escaped unicode - wrongly decoded as Latin-1?


I have a JSON file that contains \u-escaped Unicode characters. When I read it in Python, however, the escaped characters seem to be decoded as Latin-1 rather than UTF-8. Calling .encode('latin-1').decode('utf-8') on the affected strings seems to fix this, but why is it happening, and is there a way to tell json.load that escape sequences should be read as UTF-8 rather than Latin-1?

JSON file message.json, which should contain a message composed of a "Grinning Face With Sweat" emoji:

{
    "message": "\u00f0\u009f\u0098\u0085"
}

Python:

>>> import json
>>> with open('message.json') as infile:
...     msg_json = json.load(infile)
... 
>>> msg_json
{'message': 'ð\x9f\x98\x85'}
>>> msg_json['message']
'ð\x9f\x98\x85'
>>> msg_json['message'].encode('latin-1').decode('utf-8')
'😅'

Setting the encoding parameter in open or json.load doesn't seem to change anything, since the JSON file is plain ASCII and the Unicode characters are escaped within it.


Solution

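  • What you have there is not the correct notation for the 😅 emoji; it really means "ð" followed by three C1 control characters (U+009F, U+0098, U+0085), so the translation you get is correct! (The \u... notation always names a Unicode code point, independent of any byte encoding.) The escapes in your file are the four UTF-8 bytes of the emoji, F0 9F 98 85, each written out as if it were a code point of its own.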

    The proper notation for 😅, Unicode U+1F605, in JSON (as in JavaScript) is the UTF-16 surrogate pair \ud83d\ude05. Use that in the JSON.

    {
        "message": "\ud83d\ude05"
    }
    

    If, on the other hand, your question is how to get the correct result from the wrong data, then yes, as the comments say, you may have to jump through some hoops to do that.
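
One possible shape for those hoops, as a minimal sketch rather than a definitive fix: parse the JSON normally, then walk the result and apply the encode('latin-1').decode('utf-8') round trip from the question to every string. The helper names fix_mojibake and fix_all are made up for this illustration and are not part of the json module. The sketch assumes that any string which survives the round trip was mis-encoded; a legitimate Latin-1 string that happens to form valid UTF-8 bytes would be altered too, so it may be safer to apply it only to fields known to be broken.

```python
import json

# The data from the question: each \u00XX escape is really one byte of the
# UTF-8 encoding of U+1F605, written as if it were a code point of its own.
broken = json.loads(r'{"message": "\u00f0\u009f\u0098\u0085"}')

# The correct JSON escape for U+1F605 is the UTF-16 surrogate pair.
correct = json.loads(r'{"message": "\ud83d\ude05"}')


def fix_mojibake(text):
    """Hypothetical helper: reinterpret Latin-1 code points as UTF-8 bytes.

    Returns the text unchanged when the round trip is not possible, which
    is the case for strings that were not mis-encoded in this way.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def fix_all(obj):
    """Hypothetical helper: apply fix_mojibake to every string in parsed JSON."""
    if isinstance(obj, str):
        return fix_mojibake(obj)
    if isinstance(obj, list):
        return [fix_all(item) for item in obj]
    if isinstance(obj, dict):
        return {fix_all(key): fix_all(value) for key, value in obj.items()}
    return obj  # numbers, booleans and None pass through untouched


repaired = fix_all(broken)
print(repaired["message"] == correct["message"])  # True
print(json.dumps(correct))  # {"message": "\ud83d\ude05"}
```

The last line also shows that json.dumps, with its default ensure_ascii=True, writes the emoji as the surrogate pair, so re-serialising the repaired data should produce a file with the correct escapes.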