Tags: javascript, node.js, encoding, character-encoding, iconv

Fixing Facebook JSON Encoding in Node.js


I'm trying to decode the JSON you get from Facebook when you download your data. I'm using Node.js. The data has lots of odd Unicode escapes that don't make sense as written. Example:

"messages": [
    {
      "sender_name": "Emily Chadwick",
      "timestamp_ms": 1480314292125,
      "content": "So sorry that was in my pocket \u00f0\u009f\u0098\u0082\u00f0\u009f\u0098\u0082\u00f0\u009f\u0098\u0082",
      "type": "Generic"
    }
]

This should decode as So sorry that was in my pocket 😂😂😂. Reading the file with fs.readFileSync(filename, "utf8") and parsing it gets me So sorry that was in my pocket ððð instead, which is mojibake.

Another question mentions that this is UTF-8 text that was wrongly treated as latin1, and that you can fix it by encoding to latin1 and then decoding as utf8. I tried to do that with:

import fs from 'fs';
import iconv from 'iconv-lite';
function readFileSync_fixed(filename) {
    var content = fs.readFileSync(filename, "binary");
    return iconv.decode(iconv.encode(content, "latin1"), "utf-8")
}
console.log(JSON.parse(readFileSync_fixed(filename)))

But I still get the mojibake version. Can anyone point me in the right direction? I'm unfamiliar with how iconv works in regard to this.


Solution

  • Solved... in a way. If there's a better way to do it, let me know.

    Here are the amended functions:

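    function readFacebookJson(filename) {
        // Read as UTF-8 as usual; the \uXXXX escapes are still plain ASCII text here
        const content = fs.readFileSync(filename, "utf8");
        return JSON.parse(content);
    }

    function fixEncoding(string) {
        // Turn the mis-decoded characters back into bytes, then decode those bytes as UTF-8
        return iconv.decode(iconv.encode(string, "latin1"), "utf8");
    }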
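
    If you would rather not depend on iconv-lite for this, the same round trip can probably be done with Node's built-in Buffer. This is only a sketch: fixEncodingWithBuffer is a name I made up, and it assumes every character in the input is in the range U+0000 to U+00FF, which is what the Facebook export appears to produce. It also shows what the parsed sample from the question looks like before and after.

    ```javascript
    // Hypothetical alternative to fixEncoding() that uses only Node's global Buffer.
    function fixEncodingWithBuffer(string) {
        // "latin1" maps each character U+0000..U+00FF to the single byte with the
        // same value; those bytes are then decoded as UTF-8.
        return Buffer.from(string, "latin1").toString("utf8");
    }

    // One emoji from the question's sample, written as JSON text with \u escapes.
    const parsed = JSON.parse('"\\u00f0\\u009f\\u0098\\u0082"');

    console.log(parsed.length);                 // 4 (four separate characters)
    console.log(fixEncodingWithBuffer(parsed)); // 😂
    ```

    Characters above U+00FF do not survive Buffer's latin1 encoding, so this should only be applied to strings that are known to be garbled.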
    

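    The problem wasn't readFileSync(); the conversion was being applied at the wrong stage. In the raw file the escapes are literal ASCII text (a backslash, a u and four hex digits), so re-encoding the file contents changes nothing. Only after JSON.parse() has run do they become the characters U+00F0, U+009F, U+0098 and U+0082, which are the UTF-8 bytes of 😂 read as latin1. So we read the file as utf8 as usual, parse it, and then do the latin1 encoding/decoding on the string properties of the parsed object, not on the whole file before it's parsed. I did this with a map():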

    const messages = readFacebookJson(filename).messages.map(message => {
        // Note: this modifies the original message object in place
        const toReturn = message;
        toReturn.sender_name = fixEncoding(toReturn.sender_name);
        // Not every message has a content field
        if (typeof message.content !== "undefined") {
            toReturn.content = fixEncoding(message.content);
        }
        return toReturn;
    });
    

    The drawback is that any string property you don't handle explicitly stays garbled, so make sure you know which properties contain text.
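
    One way to avoid missing properties might be to repair every string in the parsed data, not just named fields. The sketch below is not the method used above: fixString, fixAllStrings and parseFacebookJson are hypothetical names, it uses Node's built-in Buffer in place of iconv-lite, and it assumes every string in the export is garbled in the same way. Strings containing characters above U+00FF are left alone, since they cannot be latin1 mojibake.

    ```javascript
    // Hypothetical helper: undo the latin1/UTF-8 mix-up for a single string.
    function fixString(value) {
        // A character above U+00FF cannot come from a mis-decoded byte, so skip
        // such strings instead of corrupting them.
        for (let i = 0; i < value.length; i++) {
            if (value.charCodeAt(i) > 0xff) return value;
        }
        return Buffer.from(value, "latin1").toString("utf8");
    }

    // Hypothetical helper: walk parsed JSON and repair every string value.
    // Object keys are left as they are.
    function fixAllStrings(value) {
        if (typeof value === "string") return fixString(value);
        if (Array.isArray(value)) return value.map(fixAllStrings);
        if (value !== null && typeof value === "object") {
            const out = {};
            for (const [key, child] of Object.entries(value)) {
                out[key] = fixAllStrings(child);
            }
            return out;
        }
        return value; // numbers, booleans and null pass through unchanged
    }

    // Compact variant: let JSON.parse call the fix for each value via its reviver.
    function parseFacebookJson(text) {
        return JSON.parse(text, (key, value) =>
            typeof value === "string" ? fixString(value) : value);
    }
    ```

    With either variant, the file is still read with fs.readFileSync(filename, "utf8") first, for example fixAllStrings(JSON.parse(text)) or parseFacebookJson(text). A string that was never garbled but happens to contain only characters up to U+00FF (a correctly encoded "é", say) would be damaged by this, so it is only safe if the whole export is consistently mis-encoded.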