javascript facebook unicode instagram unicode-escapes

Parsing JSON with escaped unicode characters displays incorrectly


I have downloaded JSON data from Instagram that I'm parsing in Node.js and storing in MongoDB. The problem is that the escaped unicode characters in it don't render as the correct emoji on the client side.

For instance, here's a property from one of the JSON files I'm parsing and storing:

"title": "@mujenspirits is in the house!NEW York City \u00f0\u009f\u0097\u00bd\u00f0\u009f\u008d\u008e \nImperial Vintner Liquor Store"

The above example should display like this:

@mujenspirits is in the house!NEW York City 🗽🍎 Imperial Vintner Liquor Store

But instead looks like this:

@mujenspirits is in the house!NEW York City ð½ð Imperial Vintner Liquor Store

I found another SO question where someone had a similar problem. Their solution works for me in the console on a simple string, but when I use it together with JSON.parse I still get the same incorrect display. This is what I'm using now to parse the JSON files:

export default function parseJsonFile(filepath: string) {
  const value = fs.readFileSync(filepath)
  const converted = new Uint8Array(
    new Uint8Array(Array.prototype.map.call(value, (c) => c.charCodeAt(0)))
  )
  return JSON.parse(new TextDecoder().decode(converted))
}

For posterity, I found an additional SO question similar to mine. There wasn't a solution; however, one of the comments said:

The JSON files were generated incorrectly. The strings represent Unicode code points as escape codes, but are UTF-8 data decoded as Latin1

The commenter suggested encoding the loaded JSON to latin1 then decoding to utf8, but this didn't work for me either.

import buffer from 'buffer'

const value = fs.readFileSync(filepath)
const buffered = buffer.transcode(value, 'latin1', 'utf8')
return JSON.parse(buffered.toString())

I know pretty much nothing about character encoding, so at this point I'm shooting in the dark searching for a solution.


Solution

  • One option is to convert the unicode escape sequences to raw bytes before parsing the JSON; the utf8.js library can probably help with that.

    Alternatively, the solution you found should work, but only after deserializing the JSON: JSON.parse turns each unicode escape sequence into one character, and the character-to-byte conversion has to run on those characters rather than on the raw file contents. So you need to traverse the parsed object and apply the solution to each string.

    For example:

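    import fs from "fs";

    function parseJsonFile(filepath) {
      // Read the file as text. JSON.parse turns each \u00XX escape into one
      // character whose code point is the value of a single UTF-8 byte.
      const value = fs.readFileSync(filepath, "utf8");
      return decodeUTF8(JSON.parse(value));
    }
    
    function decodeUTF8(data) {
      if (typeof data === "string") {
        // Map each character back to the byte it stands for, then decode the
        // resulting byte sequence as UTF-8.
        const utf8 = new Uint8Array(
          Array.prototype.map.call(data, (c) => c.charCodeAt(0))
        );
        return new TextDecoder("utf-8").decode(utf8);
      }
    
      if (Array.isArray(data)) {
        return data.map(decodeUTF8);
      }
    
      // typeof null is "object", so rule it out before walking the entries.
      if (data !== null && typeof data === "object") {
        const obj = {};
        Object.entries(data).forEach(([key, value]) => {
          obj[key] = decodeUTF8(value);
        });
        return obj;
      }
    
      // Numbers, booleans, and null pass through unchanged.
      return data;
    }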
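
A more compact variant of the same idea, as a rough sketch assuming Node.js (where Buffer is a global): JSON.parse accepts a reviver callback that is called for every parsed value, so the per-string fix can be applied during parsing instead of in a separate traversal. The names fixMojibake and parseMisencodedJson are made up for this illustration, and the sketch assumes every string in the file was mis-encoded the same way. Like the traversal above, it only fixes values, not object keys.

```javascript
// Hypothetical helper: reinterpret each character as one Latin-1 byte and
// decode the resulting bytes as UTF-8. A string containing any character
// above U+00FF cannot be Latin-1 mojibake, so it is returned unchanged.
function fixMojibake(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xff) return text;
  }
  return Buffer.from(text, "latin1").toString("utf8");
}

// Hypothetical wrapper: the reviver sees each value after JSON.parse has
// already resolved the \uXXXX escapes into characters.
function parseMisencodedJson(jsonText) {
  return JSON.parse(jsonText, (key, value) =>
    typeof value === "string" ? fixMojibake(value) : value
  );
}

// The doubled backslashes keep the escapes literal in the JSON text,
// as they would be in the downloaded file.
const sample =
  '{"title": "NEW York City \\u00f0\\u009f\\u0097\\u00bd\\u00f0\\u009f\\u008d\\u008e"}';

console.log(parseMisencodedJson(sample).title); // prints "NEW York City 🗽🍎"
```

Plain ASCII strings come out unchanged, because ASCII bytes mean the same thing in Latin-1 and UTF-8.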