What encoding does Facebook use in the JSON files from its data export?


I've used the Facebook feature to download all my data. The resulting zip file contains metadata in JSON files. The problem is that Unicode characters in the strings in these JSON files are escaped in a weird way.

Here's an example of such a string:

"nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"

When I try to parse the string, for example with JavaScript's JSON.parse(), and print it out, I get:

"nejniÅ¾Å¡Ã­ bod: 0 mnm BenÃ¡tky\n"

While it should be

"nejnižší bod: 0 mnm Benátky\n"

I can see that \u00c5\u00be should somehow correspond to ž but I can't figure out the general pattern.

I've been able to figure out these characters so far:

'\u00c2\u00b0' : '°',
'\u00c3\u0081' : 'Á',
'\u00c3\u00a1' : 'á',
'\u00c3\u0089' : 'É',
'\u00c3\u00a9' : 'é',
'\u00c3\u00ad' : 'í',
'\u00c3\u00ba' : 'ú',
'\u00c3\u00bd' : 'ý',
'\u00c4\u008c' : 'Č',
'\u00c4\u008d' : 'č',
'\u00c4\u008f' : 'ď',
'\u00c4\u009b' : 'ě',
'\u00c5\u0098' : 'Ř',
'\u00c5\u0099' : 'ř',
'\u00c5\u00a0' : 'Š',
'\u00c5\u00a1' : 'š',
'\u00c5\u00af' : 'ů',
'\u00c5\u00be' : 'ž',

So what is this weird encoding? Is there any known tool that can correctly decode it?


Solution

  • Thanks to Jen's excellent question and Shawn's comment.

    Basically, Facebook seems to take each individual byte of the string's UTF-8 representation and export it to JSON as if each byte were an individual Unicode code point.

    What we need to do is take the last two hex digits of each six-character escape (e.g. c3 from \u00c3), concatenate the resulting bytes, and decode them as a UTF-8 string.
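
    To make the byte handling concrete, here is a minimal sketch for a single character, taking the escape for ž from the question. The variable names are made up for illustration, and it assumes every escape has the \u00XX form.

```ruby
# Illustrative input: the escape sequence for "ž" exactly as it appears
# in the export (single quotes keep the backslashes literal).
escaped = '\u00c5\u00be'

# Keep the last two hex digits of each \u00XX escape: ["c5", "be"].
hex_pairs = escaped.scan(/\\u00(\h{2})/).flatten

# Turn the hex digits into raw bytes, then label those bytes as UTF-8.
decoded = [hex_pairs.join].pack('H*').force_encoding('UTF-8')

puts decoded
```

    The two bytes c5 be are the UTF-8 encoding of U+017E, which is why the pair \u00c5\u00be stands for ž.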

    This is how I do it in Ruby (see gist):

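    require 'json'

    # Matches a run of \uXXXX escapes. The first group captures what precedes
    # the run (a plain character or a series of escaped backslashes) so that
    # an escaped backslash is not mistaken for the start of an escape.
    bytes_re = /((?:\\\\)+|[^\\])(?:\\u[0-9a-f]{4})+/

    txt = File.read('export.json').gsub(bytes_re) do |bad_unicode|
      # Rewrite \u00c3 as \xc3, let Ruby evaluate the literal into real bytes,
      # then re-serialize with #to_json and strip its surrounding quotes.
      # The eval input is limited to the escape run matched by bytes_re.
      $1 + eval(%Q{"#{bad_unicode[$1.size..-1].gsub('\u00', '\x')}"}).to_json[1...-1]
    end

    good_data = JSON.load(txt)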
    

    With bytes_re we catch all sequences of bad Unicode characters.

    Then, for each sequence, we replace '\u00' with '\x' (so \u00c3 becomes \xc3), wrap the result in double quotes, and let Ruby's built-in string parsing turn the \xc3\xbe... escapes into actual bytes. The #to_json method then either keeps those bytes as Unicode characters in the JSON or escapes them properly.

    The [1...-1] removes the quotes inserted by #to_json.

    I wanted to explain the code because the question is not Ruby-specific and the reader may use another language.
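
    For readers who would rather not rewrite the raw file, the same idea can be applied after parsing. The sketch below is an alternative to the regex approach above, not a description of it: it parses the JSON first, then re-encodes each string to ISO-8859-1 (which maps code points U+0000 to U+00FF onto the same byte values) and reinterprets the bytes as UTF-8. The helper names fix_mojibake and fix_all are hypothetical, and the sample string is the one from the question.

```ruby
require 'json'

# Hypothetical helper: undo the byte-per-code-point escaping on one string.
def fix_mojibake(str)
  # ISO-8859-1 maps U+0000..U+00FF to identical byte values, so encoding
  # to it recovers the original bytes, which are then relabelled as UTF-8.
  fixed = str.encode('ISO-8859-1').force_encoding('UTF-8')
  # If the bytes are not valid UTF-8, the string was not mangled this way.
  fixed.valid_encoding? ? fixed : str
rescue EncodingError
  # A character above U+00FF cannot be mangled output; leave it untouched.
  str
end

# Hypothetical helper: walk parsed JSON and fix every string, keys included.
def fix_all(value)
  case value
  when String then fix_mojibake(value)
  when Array  then value.map { |v| fix_all(v) }
  when Hash   then value.map { |k, v| [fix_all(k), fix_all(v)] }.to_h
  else value
  end
end

# Sample taken from the question; single quotes keep the escapes literal.
sample = '{"title": "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"}'

good_data = fix_all(JSON.parse(sample))
puts good_data['title']
```

    This avoids eval and the regex, at the cost of assuming that any string which happens to round-trip as valid UTF-8 really was mangled.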

    I guess somebody could also do it with a sufficiently ugly sed command.