python facebook unicode emoji

Untangle Facebook reaction encoding?


I've downloaded my Facebook data in JSON format and want to perform analysis on it. It contains segments like:

{
          "reaction": "\u00e2\u009d\u00a4",
          "actor": "..."
},

The reaction here is a heart. However, when I print it in Python, it comes out as those raw Unicode characters (â¤) rather than a heart.

Does anyone know if there's somewhere that contains all of Facebook's reaction encodings?


Solution

  • The heart emoji character is encoded in Unicode as U+2764. In UTF-8 encoding form, that character would be represented as a sequence of three bytes, 0xE2 0x9D 0xA4.

    Facebook is confusing the UTF-8 encoded bytes with the Unicode code point. Since it escapes the character using the \uxxxx format, it should emit \u2764. Instead, when building the escape sequence, it (incorrectly) takes the UTF-8 byte sequence and escapes each byte as though it were a complete character, which yields \u00e2\u009d\u00a4.

    (Unfortunately, there are still too many products that make similar errors in handling UTF-8 encoded strings. E.g., I've seen this in my car when music track details from a streaming service are displayed. Whenever you see garbage sequences of two, three or four characters from the Unicode Latin-1 Supplement block, that's what's going on.)

    You can find info about Facebook's reactions and the (nearest) corresponding Unicode emoji character at https://emojipedia.org/facebook.
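
The mix-up described above can usually be reversed in Python by turning each mis-escaped character back into the single byte it came from (Latin-1 maps code points U+0000 to U+00FF one-to-one onto bytes) and then decoding those bytes as UTF-8. The following is a minimal sketch, not a definitive fix: the helper names `fix_mojibake` and `fix_all` are made up for illustration, and it assumes every affected string in the export was damaged in the same way as the sample in the question.

```python
import json


def fix_mojibake(text: str) -> str:
    """Reverse UTF-8 bytes that were escaped as one code point per byte.

    Hypothetical helper: if the string cannot be round-tripped this way,
    it is assumed to be undamaged and is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def fix_all(obj):
    """Apply fix_mojibake to every string in a decoded JSON structure."""
    if isinstance(obj, str):
        return fix_mojibake(obj)
    if isinstance(obj, list):
        return [fix_all(item) for item in obj]
    if isinstance(obj, dict):
        return {fix_all(key): fix_all(value) for key, value in obj.items()}
    return obj  # numbers, booleans and None pass through untouched


# The fragment from the question, as it appears in the exported file.
raw = '{"reaction": "\\u00e2\\u009d\\u00a4", "actor": "..."}'
data = fix_all(json.loads(raw))

# Print the code point rather than the glyph, so this works on any console.
print(f"U+{ord(data['reaction']):04X}")  # → U+2764
```

One caveat with this approach: a string that legitimately consists of Latin-1 Supplement characters which also happen to form valid UTF-8 would be altered too, so it is worth spot-checking the results on real data.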