I've downloaded my Facebook data in JSON format and want to perform analysis on it. It contains segments like:
{
"reaction": "\u00e2\u009d\u00a4",
"actor": "..."
},
The reaction here is a heart. However, if I print it in Python, it comes out as the garbled characters â¤ (one character per escaped code point) rather than a heart.
Does anyone know if there's somewhere that contains all of Facebook's reaction encodings?
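A minimal reproduction of what I'm seeing (the JSON below mirrors the segment above; the real export is larger):

```python
import json

raw = '{"reaction": "\\u00e2\\u009d\\u00a4", "actor": "..."}'
data = json.loads(raw)
print(data["reaction"])  # prints mojibake, not a heart
```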
The heart emoji character is encoded in Unicode as U+2764. In the UTF-8 encoding form, that character is represented as a sequence of three bytes: 0xE2 0x9D 0xA4.
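You can verify those bytes in Python (a quick sanity check, nothing Facebook-specific):

```python
# U+2764 HEAVY BLACK HEART
heart = "\u2764"
print(heart.encode("utf-8"))  # the three-byte UTF-8 sequence E2 9D A4
```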
Facebook is confusing the UTF-8 encoding form with the Unicode code point. Since it escapes characters using the \uxxxx format, it should present the heart as \u2764. Instead, when formatting the escaped sequence, it (incorrectly) takes the UTF-8 byte sequence and re-interprets each byte as though it were a complete character.
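Because each UTF-8 byte was escaped as the character with the same numeric value, you can undo the damage in Python by re-encoding the mojibake string as Latin-1 and decoding the resulting bytes as UTF-8 (a common round-trip fix; the sample JSON here is the assumed shape from the question):

```python
import json

raw = '{"reaction": "\\u00e2\\u009d\\u00a4"}'
mojibake = json.loads(raw)["reaction"]  # one character per UTF-8 byte

# Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF, so encoding
# recovers the original UTF-8 byte sequence, which can then be decoded properly.
fixed = mojibake.encode("latin-1").decode("utf-8")
print(fixed)  # the heart, U+2764
```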
(Unfortunately, there are still too many products that make similar errors in handling UTF-8 encoded strings. E.g., I've seen this in my car when music track details from a streaming service are displayed. Whenever you see garbage sequences of two, three or four characters from the Unicode Latin-1 Supplement block, that's what's going on.)
You can find info about Facebook's reactions and the (nearest) corresponding Unicode emoji character at https://emojipedia.org/facebook.