unicode, whatsapp, emoji

WhatsApp Export: Is there a Unicode emoticon compression convention?


I'm migrating from Windows Phone to Android and found that I cannot migrate my WhatsApp chat history (at least not easily). So I tried the export-to-mail feature that WhatsApp offers. It works quite well, but in some messages (mainly old ones) the emoticons are not exported properly.

Looking at the UTF-8 byte representation, the emoticons that are not displayed each consist of only three bytes instead of the usual four.

Example:

08.09.2013 20:00:10: Name: 

is exported instead of

08.09.2013 20:00:10: Name: 💩👽💀

Looking into the hexadecimal representation of the emoticons (only):

0  1  2  3  4  5  6  7  8  ... Dump
ee 81 9a ee 84 8c ee 84 9c     î.šî„Œî„œ

instead of

0  1  2  3  4  5  6  7  8  9  a  b  ... Dump
f0 9f 92 a9 f0 9f 91 bd f0 9f 92 80     💩👽💀

So my question is: Is this just an error occurring during the WhatsApp export, or is it some form of compression to reduce the file size? If it is compression, is there a decoding algorithm that converts the "compressed" version into the normal one?


Solution

  • WhatsApp seems to be using the Softbank encoding of emoji in Unicode's Private Use Area, which was used on iOS versions 2–4 before emoji were standardised in their own Unicode block. For example:

    • 0xEE 0x84 0x8C is the UTF-8 encoding of U+E10C
    • U+E10C is a private use character that Softbank/Apple used to encode what would later be standardised as U+1F47D (👽).

    There is no official list of these mappings, nor is an algorithmic conversion possible (a mapping table must be built). However, you can easily find compilations of the Softbank/Unicode mappings to build a converter between them.
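As a rough illustration of the table-based conversion described above, here is a minimal Python sketch. The table contains only the three pairs that can be read off the hex dumps in the question (U+E05A, U+E10C, U+E11C); a usable converter would need a complete table taken from one of the compiled mapping lists. The names `SOFTBANK_TO_UNICODE` and `convert_softbank_emoji` are made up for this example.

```python
# Partial Softbank private-use -> standard Unicode table.
# Only the three pairs visible in the question's hex dumps are listed;
# a real converter needs the full mapping table.
SOFTBANK_TO_UNICODE = {
    0xE05A: 0x1F4A9,  # ee 81 9a -> f0 9f 92 a9 (pile of poo)
    0xE10C: 0x1F47D,  # ee 84 8c -> f0 9f 91 bd (alien)
    0xE11C: 0x1F480,  # ee 84 9c -> f0 9f 92 80 (skull)
}


def convert_softbank_emoji(text: str) -> str:
    """Replace known Softbank private-use code points with standard emoji."""
    # str.translate takes a dict keyed by code point; characters that
    # have no entry in the table are left unchanged.
    return text.translate(SOFTBANK_TO_UNICODE)


# The nine bytes from the first hex dump decode to three private-use characters.
exported = bytes.fromhex("ee819aee848cee849c").decode("utf-8")
print([f"U+{ord(ch):04X}" for ch in exported])

# After conversion, the UTF-8 bytes match the second hex dump.
converted = convert_softbank_emoji(exported)
print(" ".join(f"{b:02x}" for b in converted.encode("utf-8")))
```

To convert a whole export, the same function could be applied to the file contents after reading them as UTF-8. Private-use characters that are missing from the table pass through untouched, so gaps in the table are easy to spot afterwards.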