python whatsapp emoji

Python does not read Unicode U+FE0F


Suppose I export a WhatsApp chat as .txt and then read it with Python. It seems that Python does not read the right Unicode sequence for emoji containing \uFE0F. For example, the rainbow flag emoji 🏳️‍🌈 is U+1F3F3 U+FE0F U+200D U+1F308. However, if I read the file with Python, using the code below, the flag emoji is read as \U0001f3f3\u200d\U0001f308. Is there a problem with my code? Is the file exported by WhatsApp incorrect? Or is there some other reason that it behaves like this?

I want to write a program that finds all emoji in a chat. However, \U0001f3f3\u200d\U0001f308 is not an existing emoji sequence, so I currently get an error.

def showchat():
    f = open("MyChat.txt", "r")
    lines = f.readlines()
    for l in lines:
        print(l)
        print(str(l.encode('unicode-escape')))
    f.close()

Solution

  • It appears that WhatsApp exports its chat files in UTF-8. So you must set that encoding when you open the file:

    f = open("MyChat.txt", "r", encoding="utf-8")
    

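    It's possible that your Python installation already defaults to UTF-8, since you didn't get an error when your program attempted to read the file and the other code points came through intact. U+FE0F is VARIATION SELECTOR-16, an invisible code point that requests emoji presentation for the preceding character rather than representing a character of its own, so WhatsApp may simply not be writing it to the export. To find out, do a hex dump of the file and check whether it actually contains the bytes EF B8 8F, which is the UTF-8 encoding of U+FE0F.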
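If you would rather do that check from Python than with a hex dump tool, here is a minimal sketch. It assumes the export really is UTF-8, as suggested above. The helper names `count_vs16_bytes` and `describe` are made up for illustration. The first reads the file in binary mode, so no text decoding can interfere, and counts the raw byte sequence for U+FE0F. The second lists the code points in a string with their Unicode names.

```python
import unicodedata

# UTF-8 encoding of U+FE0F (VARIATION SELECTOR-16): b'\xef\xb8\x8f'
VS16_UTF8 = "\ufe0f".encode("utf-8")


def count_vs16_bytes(path):
    """Count raw UTF-8 occurrences of U+FE0F in a file (hypothetical helper).

    The file is opened in binary mode, so the result reflects what is
    actually on disk, independent of any text encoding settings.
    """
    with open(path, "rb") as f:
        return f.read().count(VS16_UTF8)


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The rainbow flag sequence as given in the question.
for code_point, name in describe("\U0001F3F3\uFE0F\u200D\U0001F308"):
    print(code_point, name)
```

Calling `count_vs16_bytes("MyChat.txt")` on the exported chat should then settle the question. A result of 0 would mean the selector is missing from the file itself and Python is reading it faithfully. A positive count would mean the bytes are there and something in the reading code is dropping them.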