python twitter unicode emoji python-re

Remove unicode encoded emojis from Twitter tweet


For a data science project I am tasked with cleaning up our Twitter data. The tweets contain Unicode-encoded emojis (and other stuff) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.

I am using Python's "re" module, and so far I have been able to remove "simple" Unicode characters like \u201c (left double quotation mark) with something like

text = re.sub(u'\u201c', '', text)

However, when I try to remove more complex sequences, for example

text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove french flag emoji

nothing happens, whether I prefix the string with 'u', 'r', or nothing at all. The Unicode sequence remains in the string.

EDIT: Thanks to @Shawn Shroyer's answer I found out that

text = re.sub(u'\\ud83d\\udcf8', '', text)

works fine! I just had to escape the backslashes. Now only my second problem remains (see below).

The second problem is that I don't want to specify every single emoji individually. I would like to remove them all in a much simpler fashion, but without removing ALL Unicode characters, because I need to retain things like \u2019 (right single quotation mark).


Solution

  • My suggestion would be to put the values you would like to remove in a list. The backslashes have to be escaped, either by doubling each one or by writing the patterns as raw strings (r"..."), so they do not need to be escaped. The 'ur' prefix only exists in Python 2.

    import re
    # Raw strings keep the backslashes, matching the escaped form that worked in the question's EDIT.
    to_remove_arr = [r"\ud83d\udcf8", r"\ud83c\uddeb\ud83c\uddf7"]
    pattern_str = "|".join(to_remove_arr)
    text = re.sub(pattern_str, "", text)
    

    Edit: the solution above removes specific Unicode characters. To remove all non-ASCII characters instead (note that this also drops characters such as \u2019):

    text = text.encode("ascii", "ignore").decode()
    

    Edit: to remove only emojis I found:

    # Compiled once; covers symbols/dingbats, pictographs/emoticons and transport symbols.
    RE_EMOJI = re.compile('[\U00002600-\U000027BF\U0001f300-\U0001f64F\U0001f680-\U0001f6FF]')

    def strip_emoji(text):
        return RE_EMOJI.sub('', text)
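
The ranges in strip_emoji do not include the regional-indicator letters that flags such as \ud83c\uddeb\ud83c\uddf7 are built from, and they only match real emoji characters, not surrogate halves stored as two separate code points. Below is a minimal sketch that handles both. It assumes Python 3 and assumes the tweets hold the surrogate pairs as two code points. The names join_surrogates and strip_emoji_and_flags and the extra flag range are illustrative, not part of the original answer.

```python
import re

# Assumed ranges: the three from the answer above, plus the regional-indicator
# letters (U+1F1E6..U+1F1FF) that flag emojis are composed of.
EMOJI_AND_FLAGS_RE = re.compile(
    "["
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F300-\U0001F64F"  # pictographs and emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "]+"
)


def join_surrogates(text):
    """Merge UTF-16 surrogate pairs stored as two code points into one character.

    An unpaired surrogate makes the decode step raise UnicodeDecodeError.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


def strip_emoji_and_flags(text):
    """Remove characters in the assumed ranges; other non-ASCII text is kept."""
    return EMOJI_AND_FLAGS_RE.sub("", join_surrogates(text))
```

Called on a tweet such as "Paris \ud83d\udcf8 \ud83c\uddeb\ud83c\uddf7 it\u2019s", this should drop the camera and the flag while leaving \u2019 in place. The ranges are a rough approximation of "emoji" and may need extending for newer symbols.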