For a data science project I am tasked with cleaning up our Twitter data. The tweets contain Unicode-escaped emojis (and other characters) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.
I am using the Python package "re", and so far I have been successful in removing "simple" Unicode escapes like \u201c (double quotation mark) with something like
text = re.sub(u'\u201c', '', text)
However, when I try to remove more complex structures, for example
text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove french flag emoji
nothing happens, no matter whether I prefix the string with 'u', 'r' or nothing at all. The Unicode escape remains in the string.
EDIT: Thanks to @Shawn Shroyer's answer I found out that
text = re.sub(u'\\ud83d\\udcf8', '', text)
works fine! I just had to escape the backslashes. Now only my second problem remains (see below).
The second problem is that I don't want to specify every single emoji individually. I would like to remove them all in a simpler way, but without removing ALL Unicode characters, because I need to retain things like \u2019 (single quotation mark).
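My suggestion would be to put the values you want to replace in a list and join them into one pattern. The backslashes have to be escaped so that each sequence is matched as literal text; passing a raw string through re.escape takes care of this. The 'ur' prefix is not an alternative: it exists only in Python 2, where it still interprets \u escapes.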
import re
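# Assumes the tweets contain the escape sequences as literal text (backslash, "u", hex digits)
to_remove_arr = [r"\ud83d\udcf8", r"\ud83c\uddeb\ud83c\uddf7"]
# re.escape doubles each backslash so the regex engine matches it literally
pattern_str = "|".join(re.escape(s) for s in to_remove_arr)
text = re.sub(pattern_str, "", text)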
Edit: the above solution removes only the specific characters listed. To remove all non-ASCII characters:
text = text.encode("ascii", "ignore").decode()
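As a quick check of what this does, here is a minimal sketch on a made-up sample string (the sample text and the name ascii_only are just for illustration). Note that it drops every non-ASCII character, so the \u2019 quotation mark the question wants to keep disappears along with the emoji.

```python
# Sample text with a right single quotation mark (U+2019) and a camera emoji (U+1F4F8)
text = "It\u2019s a photo \U0001F4F8"

# Characters that cannot be encoded as ASCII are silently dropped
ascii_only = text.encode("ascii", "ignore").decode("ascii")

print(repr(ascii_only))
```

So this is only suitable when losing all non-ASCII characters is acceptable.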
Edit: to remove only emojis I found:
def strip_emoji(text):
    # Ranges: miscellaneous symbols and dingbats, pictographs and emoticons, transport and map symbols
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)
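Two things to watch out for with this pattern, sketched below for Python 3. First, flag emojis such as the French flag are made of regional indicator symbols (U+1F1E6 to U+1F1FF), which lie outside the three ranges, so the sketch adds that range; the list is still not exhaustive. Second, if the tweets hold the escapes as literal text, as the escaped-backslash fix in the question suggests, a character-range pattern never matches; one possible approach is to remove every escaped surrogate pair, which leaves single escapes like \u2019 alone but also misses emojis written as a single escape. The names EMOJI_CHARS, ESCAPED_SURROGATE_PAIR, strip_emoji_chars and strip_escaped_pairs are hypothetical.

```python
import re

# Ranges from the pattern above plus the regional indicator symbols
# (U+1F1E6 to U+1F1FF) that flag emojis are built from. Not exhaustive.
EMOJI_CHARS = re.compile(
    "[\U00002600-\U000027BF"
    "\U0001F1E6-\U0001F1FF"
    "\U0001F300-\U0001F64F"
    "\U0001F680-\U0001F6FF]+"
)

# Assumption: the text holds the escapes as literal characters. Matches a
# high-surrogate escape directly followed by a low-surrogate escape.
ESCAPED_SURROGATE_PAIR = re.compile(
    r"\\ud[89ab][0-9a-f]{2}\\ud[c-f][0-9a-f]{2}", re.IGNORECASE
)


def strip_emoji_chars(text):
    """Remove decoded emoji characters that fall in the ranges above."""
    return EMOJI_CHARS.sub("", text)


def strip_escaped_pairs(text):
    """Remove escaped surrogate pairs; single escapes are left untouched."""
    return ESCAPED_SURROGATE_PAIR.sub("", text)


# Decoded characters: the flag and the camera go, the quotation mark stays
print(repr(strip_emoji_chars("\U0001F1EB\U0001F1F7 it\u2019s \U0001F4F8")))

# Literal escapes (raw string): only the surrogate pair is removed
print(repr(strip_escaped_pairs(r"it\u2019s \ud83d\udcf8")))
```

Which of the two functions applies depends on whether the tweet text has already been decoded into real characters or still contains the backslash sequences.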