I have a dataframe which contains a lot of different emojis and I want to remove them. I looked at answers to similar questions but they didn't work for me.
index| messages
----------------
1 |Hello! 👋
2 |Good Morning 😃
3 |How are you ?
4 | Good 🙂
5 | Ländern
Now I want to remove all these emojis from the DataFrame so it looks like this:
index| messages
----------------
1 |Hello!
2 |Good Morning
3 |How are you ?
4 | Good
5 |Ländern
I tried the solution here, but unfortunately it also removes all non-English letters like "ä". How can I remove emojis from a dataframe?
This solution keeps all ASCII and Latin-1 characters, i.e. characters between U+0000 and U+00FF in this list. For extended Latin plus Greek, use < 1024:
import pandas as pd
df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋']})
# Keep only code points below 256 (ASCII and Latin-1)
filter_char = lambda c: ord(c) < 256
df['messages'] = df['messages'].apply(lambda s: ''.join(filter(filter_char, s)))
Result:
messages
0 Länder
1 Hello!
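To make the cutoff easier to change, the same filter can be written as a small function with the limit as a parameter. This is only a sketch: the name `keep_below` and the sample strings are made up for illustration, and it uses plain lists instead of a DataFrame to stay self-contained.

```python
def keep_below(text, limit=256):
    """Keep only characters whose code point is below `limit`."""
    return "".join(c for c in text if ord(c) < limit)

samples = ["Länder 🇩🇪❤️", "Hello! 👋", "Ελλάδα 😃"]

# limit=256 keeps ASCII and Latin-1 only, so the Greek word is dropped too
latin1 = [keep_below(s).strip() for s in samples]

# limit=1024 (U+0400) also keeps extended Latin and Greek
greek = [keep_below(s, limit=1024).strip() for s in samples]
```

On a DataFrame this would be applied the same way as above, e.g. `df['messages'].apply(keep_below, limit=1024)`.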
Note that this does not work for Japanese text, for example, since those characters lie above the cutoff. Another problem is that the heart "emoji" is actually a Dingbat (U+2764), so I can't simply filter for the Basic Multilingual Plane of Unicode, oh well.
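If Japanese or other non-Latin text has to survive, one possible alternative (not the approach above, just a sketch) is to filter by Unicode general category instead of by code point. It assumes that dropping every character in category "So" (Symbol, other) is acceptable, which also removes things like © and °, and it will miss emojis built from other categories such as keycaps. The name `strip_symbols` and the sample strings are made up for illustration.

```python
import unicodedata

# Zero-width joiner and variation selector-16, used inside emoji sequences
EMOJI_GLUE = {"\u200d", "\ufe0f"}

def is_emoji_like(c):
    """Rough heuristic: 'Symbol, other', skin-tone modifiers, or emoji glue."""
    return (
        unicodedata.category(c) == "So"
        or 0x1F3FB <= ord(c) <= 0x1F3FF  # skin-tone modifiers (category Sk)
        or c in EMOJI_GLUE
    )

def strip_symbols(text):
    """Remove emoji-like characters and trim leftover whitespace."""
    return "".join(c for c in text if not is_emoji_like(c)).strip()

samples = ["Länder 🇩🇪❤️", "Hello! 👋", "こんにちは 😃"]
cleaned = [strip_symbols(s) for s in samples]
```

The heart is caught here because U+2764 has category "So", and the Japanese text is kept because its letters have category "Lo". On a DataFrame this would be `df['messages'].apply(strip_symbols)`.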