I am trying to extract all possible emoticons from a Unicode word list.
I am using Python 3 with an Anaconda installation, therefore I cannot use a package such as emoji.py.
Here is a sample bag-of-words list:
lst = ['✅','türkçe','Çile','ısp','İst','ğ','some','#','@','@one','#thing','','1','41','ç','ö','⏱','⏱','👏','₺','€',':)',':/']
Expected output is like this:
out = ['✅','⏱', '⏱','👏']
A list comprehension to keep only words that are not pure ASCII:
[w for w in lst if len(w) != len(w.encode())]
However, this does not give the desired output, because the text also contains non-ASCII letters, and currency symbols are not emoticons either:
['✅', 'türkçe', 'Çile', 'ısp', 'İst', 'ğ', 'ç', 'ö', '⏱', '⏱', '👏', '₺', '€']
Using the NLTK emoticon regular expression:
from nltk.tokenize.casual import EMOTICON_RE
EMOTICON_RE.findall(' '.join(lst))
However, EMOTICON_RE can only extract expressions such as :), :/ and :(.
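For reference, on the joined sample string it only picks up the ASCII smileys (assuming a recent NLTK; the exact result may vary slightly between versions):

EMOTICON_RE.findall(' '.join(lst))
# roughly [':)', ':/'] -- the Unicode symbols are not matched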
Here is the list of what I am considering to be emoticons.
I tried to build a list of emoticons so that I could check whether each word exists in that list, but I could not build such a list from the Unicode character codes.
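Roughly, the attempt looked like the sketch below; the block ranges are hand-picked examples I chose for illustration, not a complete emoji list, which is exactly where I got stuck:

# a few well-known emoji blocks; NOT a complete list
EMOJI_RANGES = [
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs (includes 👏)
    (0x1F600, 0x1F64F),  # Emoticons
    (0x2600, 0x27BF),    # Miscellaneous Symbols and Dingbats (includes ✅)
]

def looks_like_emoji(word):
    # keep a word if any of its characters falls inside one of the ranges
    return any(lo <= ord(c) <= hi for c in word for lo, hi in EMOJI_RANGES)

[w for w in lst if looks_like_emoji(w)]
# ['✅', '👏'] -- misses '⏱' (U+23F1, Miscellaneous Technical),
# so the range list would have to be maintained by hand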
Can you please suggest an approach?
I think that all of those characters are in the Symbol, Other ('So') Unicode category. Therefore you can do:
import unicodedata

[w for w in lst if any(c for c in w if unicodedata.category(c) == 'So')]
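Written a bit more directly, and checked against the sample lst from the question:

[w for w in lst if any(unicodedata.category(c) == 'So' for c in w)]
# ['✅', '⏱', '⏱', '👏']

Note that the currency symbols '₺' and '€' are in category 'Sc' (Symbol, currency) rather than 'So', so they are excluded automatically.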