Tags: python, unicode, nltk, emoticons

Extract all possible emoticons from a Python list


Objective

I am trying to extract all possible emoticons from a Unicode word list. I am using Python 3 with an Anaconda installation, so I cannot use a package such as emoji.py.

Here is a sample bag-of-words list.

lst = ['✅','türkçe','Çile','ısp','İst','ğ','some','#','@','@one','#thing','','1','41','ç','ö','⏱','⏱','👏','₺','€',':)',':/']

Expected output is like this:

out = ['✅','⏱', '⏱','👏']

Attempt 1

List comprehension that keeps every word containing a non-ASCII character (its UTF-8 byte length differs from its character count):

[w for w in lst if len(w) != len(w.encode())]

However, this does not give the desired output: it also returns words with non-ASCII letters, as well as currency symbols, which are not emoticons.

['✅', 'türkçe', 'Çile', 'ısp', 'İst', 'ğ', 'ç', 'ö', '⏱', '⏱', '👏', '₺', '€']
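The reason the check is too broad can be sketched as follows: in UTF-8 every non-ASCII character occupies more than one byte, so comparing lengths only tests for non-ASCII content, not for emoticons. The helper name `utf8_byte_lengths` is hypothetical and used only for illustration.

```python
def utf8_byte_lengths(word):
    """Return (character, UTF-8 byte count) pairs for each character in word."""
    return [(ch, len(ch.encode("utf-8"))) for ch in word]

# ASCII letters take 1 byte; a Turkish letter, a currency sign and an emoji
# all take more than 1 byte, so the length comparison cannot tell them apart.
for w in ["some", "ğ", "€", "👏"]:
    print(w, utf8_byte_lengths(w))
```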

Attempt 2

Using the NLTK emoticon regular expression:

from nltk.tokenize.casual import EMOTICON_RE
EMOTICON_RE.findall(' '.join(lst))

However, EMOTICON_RE can only extract expressions such as :) :/ :(

Here is the list of what I consider to be emoticons.

I tried to build a list of emoticons to see if my word exists in that list, but I could not build a list of emoticons from unicode character codes.

Can you please suggest an approach?


Solution

  • I think all of those characters are in the Unicode general category "Symbol, other" (So). Therefore you can do:

    import unicodedata
    [w for w in lst if any(unicodedata.category(c) == 'So' for c in w)]
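As a minimal, self-contained sketch of the category-based filter applied to the sample list from the question (the helper name `has_other_symbol` is hypothetical). Note that this is an approximation: the So category also contains symbols that are not emoticons, such as ©, so a stricter filter may be needed for other data.

```python
import unicodedata

def has_other_symbol(word):
    """Return True if any character in word has Unicode category 'So' (Symbol, other)."""
    return any(unicodedata.category(ch) == "So" for ch in word)

# Sample list taken from the question.
lst = ['✅', 'türkçe', 'Çile', 'ısp', 'İst', 'ğ', 'some', '#', '@', '@one',
       '#thing', '', '1', '41', 'ç', 'ö', '⏱', '⏱', '👏', '₺', '€', ':)', ':/']

# Currency signs are category 'Sc' and punctuation is 'P*', so they are dropped.
out = [w for w in lst if has_other_symbol(w)]
print(out)
```

Letters fall in the L* categories, digits in Nd, and the ASCII emoticons :) and :/ consist only of punctuation, so none of them pass the filter.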