I would like to use the collections.Counter class to count emojis in a string. It generally works fine. However, when I introduce colored emojis, the color component is counted separately from the emoji itself, like so:
>>> import collections
>>> emoji_string = "👌🏻👌🏼👌🏽👌🏾👌🏿"
>>> emoji_counter = collections.Counter(emoji_string)
>>> emoji_counter.most_common()
[('👌', 5), ('🏻', 1), ('🏼', 1), ('🏽', 1), ('🏾', 1), ('🏿', 1)]
How can I make the most_common() function return something like this instead:
[('👌🏻', 1), ('👌🏼', 1), ('👌🏽', 1), ('👌🏾', 1), ('👌🏿', 1)]
I'm using Python 3.6
You'll have to split your string into separate clusters. Each of your emoji is really two codepoints: the emoji itself, followed by an EMOJI MODIFIER FITZPATRICK TYPE X codepoint:
>>> print(emoji_string[0])
👌
>>> print(emoji_string[1])
🏻
>>> print(emoji_string[:2])
👌🏻
>>> print(ascii(emoji_string[:2]))
'\U0001f44c\U0001f3fb'
>>> import unicodedata
>>> unicodedata.name(emoji_string[1])
'EMOJI MODIFIER FITZPATRICK TYPE-1-2'
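To see why a single character range is enough to catch every skin tone, it can help to list the names of the five code points from U+1F3FB to U+1F3FF. This is only an illustrative sketch using the standard unicodedata module; the variable name modifier_names is made up for the example.

```python
import unicodedata

# The five Fitzpatrick skin-tone modifiers occupy U+1F3FB through U+1F3FF.
# modifier_names is a hypothetical helper name used only in this sketch.
modifier_names = {
    f"U+{cp:04X}": unicodedata.name(chr(cp))
    for cp in range(0x1F3FB, 0x1F3FF + 1)
}

for code, name in modifier_names.items():
    print(code, name)
```

Every name printed starts with EMOJI MODIFIER FITZPATRICK, which is what makes a simple range match possible.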
You could use a regular expression to keep each modifier together with the emoji that precedes it:
import re
char_with_modifier = re.compile(r'(.[\U0001f3fb-\U0001f3ff]?)')
split_emoji = char_with_modifier.findall(emoji_string)
and count the result.
Demo:
>>> import re
>>> from collections import Counter
>>> emoji_string = "👌🏻👌🏼👌🏽👌🏾👌🏿"
>>> char_with_modifier = re.compile(r'(.[\U0001f3fb-\U0001f3ff]?)')
>>> Counter(char_with_modifier.findall(emoji_string))
Counter({'👌🏻': 1, '👌🏼': 1, '👌🏽': 1, '👌🏾': 1, '👌🏿': 1})
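If you want the list that most_common() returns, as asked in the question, the same pattern can be wrapped in a small helper. This is a minimal sketch: count_emoji is a hypothetical name, and the sample string is written with \U escapes only to avoid encoding problems when copying.

```python
import re
from collections import Counter

# Any single character, optionally followed by one skin-tone modifier
# (U+1F3FB through U+1F3FF). Same idea as the pattern above.
CHAR_WITH_MODIFIER = re.compile(r'.[\U0001f3fb-\U0001f3ff]?')


def count_emoji(text):
    """Count characters in text, keeping a skin-tone modifier with its base.

    Hypothetical helper for illustration; not part of any library.
    """
    return Counter(CHAR_WITH_MODIFIER.findall(text))


# OK HAND SIGN with type-1-2, with type-3, unmodified, and type-1-2 again.
sample = (
    "\U0001f44c\U0001f3fb"
    "\U0001f44c\U0001f3fc"
    "\U0001f44c"
    "\U0001f44c\U0001f3fb"
)
result = count_emoji(sample).most_common()
print(result)
```

Note that this only handles skin-tone modifiers. Without the re.DOTALL flag, `.` does not match newlines, so line breaks in the input are skipped rather than counted.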