Using collections.Counter to count emojis with different colors


I would like to use the collections.Counter class to count emojis in a string. It generally works fine. However, when I introduce emojis with skin-tone colors, the color component is separated from the emoji, like so:

>>> import collections
>>> emoji_string = "👌🏻👌🏼👌🏽👌🏾👌🏿"
>>> emoji_counter = collections.Counter(emoji_string)
>>> emoji_counter.most_common()
[('👌', 5), ('🏻', 1), ('🏼', 1), ('🏽', 1), ('🏾', 1), ('🏿', 1)]

How can I make the most_common() method return something like this instead?

[('👌🏻', 1), ('👌🏼', 1), ('👌🏽', 1), ('👌🏾', 1), ('👌🏿', 1)]

I'm using Python 3.6


Solution

  • You'll have to split your string into separate clusters. Each of your emoji is really two codepoints, the base emoji followed by an EMOJI MODIFIER FITZPATRICK TYPE-X codepoint:

    >>> print(emoji_string[0])
    👌
    >>> print(emoji_string[1])
    🏻
    >>> print(emoji_string[:2])
    👌🏻
    >>> print(ascii(emoji_string[:2]))
    '\U0001f44c\U0001f3fb'
    >>> import unicodedata
    >>> unicodedata.name(emoji_string[1])
    'EMOJI MODIFIER FITZPATRICK TYPE-1-2'
    

    You could use a regular expression to keep each modifier attached to the emoji that precedes it:

    import re
    
    char_with_modifier = re.compile(r'(.[\U0001f3fb-\U0001f3ff]?)')
    split_emoji = char_with_modifier.findall(emoji_string)
    

    and count the result.

    Demo:

    >>> import re
    >>> from collections import Counter
    >>> emoji_string = "👌🏻👌🏼👌🏽👌🏾👌🏿"
    >>> char_with_modifier = re.compile(r'(.[\U0001f3fb-\U0001f3ff]?)')
    >>> Counter(char_with_modifier.findall(emoji_string))
    Counter({'👌🏻': 1, '👌🏼': 1, '👌🏽': 1, '👌🏾': 1, '👌🏿': 1})
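
If you need this in more than one place, the same regular expression can be wrapped in a small helper. The following is only a sketch: the names `count_emoji_clusters` and `describe` are made up for illustration, and the sample string is written with `\U` escapes so it does not depend on how your editor or terminal renders emoji. It only handles the Fitzpatrick modifiers discussed above. Other multi-codepoint emoji, such as zero-width-joiner sequences, would still be split.

```python
import re
import unicodedata
from collections import Counter

# Fitzpatrick skin-tone modifiers occupy U+1F3FB through U+1F3FF.
# Pattern: any one character, optionally followed by one such modifier.
CHAR_WITH_MODIFIER = re.compile(r'.[\U0001f3fb-\U0001f3ff]?')


def count_emoji_clusters(text):
    """Count characters, keeping a skin-tone modifier attached to its base.

    Hypothetical helper name; it simply applies the regex shown above.
    """
    return Counter(CHAR_WITH_MODIFIER.findall(text))


def describe(cluster):
    """Return the Unicode name of every codepoint in a cluster."""
    return [unicodedata.name(ch, 'UNKNOWN') for ch in cluster]


# Two OK hands with the lightest tone, one with the darkest, one without a tone.
sample = ("\U0001f44c\U0001f3fb" "\U0001f44c\U0001f3fb"
          "\U0001f44c\U0001f3ff" "\U0001f44c")

counts = count_emoji_clusters(sample)
for cluster, n in counts.most_common():
    print(ascii(cluster), n, describe(cluster))
```

Because the modifier is optional in the pattern, an emoji without a skin tone is counted under its own key and is not merged with its toned variants.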