Split and count emojis and words in a given string in Python


For a given string, I'm trying to count the number of occurrences of each word and emoji. I already did this here for emojis that consist of a single emoji. The problem is that many current emojis are composed of several emojis.

For example, the emoji 👨‍👩‍👦‍👦 consists of four emojis (👨 👩 👦 👦), and emojis with a human skin color are also sequences: 🙅🏽 is 🙅 followed by 🏽.

The problem boils down to splitting the string into the right pieces; once that is done, counting them is easy.

There are some good questions that address the same thing, like link1 and link2, but none of them gives a general solution (or the solution is outdated, or I just can't figure it out).

For example, if the string is hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦, then I want {'hello':2, 'emoji':1, '👨‍👩‍👦‍👦':1, '👩🏾‍🎓':1}. My strings are from WhatsApp, and all are encoded in UTF-8.

I had many bad attempts. Help would be appreciated.


Solution

  • Use the third-party regex module, whose \X pattern matches grapheme clusters (sequences of Unicode code points rendered as a single character):

    >>> import regex
    >>> s='👨‍👩‍👦‍👦🙅🏽'
    >>> regex.findall(r'\X',s)
    ['👨\u200d👩\u200d👦\u200d👦', '🙅🏽']
    >>> for c in regex.findall(r'\X',s):
    ...     print(c)
    ... 
    👨‍👩‍👦‍👦
    🙅🏽
    

    To count them:

    >>> data = regex.findall(r'\X',s)
    >>> from collections import Counter
    >>> Counter(data)
    Counter({'👨\u200d👩\u200d👦\u200d👦': 1, '🙅🏽': 1})
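
    The question also wants whole words counted next to the emojis, while \X on its own returns every letter as a separate cluster. Below is a rough, standard-library-only sketch of one way to get both. It is not the regex module's algorithm: the helper names (tokenize, is_skin_tone) are made up for illustration, and the joining rules cover only zero-width joiners, skin tone modifiers and variation selector-16, so flags, keycaps and other sequences would need the full grapheme rules that \X implements.

```python
from collections import Counter

ZWJ = "\u200d"   # zero-width joiner: glues several emojis into one sequence
VS16 = "\ufe0f"  # variation selector-16: requests emoji presentation


def is_skin_tone(ch):
    """True for the five skin tone modifiers U+1F3FB..U+1F3FF."""
    return "\U0001F3FB" <= ch <= "\U0001F3FF"


def tokenize(text):
    """Split text into words and (simplified) emoji clusters.

    Hypothetical helper: an approximation, not full grapheme segmentation.
    """
    tokens = []
    kind = None  # kind of the token being built: "word", "symbol" or None
    for ch in text:
        if ch.isspace():
            kind = None  # whitespace ends the current token
            continue
        if ch.isalnum():
            if kind == "word":
                tokens[-1] += ch  # extend the current word
            else:
                tokens.append(ch)  # start a new word
            kind = "word"
            continue
        # Anything else is treated as a symbol (emoji, punctuation, ...).
        glue = ch in (ZWJ, VS16) or is_skin_tone(ch)
        if kind == "symbol" and (glue or tokens[-1].endswith(ZWJ)):
            tokens[-1] += ch  # joiner/modifier, or an emoji right after a ZWJ
        else:
            tokens.append(ch)  # start a new symbol cluster
        kind = "symbol"
    return tokens


# The string from the question, written with escapes:
# hello <woman student, medium-dark skin> emoji hello <family of four>
sample = (
    "hello \U0001F469\U0001F3FE\u200d\U0001F393 emoji hello "
    "\U0001F468\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"
)
counts = Counter(tokenize(sample))
print(counts)
```

    With the sample string this counts hello twice and emoji, 👩🏾‍🎓 and 👨‍👩‍👦‍👦 once each. For real WhatsApp data, the \X approach above is the safer choice for the emoji part.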