
C#: Counting occurrences of strings containing emoji


I can count the occurrences of each character in a string with the following method:

private List<CountClass> CountCharacterOccurences(string theText)
{

    List<CountClass> theCountList = new();

    while (theText.Length > 0)
    {
        int cal = 0;
        for (int j = 0; j < theText.Length; j++)
            if (theText[0] == theText[j])
                cal++;

        theCountList.Add(new CountClass { Category = theText[0].ToString(), Count = cal });

        theText = theText.Replace(theText[0].ToString(), string.Empty);
    }

    return theCountList;
}

However, if my string contains emojis, this logic does not work: emojis seem to be encoded as two or more chars, so reading the string char by char is wrong.

I can identify and isolate the emojis in my string with a regex, but that does not seem to help with the counting.

Any help? Thanks!


Solution

  • I assume you want to treat each grapheme cluster of the string as a separate character. A grapheme cluster is what is displayed as a single "unit" of text. Besides single chars and surrogate pairs, this also covers emojis with skin tone modifiers, zero-width joiner (ZWJ) sequences, characters with combining diacritics, and so on. This means that a "man with dark skin" emoji is counted separately from a "man with light skin" emoji.
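
    To make the difference between chars, runes and grapheme clusters concrete, here is a minimal sketch that measures one skin-toned emoji in three ways. It assumes .NET 5 or later, where StringInfo treats an emoji plus its skin tone modifier as a single text element; the variable names are illustrative only.

    ```csharp
    using System;
    using System.Globalization;
    using System.Linq;

    // One "man, dark skin tone" emoji: U+1F468 followed by the modifier U+1F3FF.
    // Written with escapes so the sample survives copy-and-paste.
    string text = "\U0001F468\U0001F3FF";

    int utf16Units = text.Length;                               // UTF-16 code units (chars)
    int runes = text.EnumerateRunes().Count();                  // Unicode scalar values
    int graphemes = new StringInfo(text).LengthInTextElements;  // grapheme clusters

    // On .NET 5+ this is expected to print: 4 chars, 2 runes, 1 grapheme cluster(s)
    Console.WriteLine($"{utf16Units} chars, {runes} runes, {graphemes} grapheme cluster(s)");
    ```

    This is why indexing the string with theText[j] splits a single visible emoji into several pieces.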

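    You can use StringInfo.GetTextElementEnumerator to iterate through the grapheme clusters (this relies on .NET 5 or later; earlier runtimes use older text element rules and may split such emojis):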

    using System.Collections.Generic;
    using System.Globalization;
    
    // Maps each grapheme cluster to the number of times it occurs.
    var dictionary = new Dictionary<string, int>();
    var graphemeEnumerator = StringInfo.GetTextElementEnumerator("👨🏿👨🏿👨🏻");
    while (graphemeEnumerator.MoveNext()) {
        var grapheme = graphemeEnumerator.GetTextElement();
        if (dictionary.ContainsKey(grapheme)) {
            dictionary[grapheme]++;
        } else {
            dictionary.Add(grapheme, 1);
        }
    }
    
    // { [👨🏿, 2], [👨🏻, 1] }
    

    You can then convert the dictionary into your CountClass if you want.
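
    As a rough sketch of that conversion, the counting can be wrapped in a helper that returns a list. The question does not show the definition of CountClass, so this example uses a value tuple with the same two member names (Category, Count) as a stand-in; the helper name CountGraphemeOccurrences is likewise hypothetical. It assumes .NET 5 or later.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Linq;

    // "man, dark skin tone" twice, then "man, light skin tone", written with
    // escapes so the sample survives copy-and-paste.
    string sample = "\U0001F468\U0001F3FF\U0001F468\U0001F3FF\U0001F468\U0001F3FB";

    List<(string Category, int Count)> result = CountGraphemeOccurrences(sample);
    foreach (var (category, count) in result)
        Console.WriteLine($"{category}: {count}");

    // Counts each grapheme cluster, then flattens the dictionary into a list.
    static List<(string Category, int Count)> CountGraphemeOccurrences(string theText)
    {
        var counts = new Dictionary<string, int>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(theText);
        while (enumerator.MoveNext())
        {
            string grapheme = enumerator.GetTextElement();
            counts[grapheme] = counts.TryGetValue(grapheme, out int seen) ? seen + 1 : 1;
        }

        // With the question's own type, the projection would instead be:
        //   new CountClass { Category = pair.Key, Count = pair.Value }
        return counts.Select(pair => (Category: pair.Key, Count: pair.Value)).ToList();
    }
    ```

    Unlike the loop in the question, this makes a single pass over the text and never calls Replace, so it also avoids rebuilding the string for every distinct character.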

    Note that Dmitry Bychenko's answer iterates over the runes (aka Unicode scalar values) instead. For "👨🏿👨🏿👨🏻", their answer will count 3 man emojis, 2 dark skin tone modifiers, and 1 light skin tone modifier.
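
    For comparison, here is a minimal sketch of rune-based counting. It is not the code from that answer, only an illustration of the same idea using string.EnumerateRunes (available since .NET Core 3.0); the helper name CountRunes is hypothetical.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    // Same sample as above: two dark skin tone men, one light skin tone man.
    string sample = "\U0001F468\U0001F3FF\U0001F468\U0001F3FF\U0001F468\U0001F3FB";

    Dictionary<string, int> runeCounts = CountRunes(sample);
    foreach (KeyValuePair<string, int> pair in runeCounts)
        Console.WriteLine($"{pair.Key}: {pair.Value}");

    // Counts every Unicode scalar value on its own, so a skin tone modifier
    // is tallied separately from the emoji it follows.
    static Dictionary<string, int> CountRunes(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (Rune rune in text.EnumerateRunes())
        {
            string key = rune.ToString();
            counts[key] = counts.TryGetValue(key, out int seen) ? seen + 1 : 1;
        }
        return counts;
    }
    ```

    Here the man emoji (U+1F468) ends up with a count of 3, the dark modifier (U+1F3FF) with 2 and the light modifier (U+1F3FB) with 1. Which of the two approaches is right depends on whether skin tone variants should be counted as distinct characters.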