I can count the occurrences of each character in a string with the following method:
private List<CountClass> CountCharacterOccurences(string theText)
{
    List<CountClass> theCountList = new();
    while (theText.Length > 0)
    {
        int cal = 0;
        for (int j = 0; j < theText.Length; j++)
            if (theText[0] == theText[j])
                cal++;
        theCountList.Add(new CountClass { Category = theText[0].ToString(), Count = cal });
        theText = theText.Replace(theText[0].ToString(), string.Empty);
    }
    return theCountList;
}
However, if my string contains emojis, my logic does not work: it seems emojis are encoded as two or more chars, so reading the string char by char gives wrong results.
I'm able to identify and isolate the emojis in my string using a regex, but that does not seem to help with counting.
Any help? Thanks!
I assume you want to treat each grapheme cluster of the string as a separate character. A grapheme cluster is displayed as a single "unit" of text. In addition to single chars and surrogate pairs, this also includes things like emojis with skin tone modifiers, zero-width joiner sequences, combining diacritics, etc. This means that a "man with dark skin" emoji is counted separately from a "man with light skin" emoji.
You can use StringInfo.GetTextElementEnumerator to iterate through the grapheme clusters:
using System.Globalization;

var dictionary = new Dictionary<string, int>();
var graphemeEnumerator = StringInfo.GetTextElementEnumerator("👨🏿👨🏿👨🏻");
while (graphemeEnumerator.MoveNext()) {
    // Each text element is one grapheme cluster, e.g. an emoji plus its skin tone modifier.
    var grapheme = graphemeEnumerator.GetTextElement();
    if (dictionary.ContainsKey(grapheme)) {
        dictionary[grapheme]++;
    } else {
        dictionary.Add(grapheme, 1);
    }
}
// { [👨🏿, 2], [👨🏻, 1] }
You can then convert the dictionary into your CountClass if you want.
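As a rough sketch of that conversion, the counting and the mapping to CountClass could be combined as below. The shape of CountClass (a string Category and an int Count) is only inferred from how the question uses it, and the names GraphemeCounter and CountGraphemeOccurrences are made up for illustration. Also note that, as far as I know, GetTextElementEnumerator only groups emoji with their modifiers on .NET 5 and later; older runtimes may split them.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Assumed shape of CountClass, inferred from the question's code.
public class CountClass
{
    public string Category { get; set; } = "";
    public int Count { get; set; }
}

// Hypothetical helper class, not part of the original question or answer.
public static class GraphemeCounter
{
    public static List<CountClass> CountGraphemeOccurrences(string theText)
    {
        var counts = new Dictionary<string, int>();
        var enumerator = StringInfo.GetTextElementEnumerator(theText);
        while (enumerator.MoveNext())
        {
            // One text element corresponds to one grapheme cluster.
            var grapheme = enumerator.GetTextElement();
            counts[grapheme] = counts.TryGetValue(grapheme, out var n) ? n + 1 : 1;
        }

        // Map each (grapheme, count) pair to a CountClass instance.
        return counts
            .Select(pair => new CountClass { Category = pair.Key, Count = pair.Value })
            .ToList();
    }
}
```

Calling GraphemeCounter.CountGraphemeOccurrences("aab") should give two entries, "a" with Count 2 and "b" with Count 1. The order of the entries follows the dictionary's enumeration order, which is not formally guaranteed, so sort the list if order matters.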
Note that Dmitry Bychenko's answer iterates over the runes (i.e. Unicode scalar values) instead. For "👨🏿👨🏿👨🏻", their answer will count 3 man emojis, 2 dark skin tone modifiers, and 1 light skin tone modifier.
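To make the difference concrete, here is a minimal sketch of rune-based counting. It is not the other answer's code, just an illustration using string.EnumerateRunes (available since .NET Core 3.0); RuneCounter and CountRunes are hypothetical names. The input is written with \U escapes so the code points are unambiguous: man (U+1F468), dark skin tone (U+1F3FF), light skin tone (U+1F3FB).

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class for illustration only.
public static class RuneCounter
{
    public static Dictionary<string, int> CountRunes(string text)
    {
        var counts = new Dictionary<string, int>();
        // EnumerateRunes yields one Rune per Unicode scalar value,
        // so an emoji and its skin tone modifier are counted separately.
        foreach (Rune rune in text.EnumerateRunes())
        {
            var key = rune.ToString();
            counts[key] = counts.TryGetValue(key, out var n) ? n + 1 : 1;
        }
        return counts;
    }
}

public static class RuneCounterDemo
{
    // Man + dark tone, man + dark tone, man + light tone.
    public const string Sample =
        "\U0001F468\U0001F3FF\U0001F468\U0001F3FF\U0001F468\U0001F3FB";
}
```

RuneCounter.CountRunes(RuneCounterDemo.Sample) yields three keys: the bare man emoji with count 3, the dark skin tone modifier with count 2, and the light skin tone modifier with count 1.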