How can I tell whether a string starts with an emoji, and get the first emoji in the string, without using regex?


I've seen some answers here providing monstrous regular expressions to get emojis from a string. But is there a more algorithmic approach? I mean, operating systems and browsers parse emoji-containing strings somehow, and I doubt it's done with regexes.


Solution

  • I've knocked up the below extension method / demo; hopefully that's some help.

    Caveat: I don't know much about this area, so please don't treat this as gospel, and test thoroughly before relying on it.

    In fact, the regex answer probably comes up so often because it's currently the best answer, given the complexity.

    using System;
    using System.Globalization;
    
    public class Demo
    {
        public static void Main()
        {
            var emojiString = "😀 that's an emoji";
            Console.WriteLine(emojiString);
            Console.WriteLine("First UTF-16 char is: [{0}] (half a surrogate pair: a char is 16 bits, but 😀 needs 32)", emojiString[0]);
            Console.WriteLine("First text element is an emoji? {0}", emojiString.IsEmoji(0));
            Console.WriteLine("Second text element is an emoji? {0}", emojiString.IsEmoji(1));
        }
    }
    
    public static class UnicodeCodePointExtensions 
    {
        // uses StringInfo from the System.Globalization namespace: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo?view=net-7.0
        public static bool IsEmoji(this string inputString, int index) 
        {
            return (new StringInfo(inputString)).IsEmoji(index);
        }
        public static bool IsEmoji(this StringInfo inputString, int index)
        {
            // Get the text element at the given text-element index.
            var firstUnicodeChar = inputString.SubstringByTextElements(index, 1);
            // Get the numeric code point of the element's first character. We extract the element first,
            // rather than passing the index straight to ConvertToUtf32, because any earlier surrogate
            // pairs in the string would offset a UTF-16 index.
            var charCode = Char.ConvertToUtf32(firstUnicodeChar, 0);
            return IsEmoticon(charCode) 
            || IsMiscPictograph(charCode)
            || IsTransport(charCode)
            || IsMiscSymbol(charCode)
            || IsDingbat(charCode)
            || IsVariationSelector(charCode)
            || IsSupplemental(charCode)
            || IsFlag(charCode);
        }
        
        // these range values from https://stackoverflow.com/a/36258684/361842
        private static bool IsEmoticon(int charCode) =>
            0x1F600 <= charCode && charCode <= 0x1F64F;
        private static bool IsMiscPictograph(int charCode) =>
            0x1F300 <= charCode && charCode <= 0x1F5FF;
        private static bool IsTransport(int charCode) =>
            0x1F680 <= charCode && charCode <= 0x1F6FF;
        private static bool IsMiscSymbol(int charCode) =>
            0x2600 <= charCode && charCode <= 0x26FF;
        private static bool IsDingbat(int charCode) =>
            0x2700 <= charCode && charCode <= 0x27BF;
        private static bool IsVariationSelector(int charCode) =>
            0xFE00 <= charCode && charCode <= 0xFE0F;
        private static bool IsSupplemental(int charCode) =>
            0x1F900 <= charCode && charCode <= 0x1F9FF;
        private static bool IsFlag(int charCode) =>
            0x1F1E6 <= charCode && charCode <= 0x1F1FF;
    }
    
    

    The Unicode scalar ranges used in the private methods can be found here: https://stackoverflow.com/a/36258684/361842

    Info on how to get the Nth "character" from a string where not every character is a single "char" is here: https://www.meziantou.net/how-to-correctly-count-the-number-of-characters-of-a-string.htm
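
    To illustrate the difference between "char"s and perceived characters, here is a minimal sketch. The class name TextElementDemo is made up for this example; the StringInfo members are from System.Globalization.

    ```csharp
    using System;
    using System.Globalization;

    public static class TextElementDemo
    {
        public static void Main()
        {
            var s = "😀 ok";

            // Length counts UTF-16 code units; 😀 takes two of them (a surrogate pair).
            Console.WriteLine(s.Length);                                   // 5

            // LengthInTextElements counts user-perceived characters instead.
            Console.WriteLine(new StringInfo(s).LengthInTextElements);     // 4

            // The first text element is the whole emoji, i.e. both code units.
            Console.WriteLine(StringInfo.GetNextTextElement(s, 0).Length); // 2
        }
    }
    ```

    This is why indexing the string directly (s[0]) yields only half of the emoji.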

    Related MS documentation on the StringInfo class / SubstringByTextElements method here: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.substringbytextelements?view=net-7.0
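
    For the "get the first emoji" half of the question, something along these lines might work. It is only a sketch under the same caveats as above: TryGetLeadingEmoji and InEmojiBlock are hypothetical names, not part of any library, and the block ranges are the standard Unicode blocks named in the answer, which do not cover every emoji.

    ```csharp
    using System;
    using System.Globalization;

    public static class LeadingEmoji
    {
        // Hypothetical helper: if the first text element of `text` starts with a
        // code point in one of the blocks below, returns it via `emoji`.
        public static bool TryGetLeadingEmoji(string text, out string emoji)
        {
            emoji = string.Empty;
            if (string.IsNullOrEmpty(text)) return false;

            // First user-perceived character, which may span several chars.
            string element = StringInfo.GetNextTextElement(text, 0);

            // An unpaired surrogate would make ConvertToUtf32 throw, so bail out.
            if (char.IsSurrogate(element, 0) && !char.IsSurrogatePair(element, 0)) return false;

            int codePoint = char.ConvertToUtf32(element, 0);
            if (!InEmojiBlock(codePoint)) return false;

            emoji = element;
            return true;
        }

        // Variation selectors (U+FE00..U+FE0F) are deliberately left out here:
        // on their own they modify a preceding character rather than being an emoji.
        private static bool InEmojiBlock(int cp) =>
            (0x1F600 <= cp && cp <= 0x1F64F)     // Emoticons
            || (0x1F300 <= cp && cp <= 0x1F5FF)  // Misc Symbols and Pictographs
            || (0x1F680 <= cp && cp <= 0x1F6FF)  // Transport and Map
            || (0x2600 <= cp && cp <= 0x26FF)    // Misc Symbols
            || (0x2700 <= cp && cp <= 0x27BF)    // Dingbats
            || (0x1F900 <= cp && cp <= 0x1F9FF)  // Supplemental Symbols and Pictographs
            || (0x1F1E6 <= cp && cp <= 0x1F1FF); // Regional indicators (flags)

        public static void Main()
        {
            bool found = TryGetLeadingEmoji("😀 that's an emoji", out string first);
            Console.WriteLine(found);         // True
            Console.WriteLine(first == "😀"); // True

            Console.WriteLine(TryGetLeadingEmoji("no emoji here", out _)); // False
        }
    }
    ```

    Returning the whole text element rather than a single code point means any modifiers attached to the emoji come along with it, though exactly how text elements are split depends on the .NET version.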