
Comparing gender emojis in UTF-16


I made a program that reads an input string, checks whether it is a certain emoji, and returns a number depending on which emoji it is.

The problem comes with emojis that have gender variants. For example, the policeman emoji isn't detected. I tried comparing the string with "👮‍", but it didn't match. I then tried adding the male symbol and comparing the string with "👮‍♂️♂️", but that didn't work either.

Example of a piece of my code:

                case "💋":
                case "🕵":
                    Send(args[1] + " 70%", update.Message.Chat.Id);
                    break;
                case "👳":
                case "🍃":
                case "🔮":
                case "🌟":
                    Send(args[1] + " 40%", update.Message.Chat.Id);
                    break;

All of them work except for 🕵 and 👳, which happen to be the ones with different genders.

Not sure if it matters, but language is C# and I'm programming in Visual Studio, which lets me copy and paste the emojis in there.

What am I doing wrong?


Solution

  • I tried comparing the string with "👮‍", but it wasn't detected.

    The Police emoji above is made of two Unicode "characters", better called code points: the Police Officer U+1F46E and U+200D, the Zero Width Joiner (ZWJ), an invisible character that glues the officer to whatever follows it, such as the male sign in 👮‍♂️. If the case statement contains only the Police Officer U+1F46E, an input string that also carries the joiner will not match.

    You must make sure that the emojis you pasted into the code are identical, code point by code point, to the emojis you receive in the input string. Displaying the strings is misleading, because they can look the same on screen and still differ.
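
    To see the mismatch concretely, here is a minimal sketch. It assumes the input is the full "man police officer" sequence (U+1F46E U+200D U+2642 U+FE0F); the class and member names are made up for illustration.

    ```csharp
    using System;

    public static class EmojiCompareDemo
    {
        // Police officer alone: one code point, two UTF-16 code units.
        public const string Officer = "\U0001F46E";

        // Assumed input: officer + zero width joiner + male sign + variation selector-16.
        public const string ManOfficer = "\U0001F46E\u200D\u2642\uFE0F";

        // Prints why a case label holding only the officer misses the gendered variant.
        public static void Show()
        {
            Console.WriteLine(Officer == ManOfficer);   // False
            Console.WriteLine(Officer.Length);          // 2
            Console.WriteLine(ManOfficer.Length);       // 5
            Console.WriteLine(ManOfficer.StartsWith(Officer, StringComparison.Ordinal)); // True
        }
    }
    ```

    Calling `EmojiCompareDemo.Show()` from `Main` shows that the two strings share a prefix but are not equal, which is exactly what a `switch` on strings tests for.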

    In the source code I would put the 👮‍ in a comment and write the string in the case statement with the code point escape "\U0001F46E".

    case "\U0001F46E":        // 👮‍
    case "\U0001F46E\u200D":  // 👮‍ + ....
    

    Or

    const string PoliceOfficer = "\U0001F46E"; // 👮‍
    ...
    case PoliceOfficer: 
    

    Notice the different escapes: uppercase \U takes 8 hex digits and lowercase \u takes 4 hex digits. When a string isn't recognized, print it out (for instance in the debugger), work out the escape sequence that builds it, and add that to the case statements.
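
    One way to get that escape sequence is to let the program print it. The helper below is a sketch, and `EmojiDebug.ToEscapes` is a hypothetical name, not a framework API. It walks the UTF-16 string and emits `\U` for surrogate pairs and `\u` for everything else.

    ```csharp
    using System;
    using System.Text;

    public static class EmojiDebug
    {
        // Returns the C# escape form of a string, one escape per code point.
        public static string ToEscapes(string text)
        {
            var sb = new StringBuilder();
            for (int i = 0; i < text.Length; i++)
            {
                bool isPair = char.IsHighSurrogate(text[i])
                              && i + 1 < text.Length
                              && char.IsLowSurrogate(text[i + 1]);
                if (isPair)
                {
                    // Code points above U+FFFF need the 8-digit \U form.
                    int codePoint = char.ConvertToUtf32(text[i], text[i + 1]);
                    sb.Append("\\U").Append(codePoint.ToString("X8"));
                    i++; // skip the low surrogate that was just consumed
                }
                else
                {
                    // Everything else fits the 4-digit \u form.
                    sb.Append("\\u").Append(((int)text[i]).ToString("X4"));
                }
            }
            return sb.ToString();
        }
    }
    ```

    For example, `Console.WriteLine(EmojiDebug.ToEscapes(input))` in the default branch of the switch logs a string that can be pasted straight into a new case label.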

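    As an alternative, you could first remove the joiner "\u200D" from the input string and then pass the result to the case statement. Note that U+200D is a format character rather than a combining mark, and that for a full sequence such as 👮‍♂️ the gender sign and the variation selector that follow the joiner have to be removed as well. If needed, you can then give the removed characters an additional meaning.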
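
    The stripping approach could look roughly like the sketch below. `EmojiNormalizer`, `StripGender` and `TryGetPercent` are hypothetical names, the percentages are taken from the question's snippet, and the set of removed characters (joiner, variation selector-16, male and female signs) is an assumption about what the input contains.

    ```csharp
    using System;
    using System.Text;

    public static class EmojiNormalizer
    {
        // Drops the characters that turn a base emoji into a gendered variant.
        public static string StripGender(string text)
        {
            var sb = new StringBuilder(text.Length);
            foreach (char c in text)
            {
                bool skip = c == '\u200D'   // zero width joiner
                         || c == '\uFE0F'   // variation selector-16
                         || c == '\u2642'   // male sign
                         || c == '\u2640';  // female sign
                if (!skip) sb.Append(c);
            }
            return sb.ToString();
        }

        // Maps a (possibly gendered) emoji to the percentage used in the question.
        public static bool TryGetPercent(string emoji, out string percent)
        {
            switch (StripGender(emoji))
            {
                case "\U0001F48B":   // kiss mark
                case "\U0001F575":   // detective
                    percent = "70%";
                    return true;
                case "\U0001F473":   // person wearing turban
                    percent = "40%";
                    return true;
                default:
                    percent = "";
                    return false;
            }
        }
    }
    ```

    The trade-off is that removing every U+200D also breaks up unrelated joiner sequences (family or profession emojis, for example), and skin tone modifiers are not handled here, so this only fits when the input is known to be a single emoji.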