c# .net utf-8 utf8mb4

How to remove any utf8mb4 characters from a string


Using C#, how can utf8mb4 characters (emoji, etc.) be removed from a string so that the result is fully utf8 compliant?

Most of the solutions involve changing the database configuration, but unfortunately that is not an option for me.


Solution

  • This should replace surrogate characters with a replacementCharacter (which can even be string.Empty).

    This is a MySQL problem, given the mention of utf8mb4. The difference between utf8 and utf8mb4 in MySQL is that utf8 doesn't support 4-byte UTF-8 sequences. According to the wiki, 4-byte UTF-8 sequences encode code points above 0xFFFF, which in UTF-16 require two chars (called a surrogate pair). This method removes surrogate characters. When a high surrogate is immediately followed by a low surrogate (a valid pair), the pair is replaced by a single replacementCharacter; an orphaned (invalid) high or low surrogate is likewise replaced by a replacementCharacter.

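    // Requires: using System.Text; (for StringBuilder)
    // Replaces each surrogate pair, and each orphaned surrogate, with replacementCharacter.
    public static string RemoveSurrogatePairs(string str, string replacementCharacter = "?")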
    {
        if (str == null)
        {
            return null;
        }
    
        // Created lazily: a string without surrogates is returned as-is, with no copy
        StringBuilder sb = null;
    
        for (int i = 0; i < str.Length; i++)
        {
            char ch = str[i];
    
            if (char.IsSurrogate(ch))
            {
                if (sb == null)
                {
                    // Copy the surrogate-free prefix (first i chars) and reserve capacity for the rest
                    sb = new StringBuilder(str, 0, i, str.Length);
                }
    
                sb.Append(replacementCharacter);
    
                // If there is a high+low surrogate, skip the low surrogate
                if (i + 1 < str.Length && char.IsHighSurrogate(ch) && char.IsLowSurrogate(str[i + 1]))
                {
                    i++;
                }
            }
            else if (sb != null)
            {
                sb.Append(ch);
            }
        }
    
        return sb == null ? str : sb.ToString();
    }
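
To make the reasoning above concrete, here is a minimal sketch showing that a code point above 0xFFFF takes two UTF-16 chars and four UTF-8 bytes, and how such strings could be detected before they reach the database. The class `Utf8Mb4Demo` and the helper `ContainsSurrogates` are hypothetical names introduced only for this illustration; they are not part of the answer above.

```csharp
using System;
using System.Text;

public static class Utf8Mb4Demo
{
    // Hypothetical helper: returns true if the string contains any UTF-16
    // surrogate, i.e. a character that needs a 4-byte UTF-8 sequence
    // (or an orphaned surrogate, which is invalid on its own).
    public static bool ContainsSurrogates(string s)
    {
        if (s == null)
        {
            return false;
        }

        foreach (char ch in s)
        {
            if (char.IsSurrogate(ch))
            {
                return true;
            }
        }

        return false;
    }

    public static void Main()
    {
        string bmp = "caf\u00E9";      // U+00E9 is inside the Basic Multilingual Plane
        string emoji = "\U0001F600";   // U+1F600 is above 0xFFFF

        Console.WriteLine(emoji.Length);                       // 2 (one surrogate pair)
        Console.WriteLine(Encoding.UTF8.GetByteCount(emoji));  // 4 (a 4-byte UTF-8 sequence)
        Console.WriteLine(ContainsSurrogates(bmp));            // False
        Console.WriteLine(ContainsSurrogates(emoji));          // True
    }
}
```

A check like this could be used to skip the replacement pass entirely for strings that are already safe, although RemoveSurrogatePairs above already avoids allocating in that case.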