c# string unicode char astral-plane

How would you get an array of Unicode code points from a .NET String?


I have a list of character range restrictions that I need to check a string against, but the char type in .NET is a UTF-16 code unit, so characters outside the Basic Multilingual Plane become wacky (surrogate) pairs instead. Thus when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons with high values fail.

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?


Solution

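  • This answer is not correct: it keeps only the first code point of each text element (grapheme cluster), so combining marks that do not compose under normalization are dropped, and the normalization step itself changes the code points of the original string. See @Virtlink's answer for the correct one.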

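    using System.Collections.Generic;
    using System.Globalization;

    static int[] ExtractScalars(string s)
    {
      // Compose base characters and combining marks where possible (Form C by default).
      if (!s.IsNormalized())
      {
        s = s.Normalize();
      }

      List<int> chars = new List<int>((s.Length * 3) / 2);

      // Enumerates text elements (grapheme clusters), not individual code points.
      var ee = StringInfo.GetTextElementEnumerator(s);

      while (ee.MoveNext())
      {
        string e = ee.GetTextElement();
        // Takes the first code point of the text element only.
        chars.Add(char.ConvertToUtf32(e, 0));
      }

      return chars.ToArray();
    }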
    

    Note: Normalization (Form C by default) is needed so that a base character and its combining marks are merged into a single precomposed code point where one exists.
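
For comparison, here is a minimal sketch of a plain code-point walk that relies only on `char.ConvertToUtf32(string, int)` and `char.IsHighSurrogate` from the BCL. The names `CodePointHelper` and `ToCodePoints` are hypothetical, chosen for illustration. The sketch assumes you want the code points exactly as stored in the string, so it does not normalize, and it lets `ConvertToUtf32` throw an `ArgumentException` on an unpaired surrogate rather than substituting a replacement character.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the BCL.
static class CodePointHelper
{
    public static int[] ToCodePoints(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));

        // At most one code point per UTF-16 code unit.
        var codePoints = new List<int>(s.Length);

        for (int i = 0; i < s.Length; i++)
        {
            // Combines a valid surrogate pair at (i, i + 1) into one code point;
            // throws ArgumentException if a surrogate at i is unpaired.
            codePoints.Add(char.ConvertToUtf32(s, i));

            // The pair used two chars, so skip the low surrogate.
            if (char.IsHighSurrogate(s[i]))
            {
                i++;
            }
        }

        return codePoints.ToArray();
    }
}
```

With this sketch, `CodePointHelper.ToCodePoints("a\U0001F600e\u0301")` yields the four values 0x61, 0x1F600, 0x65, 0x301: the astral character becomes a single entry and the combining acute accent is kept as its own code point, so each value can be checked against a range restriction directly.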