I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16, and therefore some characters become wacky (surrogate) pairs instead. Thus, when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons with high values fail.
I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...
How would you convert a string to an array (int[]) of 32-bit Unicode code points?
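To make the problem concrete, here is a small sketch of what enumerating chars yields for a character outside the Basic Multilingual Plane. The class name SurrogateDemo and the choice of U+1D11E as sample input are only illustrative assumptions.

```csharp
using System;

static class SurrogateDemo
{
    static void Main()
    {
        // One code point, U+1D11E (MUSICAL SYMBOL G CLEF), chosen as sample input.
        string s = "\U0001D11E";

        // The string holds two UTF-16 code units, not one character.
        Console.WriteLine(s.Length); // 2

        // Enumerating chars yields the surrogate halves, never 0x1D11E itself.
        foreach (char c in s)
        {
            Console.WriteLine("0x{0:X4}", (int)c); // 0xD834, then 0xDD1E
        }
    }
}
```

A range check such as "is this value at or above 0x10000" can therefore never succeed on an individual char.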
This answer is not correct. See @Virtlink's answer for the correct one.
    static int[] ExtractScalars(string s)
    {
        if (!s.IsNormalized())
        {
            s = s.Normalize();
        }
        List<int> chars = new List<int>((s.Length * 3) / 2);
        var ee = StringInfo.GetTextElementEnumerator(s);
        while (ee.MoveNext())
        {
            string e = ee.GetTextElement();
            chars.Add(char.ConvertToUtf32(e, 0));
        }
        return chars.ToArray();
    }
Notes: Normalization is required to deal with composite characters.
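For comparison, here is a minimal sketch that walks the UTF-16 code units directly instead of going through text elements. It is not necessarily the code from the answer referenced above. The snippet above returns only the first code point of each text element, and Normalize() can change which code points the string contains, so neither step is used here. The names CodePointSketch and ToCodePoints are hypothetical, and passing unpaired surrogates through unchanged is an assumption about the desired behavior.

```csharp
using System;
using System.Collections.Generic;

static class CodePointSketch
{
    // Hypothetical helper: returns one int per Unicode code point in s.
    public static int[] ToCodePoints(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));

        // At most one code point per UTF-16 code unit, so s.Length is an upper bound.
        var codePoints = new List<int>(s.Length);

        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                // Valid high/low pair: combine both halves into one code point.
                codePoints.Add(char.ConvertToUtf32(s[i], s[i + 1]));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character, or an unpaired surrogate passed through as-is
                // (assumption: the caller prefers this to an exception).
                codePoints.Add(s[i]);
            }
        }

        return codePoints.ToArray();
    }
}
```

For example, CodePointSketch.ToCodePoints("a\U0001F600") should give { 0x61, 0x1F600 }. Combining sequences are deliberately left alone: "e" followed by U+0301 stays two code points, which seems to be what a per-code-point range check needs.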