I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:
public static void UnicodeTest()
{
    var highUnicodeChar = "𝐀"; // Not the standard A
    var result1 = highUnicodeChar; // this works
    var result2 = highUnicodeChar[0].ToString(); // returns \ud835
}
When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.
In the end, I want result2 to yield the same value as result1. How can I do this?
In Unicode, you have code points. These are up to 21 bits long (U+0000 through U+10FFFF). Your character 𝐀, Mathematical Bold Capital A, has a code point of U+1D400.
In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.
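To make the code point versus code unit distinction concrete, here is a minimal sketch that counts how many code units 𝐀 needs in each encoding. The class and method names (CodeUnitDemo, CodeUnitCounts) are made up for illustration; the byte counts come from the standard System.Text.Encoding classes, divided by the code unit size.

```csharp
using System;
using System.Text;

static class CodeUnitDemo
{
    // Hypothetical helper: number of code units needed to encode s
    // in UTF-8 (1-byte units), UTF-16 (2-byte units) and UTF-32 (4-byte units).
    public static (int Utf8, int Utf16, int Utf32) CodeUnitCounts(string s) =>
        (Encoding.UTF8.GetByteCount(s),
         Encoding.Unicode.GetByteCount(s) / 2,
         Encoding.UTF32.GetByteCount(s) / 4);

    public static void Main()
    {
        var (utf8, utf16, utf32) = CodeUnitCounts("𝐀"); // one code point, U+1D400
        Console.WriteLine($"UTF-8: {utf8}, UTF-16: {utf16}, UTF-32: {utf32}");
        // prints "UTF-8: 4, UTF-16: 2, UTF-32: 1"
    }
}
```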
In UTF-16, two code units that form a single code point are called a surrogate pair. Surrogate pairs are used to encode any code point that does not fit in 16 bits, i.e. U+10000 and up.
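The arithmetic behind a surrogate pair can be sketched as follows. ToSurrogatePair is a hypothetical helper written only to show the bit manipulation; in real code, char.ConvertFromUtf32 already does this for you.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: splits a code point in U+10000..U+10FFFF
    // into its UTF-16 high and low surrogate code units.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // leaves a 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        var (high, low) = ToSurrogatePair(0x1D400);
        Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // prints "D835 DC00"

        // The built-in conversion produces the same two code units.
        string builtIn = char.ConvertFromUtf32(0x1D400);
        Console.WriteLine(builtIn[0] == high && builtIn[1] == low); // prints "True"
    }
}
```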
This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a collection of code units.
So your code point 𝐀 (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:
var highUnicodeChar = "𝐀";
char a = highUnicodeChar[0]; // code unit 0xD835
char b = highUnicodeChar[1]; // code unit 0xDC00
So when you index into the string like that, you're actually only getting half of the surrogate pair.
You can use char.IsSurrogatePair to test whether a surrogate pair starts at a given index. For instance:
string GetFullCodePointAtIndex(string s, int idx) =>
    s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
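Applied to the code in the question, the helper above gives result2 the same value as result1. The sketch below also adds EnumerateCodePoints, a hypothetical helper (not part of .NET) that walks a whole string one code point at a time by stepping over both halves of each surrogate pair.

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Returns the one or two code units that make up the code point starting at idx.
    public static string GetFullCodePointAtIndex(string s, int idx) =>
        s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);

    // Hypothetical helper: yields each code point of s as its own string.
    public static IEnumerable<string> EnumerateCodePoints(string s)
    {
        for (int i = 0; i < s.Length; )
        {
            string codePoint = GetFullCodePointAtIndex(s, i);
            yield return codePoint;
            i += codePoint.Length; // advance by 2 for a surrogate pair, else by 1
        }
    }

    public static void Main()
    {
        var highUnicodeChar = "𝐀";
        var result1 = highUnicodeChar;
        var result2 = GetFullCodePointAtIndex(highUnicodeChar, 0);
        Console.WriteLine(result1 == result2); // prints "True"

        foreach (string codePoint in EnumerateCodePoints("a𝐀b"))
            Console.WriteLine($"U+{char.ConvertToUtf32(codePoint, 0):X4}");
        // prints U+0061, U+1D400, U+0062 on separate lines
    }
}
```

Note that this assumes idx points at the start of a code point. If idx lands on the second half of a pair, you get a lone low surrogate back.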
It is important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the "visible thing" that most people, when asked, would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character, and zero or more combining characters. An example of a combining character is an umlaut, or any of the various other decorations/modifiers you might want to add. See this answer for a horrifying example of what combining characters can do.
To test for a combining character, you can use GetUnicodeCategory to check for an enclosing mark, non-spacing mark, or spacing combining mark.
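A minimal sketch of that check, using CharUnicodeInfo.GetUnicodeCategory from System.Globalization. IsCombiningMark and CountTextElements are hypothetical helper names. The second one leans on StringInfo.GetTextElementEnumerator, which iterates a string by text element rather than by code unit; how closely a text element matches a full grapheme cluster depends on the .NET version, so treat it as an approximation.

```csharp
using System;
using System.Globalization;

static class GraphemeDemo
{
    // Hypothetical helper: true if the code point at idx falls in one of
    // the three mark categories.
    public static bool IsCombiningMark(string s, int idx)
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, idx);
        return category == UnicodeCategory.NonSpacingMark
            || category == UnicodeCategory.SpacingCombiningMark
            || category == UnicodeCategory.EnclosingMark;
    }

    // Hypothetical helper: counts text elements (base character plus any
    // combining characters) instead of code units.
    public static int CountTextElements(string s)
    {
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
        int count = 0;
        while (enumerator.MoveNext()) count++;
        return count;
    }

    public static void Main()
    {
        string s = "u\u0308"; // 'u' followed by U+0308 COMBINING DIAERESIS
        Console.WriteLine(IsCombiningMark(s, 0)); // prints "False"
        Console.WriteLine(IsCombiningMark(s, 1)); // prints "True"
        Console.WriteLine(s.Length);              // prints "2" (code units)
        Console.WriteLine(CountTextElements(s));  // prints "1" (what a reader sees)
    }
}
```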