Tags: c#, .net, unicode, unicode-escapes

Convert Unicode surrogate pair to literal string


I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:

public static void UnicodeTest()
{
    var highUnicodeChar = "𝐀"; //Not the standard A

    var result1 = highUnicodeChar; //this works
    var result2 = highUnicodeChar[0].ToString(); // returns \ud835
}

When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.

In the end, I want result2 to yield the same value as result1. How can I do this?


Solution

  • In Unicode, you have code points. These range from U+0000 to U+10FFFF, so a code point needs up to 21 bits. Your character 𝐀, Mathematical Bold Capital A, has a code point of U+1D400.

    In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.

    In UTF-16, two code units that together encode a single code point are called a surrogate pair. Surrogate pairs are used to encode any code point that does not fit in 16 bits, i.e. U+10000 and up.

    This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a sequence of UTF-16 code units.

    So your code point 𝐀 (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:

    var highUnicodeChar = "𝐀";
    char a = highUnicodeChar[0]; // code unit 0xD835
    char b = highUnicodeChar[1]; // code unit 0xDC00
    

    This means that when you index into the string like that, you're getting only the first half (the high surrogate) of the surrogate pair.
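
    To make the relationship between the two code units and the code point concrete, here is a minimal sketch written as C# top-level statements (C# 9 or later is assumed). The variable names are illustrative only; char.ConvertToUtf32 and char.ConvertFromUtf32 are the framework methods that convert between a surrogate pair and a code point.

    ```csharp
    using System;

    // The string from the question, written as an escape so the source file encoding does not matter.
    string highUnicodeChar = "\U0001D400"; // U+1D400, Mathematical Bold Capital A

    // Two UTF-16 code units, one code point.
    int codeUnitCount = highUnicodeChar.Length;              // 2
    int codePoint = char.ConvertToUtf32(highUnicodeChar, 0); // 0x1D400

    // The same arithmetic by hand: subtract the surrogate bases,
    // combine the two 10-bit halves, then add 0x10000.
    char high = highUnicodeChar[0]; // 0xD835
    char low = highUnicodeChar[1];  // 0xDC00
    int manual = ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;

    // Going the other way produces the two-code-unit string again.
    string roundTrip = char.ConvertFromUtf32(codePoint);

    Console.WriteLine($"{codeUnitCount} code units, U+{codePoint:X}, manual U+{manual:X}");
    ```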

    You can use char.IsSurrogatePair to test whether a surrogate pair starts at a given index. For instance:

    string GetFullCodePointAtIndex(string s, int idx) =>
        s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
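
    Applied to the code in the question, the helper could be used roughly as follows. This is a sketch written as top-level statements (C# 9 or later assumed); SplitIntoCodePoints is a hypothetical helper added for illustration and is not part of .NET.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Same helper as above, repeated so the snippet is self-contained.
    string GetFullCodePointAtIndex(string s, int idx) =>
        s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);

    // Hypothetical helper: split a string into one substring per code point.
    List<string> SplitIntoCodePoints(string s)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; )
        {
            string part = GetFullCodePointAtIndex(s, i);
            parts.Add(part);
            i += part.Length; // advance by 1 or 2 code units
        }
        return parts;
    }

    string highUnicodeChar = "\U0001D400"; // the character from the question
    string result1 = highUnicodeChar;
    string result2 = GetFullCodePointAtIndex(highUnicodeChar, 0);

    List<string> pieces = SplitIntoCodePoints("a\U0001D400b");

    Console.WriteLine(result1 == result2);
    Console.WriteLine(pieces.Count);
    ```

    On .NET Core 3.0 and later, string.EnumerateRunes() and System.Text.Rune offer a built-in way to walk a string by code point, which may be preferable to a hand-written loop.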
    

    It is important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the "visible thing" that most people, when asked, would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character, and zero or more combining characters. An example of a combining character is an umlaut, or any of the various other decorations and modifiers you might want to add. See this answer for a horrifying example of what combining characters can do.
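
    As a rough illustration of the difference between code units and grapheme clusters, the sketch below uses System.Globalization.StringInfo, which .NET calls "text elements". It is written as top-level statements (C# 9 or later assumed), and the sample string is made up for this example. Exact segmentation of more complex sequences (emoji, for instance) can vary between .NET versions, so treat this as a demonstration of the idea rather than a complete solution.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    // "a", then U+1D400 (a surrogate pair), then "e" followed by
    // U+0301 COMBINING ACUTE ACCENT (two code points, one visible character).
    string s = "a\U0001D400e\u0301";

    int codeUnits = s.Length;                                  // 1 + 2 + 2 = 5
    int textElements = new StringInfo(s).LengthInTextElements; // 3

    // Walk the string one text element at a time.
    var elements = new List<string>();
    TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
    while (enumerator.MoveNext())
    {
        elements.Add(enumerator.GetTextElement());
    }

    Console.WriteLine($"{codeUnits} code units, {textElements} text elements");
    ```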

    To test for a combining character, you can use CharUnicodeInfo.GetUnicodeCategory (or char.GetUnicodeCategory) and check for UnicodeCategory.EnclosingMark, UnicodeCategory.NonSpacingMark, or UnicodeCategory.SpacingCombiningMark.
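
    A minimal sketch of such a check, written as top-level statements (C# 9 or later assumed). IsCombiningMark is a hypothetical helper name, not a framework method; it only inspects a single UTF-16 code unit, so combining marks outside the Basic Multilingual Plane would need the GetUnicodeCategory(string, int) overload instead.

    ```csharp
    using System;
    using System.Globalization;

    // Hypothetical helper: true if the code unit is one of the three "mark" categories.
    bool IsCombiningMark(char c)
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
        return category == UnicodeCategory.NonSpacingMark
            || category == UnicodeCategory.SpacingCombiningMark
            || category == UnicodeCategory.EnclosingMark;
    }

    Console.WriteLine(IsCombiningMark('\u0301')); // U+0301 COMBINING ACUTE ACCENT
    Console.WriteLine(IsCombiningMark('e'));      // plain base letter
    ```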