c# unicode utf-16 utf-32

C#: read the first char of a string when that char's Unicode code point is > 65535


I have a C# method that needs to retrieve the first character of a string and check whether it exists in a HashSet that contains specific Unicode characters (all the right-to-left characters).

So I'm doing

var c = str[0];

and then checking the hashset.

The problem is that this code doesn't work for strings where the first char's code point is larger than 65535.

I actually wrote a loop that goes through all numbers from 0 to 70,000 (the highest RTL code point is around 68,000, so I rounded up). For each number, I create a byte array from it and use

Encoding.UTF32.GetString(intValue);

to create a string with this character. I then pass it to the method that searches in the HashSet, and that method fails, because when it gets

str[0]

that value is never what it should be.

What am I doing wrong?
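
For reference, the behaviour described above can be reproduced with a minimal sketch like the one below. It assumes a little-endian machine (so that `BitConverter.GetBytes` produces the byte order `Encoding.UTF32` expects), and the names `SurrogateDemo` and `FromCodePoint` are hypothetical, not taken from the original code. It uses U+10800, a right-to-left code point above 65535, to show that the resulting string holds two `char` values and that `str[0]` is only the first half of a surrogate pair.

```csharp
using System;
using System.Text;

static class SurrogateDemo
{
    // Builds a string from a single code point the same way the question does:
    // number -> 4 bytes -> UTF-32 decode. Assumes a little-endian platform.
    public static string FromCodePoint(int codePoint)
    {
        return Encoding.UTF32.GetString(BitConverter.GetBytes(codePoint));
    }

    static void Main()
    {
        // U+10800 (CYPRIOT SYLLABLE A) is a right-to-left code point above 0xFFFF.
        string str = FromCodePoint(0x10800);

        Console.WriteLine(str.Length);                                // 2
        Console.WriteLine(((int)str[0]).ToString("X4"));              // D802 (high surrogate)
        Console.WriteLine(((int)str[1]).ToString("X4"));              // DC00 (low surrogate)
        Console.WriteLine(char.ConvertToUtf32(str, 0).ToString("X")); // 10800
    }
}
```

So `str[0]` can never equal a value above 65535: a `char` is a 16-bit UTF-16 code unit, and code points beyond that range are stored as two of them.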


Solution

  • To anyone who sees this question in the future and is interested in the solution I ended up with - this is my method which decides if a string should be displayed RTL or LTR based on the first character in the string. It takes UTF-16 Surrogate Pairs into account.

    Thanks to Tom Blodget who pointed me in the right direction.

    if (string.IsNullOrEmpty(str)) return null;
    
    var firstChar = str[0];
    if (firstChar >= 0xd800 && firstChar <= 0xdbff)
    {
        // if the first character is in the range 0xD800-0xDBFF, it is a high
        // surrogate: the beginning of a UTF-16 surrogate pair. there MUST be one
        // more char after this one, in the range 0xDC00-0xDFFF (a low surrogate).
        // for the very unlikely case that this is a corrupt UTF-16 string and
        // the second char is missing or is not a low surrogate, validate it
        if (str.Length == 1 || str[1] < 0xdc00 || str[1] > 0xdfff)
            return FlowDirection.LeftToRight;
    
        // convert surrogate pair to a 32 bit number, and check the codepoint table
        var highSurrogate = firstChar - 0xd800;
        var lowSurrogate = str[1] - 0xdc00;
        var codepoint = (highSurrogate << 10) + (lowSurrogate) + 0x10000;
    
        return _codePoints.Contains(codepoint)
            ? FlowDirection.RightToLeft
            : FlowDirection.LeftToRight;
    }
    return _codePoints.Contains(firstChar)
        ? FlowDirection.RightToLeft
        : FlowDirection.LeftToRight;
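
As a possible alternative to doing the surrogate arithmetic by hand, the same check can probably be written with the framework helpers `char.IsHighSurrogate`, `char.IsLowSurrogate` and `char.ConvertToUtf32`. The sketch below rests on several assumptions: `_codePoints` is a `HashSet<int>`, the method returns a nullable `FlowDirection`, and a stand-in `FlowDirection` enum replaces the WPF one so the snippet compiles on its own. The names `RtlDetector` and `GetFlowDirection` and the three-entry table are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for System.Windows.FlowDirection so the sketch compiles without WPF.
enum FlowDirection { LeftToRight, RightToLeft }

static class RtlDetector
{
    // Hypothetical sample table; the real set would hold every RTL code point.
    static readonly HashSet<int> _codePoints = new HashSet<int> { 0x05D0, 0x0627, 0x10800 };

    public static FlowDirection? GetFlowDirection(string str)
    {
        if (string.IsNullOrEmpty(str)) return null;

        int codePoint;
        if (char.IsHighSurrogate(str[0]) && str.Length > 1 && char.IsLowSurrogate(str[1]))
        {
            // Well-formed surrogate pair: let the framework combine the two halves.
            codePoint = char.ConvertToUtf32(str, 0);
        }
        else
        {
            // BMP character, or an unpaired surrogate (which is never in the table).
            codePoint = str[0];
        }

        return _codePoints.Contains(codePoint)
            ? FlowDirection.RightToLeft
            : FlowDirection.LeftToRight;
    }
}
```

With the sample table above, `RtlDetector.GetFlowDirection("\uD802\uDC00")` (the UTF-16 encoding of U+10800) returns `FlowDirection.RightToLeft`, `GetFlowDirection("abc")` returns `FlowDirection.LeftToRight`, and an empty string returns `null`.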