I have a C# method that needs to retrieve the first character of a string and check whether it exists in a HashSet containing specific Unicode characters (all the right-to-left characters).
So I'm doing
var c = str[0];
and then checking the HashSet.
The problem is that this code doesn't work for strings where the first char's code point is larger than 65535.
To test this, I wrote a loop that goes through all numbers from 0 to 70,000 (the highest RTL code point is around 68,000, so I rounded up). For each number I create a byte array and use
Encoding.UTF32.GetString(byteArray);
to build a one-character string. I then pass that string to the method that searches the HashSet, and the method fails, because the value it gets from
str[0]
is never what it should be.
What am I doing wrong?
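The symptom can be reproduced in a couple of lines (a minimal sketch; U+10800, a Cypriot Syllabary letter, stands in for any RTL code point above 0xFFFF):

```csharp
using System;

// Build a one-code-point string for U+10800, an RTL character above 0xFFFF.
string s = char.ConvertFromUtf32(0x10800);

// The string holds TWO UTF-16 code units (a surrogate pair), not one char.
Console.WriteLine(s.Length);       // 2

// s[0] is the high surrogate 0xD802, not the code point 0x10800, so it can
// never match an entry in a HashSet keyed by full code points.
Console.WriteLine((int)s[0]);      // 55298 (0xD802)
```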
For anyone who finds this question later and wants the solution I ended up with: this is the method that decides whether a string should be displayed RTL or LTR based on its first character. It takes UTF-16 surrogate pairs into account.
Thanks to Tom Blodget who pointed me in the right direction.
if (string.IsNullOrEmpty(str)) return null;

var firstChar = str[0];
if (firstChar >= 0xd800 && firstChar <= 0xdfff)
{
    // If the first character is in the surrogate range 0xD800-0xDFFF, a
    // well-formed string starts with a high surrogate (0xD800-0xDBFF) and
    // MUST contain one more char after it, a low surrogate in 0xDC00-0xDFFF.
    // For the very unlikely case that this is a corrupt UTF-16 string and
    // there is no second character, validate the string length first.
    if (str.Length == 1) return FlowDirection.LeftToRight;

    // Combine the surrogate pair into a 32-bit code point and check the table.
    var highSurrogate = firstChar - 0xd800;
    var lowSurrogate = str[1] - 0xdc00;
    var codepoint = (highSurrogate << 10) + lowSurrogate + 0x10000;
    return _codePoints.Contains(codepoint)
        ? FlowDirection.RightToLeft
        : FlowDirection.LeftToRight;
}

return _codePoints.Contains(firstChar)
    ? FlowDirection.RightToLeft
    : FlowDirection.LeftToRight;
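The same check can lean on the BCL instead of manual surrogate arithmetic: char.ConvertToUtf32 performs the 0xD800/0xDC00 math, and char.IsSurrogatePair covers the corrupt-string guard. A self-contained sketch, where the FlowDirection enum stands in for System.Windows.FlowDirection and _codePoints is a small hypothetical sample rather than the full RTL table:

```csharp
using System;
using System.Collections.Generic;

// Demo: both a BMP Hebrew letter and an astral Cypriot letter report RTL.
Console.WriteLine(Rtl.GetFlowDirection("\u05D0 example"));                // RightToLeft
Console.WriteLine(Rtl.GetFlowDirection(char.ConvertFromUtf32(0x10800)));  // RightToLeft
Console.WriteLine(Rtl.GetFlowDirection("abc"));                           // LeftToRight

// Stand-in for System.Windows.FlowDirection so the sketch is self-contained.
enum FlowDirection { LeftToRight, RightToLeft }

static class Rtl
{
    // Hypothetical sample set; the real code would hold every RTL code point.
    static readonly HashSet<int> _codePoints = new HashSet<int> { 0x05D0, 0x10800 };

    public static FlowDirection? GetFlowDirection(string str)
    {
        if (string.IsNullOrEmpty(str)) return null;

        // IsSurrogatePair returns false for a lone or truncated high
        // surrogate, covering the corrupt-string case handled manually above.
        int codepoint = char.IsSurrogatePair(str, 0)
            ? char.ConvertToUtf32(str, 0)  // combines the pair into a code point
            : str[0];

        return _codePoints.Contains(codepoint)
            ? FlowDirection.RightToLeft
            : FlowDirection.LeftToRight;
    }
}
```

Note that calling char.ConvertToUtf32(str, 0) directly on an unpaired surrogate would throw an ArgumentException, which is why the sketch guards with char.IsSurrogatePair first.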