I have written this method to reverse a string:
public string Reverse(string s)
{
    if (string.IsNullOrEmpty(s))
        return s;

    TextElementEnumerator enumerator =
        StringInfo.GetTextElementEnumerator(s);
    var elements = new List<char>();
    while (enumerator.MoveNext())
    {
        var cs = enumerator.GetTextElement().ToCharArray();
        if (cs.Length > 1)
        {
            elements.AddRange(cs.Reverse());
        }
        else
        {
            elements.AddRange(cs);
        }
    }
    elements.Reverse();
    return string.Concat(elements);
}
Now, I don't want to start a discussion about how this code could be made more efficient or how there are one-liners that I could use instead. I'm aware that you can perform XORs and all sorts of other things to potentially improve this code. If I want to refactor the code later, I can do that easily because I have unit tests.
Currently, this correctly reverses BMP strings (including strings with accents like "Les Misérables") and strings that contain combining characters such as "Les Mise\u0301rables".
My test that contains surrogate pairs works if they are expressed like this:
Assert.AreEqual("𠈓", _stringOperations.Reverse("𠈓"));
But if I express surrogate pairs like this:
Assert.AreEqual("\u10000", _stringOperations.Reverse("\u10000"));
then the test fails. Is there an air-tight implementation that supports surrogate pairs as well?
If I have made any mistake above then please do point this out as I'm no Unicode expert.
\u10000 is a string of two characters: က (Unicode code point U+1000) followed by a 0 (which you can confirm by inspecting the value of s in your method). Reversing those two characters gives a different string, so the result no longer matches the input.
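As a quick way to check this yourself, here is a minimal sketch that prints the UTF-16 code units of the literal. The class name EscapeDemo and the helper Describe are made up for illustration; they are not part of the code in the question.

```csharp
using System;
using System.Linq;

static class EscapeDemo
{
    // Hypothetical helper: lists each UTF-16 code unit of s as U+XXXX.
    public static string Describe(string s) =>
        string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));

    static void Main()
    {
        // \u consumes exactly four hex digits, so the trailing 0 is a plain digit.
        Console.WriteLine("\u10000".Length);      // 2
        Console.WriteLine(Describe("\u10000"));   // U+1000 U+0030
    }
}
```

Neither code unit is a surrogate, so this literal never exercises the surrogate-pair path at all.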
It seems you're after Unicode Character 'LINEAR B SYLLABLE B008 A' (U+10000) with hexadecimal code point 10000. From Unicode character escape sequences on MSDN:
\u hex-digit hex-digit hex-digit hex-digit
\U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit
So you'll have to use either exactly four digits (with \u) or exactly eight digits (with \U).
Use \U00010000 (notice the capital U) or \uD800\uDC00 instead of \u10000.
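To illustrate, the sketch below checks that both spellings produce the same string and that a reversal by text elements leaves it unchanged. ReverseTextElements is a hypothetical, simplified variant written for this example (it keeps each text element as a string rather than reversing chars twice); it is not your method, though I would expect yours to behave the same way on this input.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SurrogateDemo
{
    // Hypothetical helper: reverses the order of text elements, keeping each one intact.
    public static string ReverseTextElements(string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;

        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            elements.Add(e.GetTextElement());

        elements.Reverse();
        return string.Concat(elements);
    }

    static void Main()
    {
        string upper = "\U00010000";   // eight-digit escape
        string pair = "\uD800\uDC00";  // explicit surrogate pair

        Console.WriteLine(upper == pair);                                // True
        Console.WriteLine(upper.Length);                                 // 2
        Console.WriteLine(char.ConvertToUtf32(upper, 0).ToString("X"));  // 10000
        Console.WriteLine(ReverseTextElements(upper) == upper);          // True
    }
}
```

With either corrected literal, the string is a single text element made of two code units, so the assertion in your test should compare like with like.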