c# string reverse utf-16 surrogate-pairs

How to reverse a string that contains surrogate pairs


I have written this method to reverse a string:

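    // Requires: using System.Collections.Generic; using System.Globalization; using System.Linq;
    public string Reverse(string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;

        // Enumerate text elements so surrogate pairs and combining
        // sequences are handled as single units.
        TextElementEnumerator enumerator =
            StringInfo.GetTextElementEnumerator(s);

        var elements = new List<char>();
        while (enumerator.MoveNext())
        {
            var cs = enumerator.GetTextElement().ToCharArray();
            if (cs.Length > 1)
            {
                // Pre-reverse multi-char elements so that the final
                // Reverse() below restores their internal order.
                elements.AddRange(cs.Reverse());
            }
            else
            {
                elements.AddRange(cs);
            }
        }

        elements.Reverse();
        return string.Concat(elements);
    }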

Now, I don't want to start a discussion about how this code could be made more efficient or how there are one-liners that I could use instead. I'm aware that you can perform XORs and all sorts of other things to potentially improve this code. If I want to refactor the code later, I can do that easily because I have unit tests.

Currently, this correctly reverses BMP strings (including strings with accents like "Les Misérables") and strings that contain combining characters such as "Les Mise\u0301rables".

My test that contains a surrogate pair works if the pair is expressed like this

Assert.AreEqual("𠈓", _stringOperations.Reverse("𠈓"));

But if I express surrogate pairs like this

Assert.AreEqual("\u10000", _stringOperations.Reverse("\u10000"));

then the test fails. Is there an air-tight implementation that supports surrogate pairs as well?

If I have made any mistake above then please do point this out as I'm no Unicode expert.


Solution

  • \u10000 is a string of two characters: က (U+1000, i.e. hexadecimal code point 1000) followed by the digit 0. You can confirm this by inspecting the value of s in your method. Reversing two different characters produces a string that no longer matches the input, so the assertion fails.

    It seems you're after Unicode Character 'LINEAR B SYLLABLE B008 A' (U+10000) with hexadecimal code point 10000. From Unicode character escape sequences on MSDN:

    \u hex-digit hex-digit hex-digit hex-digit

    \U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

    So you'll have to use either four or eight digits.

    Use \U00010000 (notice the capital U) or \uD800\uDC00 instead of \u10000.
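
To make the difference between the escapes concrete, here is a minimal sketch. It is not the asker's method: the class name EscapeDemo and the helper ReverseTextElements are made up for illustration, and the helper assumes that reversing by text element (via StringInfo.ParseCombiningCharacters) is the behaviour you want.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper class, named for illustration only.
public static class EscapeDemo
{
    // Reverses a string one text element at a time, so a surrogate pair or a
    // base character plus combining marks keeps its internal order.
    public static string ReverseTextElements(string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;

        // Start index (in UTF-16 code units) of every text element in s.
        int[] starts = StringInfo.ParseCombiningCharacters(s);

        var sb = new StringBuilder(s.Length);
        for (int i = starts.Length - 1; i >= 0; i--)
        {
            // An element ends where the next one starts, or at the end of s.
            int end = (i + 1 < starts.Length) ? starts[i + 1] : s.Length;
            sb.Append(s, starts[i], end - starts[i]);
        }
        return sb.ToString();
    }
}
```

With this in place, "\u10000" has Length 2 and consists of '\u1000' followed by '0', so reversing it gives "0\u1000". "\U00010000" also has Length 2, but those two code units are the surrogate pair "\uD800\uDC00", which form a single text element and survive the reversal unchanged.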