c# string unicode indexof surrogate-pairs

What is a Unicode safe replica of String.IndexOf(string input) that can handle Surrogate Pairs?


I am looking for an equivalent of C#'s string.IndexOf(string) that treats a surrogate pair as a single character.

I can get the index when searching for a single character, as in the code below:

    public static int UnicodeIndexOf(this string input, string find)
    {
        return input.ToTextElements().ToList().IndexOf(find);
    }

    public static IEnumerable<string> ToTextElements(this string input)
    {
        var e = StringInfo.GetTextElementEnumerator(input);
        while (e.MoveNext())
        {
            yield return e.GetTextElement();
        }
    }

But if find is longer than one character, this no longer works, because each text element holds only a single character to compare against.

Are there any suggestions as to how to go about writing this?

Thanks for any and all help.

EDIT:

Below is an example of why this is necessary:

CODE

    Console.WriteLine("HolyCow𪘁BUBBYY𪘁YY𪘁Y".IndexOf("BUBB"));
    Console.WriteLine("HolyCow@BUBBYY@YY@Y".IndexOf("BUBB"));

OUTPUT

    9
    8

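Notice that the index changes when I replace the 𪘁 character with @. The 𪘁 character is stored as a surrogate pair, so it takes up two char positions and pushes every later index up by one.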
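
To make the off-by-one concrete, here is a minimal sketch of how .NET counts that character. The class name SurrogatePairDemo is made up for illustration; the calls themselves (string.Length, char.IsSurrogatePair, StringInfo.LengthInTextElements) are standard.

```csharp
using System;
using System.Globalization;

// Hypothetical demo class, only for illustration.
public static class SurrogatePairDemo
{
    public static void Main()
    {
        string s = "𪘁";

        // Length counts UTF-16 code units, so a surrogate pair counts twice.
        Console.WriteLine(s.Length);                               // 2

        // The two code units at index 0 and 1 form one surrogate pair.
        Console.WriteLine(char.IsSurrogatePair(s, 0));             // True

        // Counted as text elements, the same string has a single element.
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1
    }
}
```

string.IndexOf reports positions in the first unit (UTF-16 code units), which is why "BUBB" is found at 9 rather than 8.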


Solution

  • You basically want to find the index of one string array inside another string array, where each array entry is one text element. We can adapt code from this question for that:

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Linq;

    public static class Extensions {
        // Returns the index of find within input, counted in text elements
        // (not UTF-16 code units), or -1 if find does not occur.
        public static int UnicodeIndexOf(this string input, string find, StringComparison comparison = StringComparison.CurrentCulture) {
            return IndexOf(
               // split input into text elements (a surrogate pair stays in one element)
               input.ToTextElements().ToArray(),
               // split searched value into text elements
               find.ToTextElements().ToArray(),
               comparison);
        }
        // code from another answer: naive search for one array inside another
        private static int IndexOf(string[] haystack, string[] needle, StringComparison comparison) {
            var len = needle.Length;
            var limit = haystack.Length - len;
            for (var i = 0; i <= limit; i++) {
                var k = 0;
                for (; k < len; k++) {
                    if (!String.Equals(needle[k], haystack[i + k], comparison)) break;
                }

                // every element of needle matched starting at position i
                if (k == len) return i;
            }

            return -1;
        }

        public static IEnumerable<string> ToTextElements(this string input) {
            var e = StringInfo.GetTextElementEnumerator(input);
            while (e.MoveNext()) {
                yield return e.GetTextElement();
            }
        }
    }
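
Note that the index returned this way counts text elements, so it cannot be passed straight to Substring or other char-based APIs. A minimal sketch of converting between the two kinds of index, assuming StringInfo.ParseCombiningCharacters (which returns the UTF-16 start offset of each text element) is acceptable; the class and helper names are hypothetical and not part of the code above:

```csharp
using System;
using System.Globalization;

// Hypothetical helpers, only for illustration.
public static class TextElementOffsets
{
    // Text-element index -> UTF-16 offset usable with Substring.
    public static int ToUtf16Offset(string input, int textElementIndex)
    {
        // starts[i] is the UTF-16 offset at which text element i begins.
        int[] starts = StringInfo.ParseCombiningCharacters(input);
        return starts[textElementIndex];
    }

    // UTF-16 offset -> text-element index,
    // or -1 if the offset falls in the middle of a text element.
    public static int ToTextElementIndex(string input, int utf16Offset)
    {
        int[] starts = StringInfo.ParseCombiningCharacters(input);
        return Array.IndexOf(starts, utf16Offset);
    }

    public static void Main()
    {
        string input = "HolyCow𪘁BUBBYY𪘁YY𪘁Y";

        Console.WriteLine(ToUtf16Offset(input, 8));      // 9
        Console.WriteLine(ToTextElementIndex(input, 9)); // 8
        Console.WriteLine(input.Substring(ToUtf16Offset(input, 8), 4)); // BUBB
    }
}
```

With the sample string from the question, text-element index 8 maps to UTF-16 offset 9, which matches the two values printed in the question.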