c# .net unicode substring

Length of substring matched by culture-sensitive String.IndexOf method


I tried writing a culture-aware string replacement method:

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + oldValue.Length)
        : text;
}

However, it chokes on Unicode combining characters:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. INCORRECT: do
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. INCORRECT: dóf

To fix my code, I need to know that in the second example, String.IndexOf matched only one character (é) even though it searched for two (e\u0301). Similarly, I need to know that in the third example, String.IndexOf matched two characters (e\u0301) even though it only searched for one (é).

How can I determine the actual length of the substring matched by String.IndexOf?

NOTE: Performing Unicode normalization on text and oldValue (as suggested by James Keesey) would accommodate combining characters, but ligatures would still be a problem:

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. INCORRECT: i
Console.WriteLine(Replace("oef", "œ", "i")); // 6. INCORRECT: ief
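To illustrate the note above, here is a minimal sketch of why normalization covers combining characters but not ligatures. The class and method names (NormalizationDemo, EqualAfterNormalizing) are hypothetical and exist only for this example; the only library call is String.Normalize.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: true when both strings are identical
    // after being normalized to the given form.
    public static bool EqualAfterNormalizing(string a, string b, NormalizationForm form)
    {
        return string.Equals(a.Normalize(form), b.Normalize(form), StringComparison.Ordinal);
    }

    public static void Main()
    {
        // "e" + Combining Acute Accent is canonically equivalent to the
        // precomposed U+00E9, so Form C maps both to the same string.
        Console.WriteLine(EqualAfterNormalizing("e\u0301", "\u00E9", NormalizationForm.FormC)); // True

        // U+0153 (œ) has no Unicode decomposition, so even the compatibility
        // form KD leaves it unchanged and it never becomes "oe".
        Console.WriteLine(EqualAfterNormalizing("\u0153", "oe", NormalizationForm.FormKD)); // False
    }
}
```

So normalizing both strings before the search would line up the lengths in examples 2 and 3, but not in examples 5 and 6.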

Solution

  • You will need to call FindNLSString or FindNLSStringEx directly. String.IndexOf uses FindNLSStringEx internally, but it discards the length of the match; FindNLSString returns that length through its last (out) parameter, which is all the information you need.

    Here is an example of how to rewrite your Replace method so that it works against your test cases. Note that I am using the current user locale; read the API documentation if you want to use the system locale or provide your own. I am also passing 0 for the flags, which means it will use the default string comparison options for the locale; again, the documentation can help you provide different options.

    // Requires: using System.Runtime.InteropServices; (Windows only)
    public const int LOCALE_USER_DEFAULT = 0x0400;

    // Returns the zero-based index of the match, or -1 if nothing is found;
    // 'found' receives the length of the matched substring in the source.
    [DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
    internal static extern int FindNLSString(
        int locale, uint flags,
        [MarshalAs(UnmanagedType.LPWStr)] string sourceString, int sourceCount,
        [MarshalAs(UnmanagedType.LPWStr)] string findString, int findCount,
        out int found);

    public static string ReplaceWithCombiningCharSupport(string text, string oldValue, string newValue)
    {
        int foundLength;
        int index = FindNLSString(LOCALE_USER_DEFAULT, 0, text, text.Length, oldValue, oldValue.Length, out foundLength);
        // Cut out foundLength characters (the actual match), not oldValue.Length.
        return index >= 0 ? text.Substring(0, index) + newValue + text.Substring(index + foundLength) : text;
    }
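If P/Invoke is not an option, a managed alternative may be available: .NET 5 and later add a CompareInfo.IndexOf overload that reports the match length through an out parameter. The sketch below assumes such a runtime; the class and method names (CultureAwareReplace, ReplaceFirst) are hypothetical and only illustrate the idea.

```csharp
using System;
using System.Globalization;

public static class CultureAwareReplace
{
    // Hypothetical helper: replaces the first culture-sensitive match of
    // oldValue in text, removing exactly the characters that were matched.
    public static string ReplaceFirst(string text, string oldValue, string newValue, CultureInfo culture)
    {
        CompareInfo compareInfo = culture.CompareInfo;

        // matchLength is the length of the matched span in 'text', which can
        // differ from oldValue.Length for linguistically equivalent strings.
        int index = compareInfo.IndexOf(
            text.AsSpan(), oldValue.AsSpan(), CompareOptions.None, out int matchLength);

        return index >= 0
            ? text.Substring(0, index) + newValue + text.Substring(index + matchLength)
            : text;
    }
}
```

A call such as CultureAwareReplace.ReplaceFirst("de\u0301f", "é", "o", CultureInfo.CurrentCulture) would then use the reported match length instead of oldValue.Length. Which strings count as equal (in particular ligatures such as œ versus oe) depends on the collation data the runtime uses, NLS or ICU, so the six test cases from the question may not all behave the same way as with FindNLSString.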