There are many ways to find the first index at which two strings differ, but once I require the comparison to be case-insensitive under an arbitrary culture, most of those options disappear.
This is the only way I can think of to do such a comparison:
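static int FirstDiff(string str1, string str2)
{
    int minLength = Math.Min(str1.Length, str2.Length);
    // Compare ever-longer prefixes; the first mismatching prefix ends at the differing index.
    for (int i = 1; i <= minLength; i++)
        if (!string.Equals(str1.Substring(0, i), str2.Substring(0, i), StringComparison.CurrentCultureIgnoreCase))
            return i - 1;
    // If one string is a prefix of the other, they differ where the shorter one ends.
    return str1.Length == str2.Length ? -1 : minLength;
}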
Can anyone think of a better way that doesn't involve so much string allocation?
For testing purposes:
// Turkish word 'open' contains the letter 'ı' which is the lowercase of 'I' in Turkish, but not English
string lowerCase = "açık";
string upperCase = "AÇIK";
Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
FirstDiff(lowerCase, upperCase); // Should return 2
Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
FirstDiff(lowerCase, upperCase); // Should return -1
Edit: Checking both ToUpper and ToLower for each character appears to work for any example that I can come up with. But will it work for all cultures? Perhaps this is a question better directed at linguists.
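For reference, here is a minimal sketch of what that per-character check might look like. The names `PerCharDiff` and `FirstDiffPerChar` are made up for illustration, and treating two characters as equal only when both their upper-case and lower-case forms match is one reading of "checking both", not a confirmed rule. It works on individual `char` values, so it knows nothing about surrogate pairs or case mappings that change a string's length.

```csharp
using System;
using System.Globalization;

static class PerCharDiff
{
    // Hypothetical helper: compares char by char with the culture's TextInfo,
    // avoiding the per-iteration Substring allocations.
    public static int FirstDiffPerChar(string str1, string str2, CultureInfo culture)
    {
        TextInfo ti = culture.TextInfo;
        int minLength = Math.Min(str1.Length, str2.Length);
        for (int i = 0; i < minLength; i++)
        {
            char a = str1[i], b = str2[i];
            if (a == b)
                continue;
            // Assumption: the chars match only if BOTH case mappings agree.
            bool sameUpper = ti.ToUpper(a) == ti.ToUpper(b);
            bool sameLower = ti.ToLower(a) == ti.ToLower(b);
            if (!(sameUpper && sameLower))
                return i;
        }
        // If one string is a prefix of the other, they differ where the shorter one ends.
        return str1.Length == str2.Length ? -1 : minLength;
    }
}
```

Calling it with `CultureInfo.CurrentCulture` would mirror the `CurrentCultureIgnoreCase` behaviour of the original, though whether the two always agree is exactly the open question.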
I am reminded of one additional oddity of characters (or rather, of UTF-16 code units): some of them only form a character as one half of a surrogate pair, and they have no meaning in any culture unless the two halves appear next to each other. For more information about Unicode interpretation standards, see the document that @orhtej2 linked in his comment.
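To make the surrogate-pair issue concrete, here is a small sketch using only `System.Char` methods. The class and method names are hypothetical. It shows that one code point outside the Basic Multilingual Plane occupies two `char` positions in a .NET string, so indexing by `char` can land in the middle of a character.

```csharp
using System;

static class SurrogateDemo
{
    // 'a', U+1F600 (a code point outside the BMP), 'b'
    public const string Sample = "a\U0001F600b";

    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CodeUnits(string s) => s.Length;

    // True if the chars at index and index + 1 together form a surrogate pair.
    public static bool IsPairAt(string s, int index) => char.IsSurrogatePair(s, index);

    // The full code point starting at the given index.
    public static int CodePointAt(string s, int index) => char.ConvertToUtf32(s, index);
}
```

For `Sample`, `CodeUnits` returns 4 even though the string has three visible characters, `IsPairAt(Sample, 1)` is true, and `CodePointAt(Sample, 1)` is 0x1F600.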
While trying out different solutions I stumbled upon this particular class, and I think it will best suit my needs: System.Globalization.StringInfo
(The MS Doc Example shows it in action with surrogate pairs)
The class breaks the string down into text elements, i.e. pieces whose chars need each other to make sense (rather than splitting strictly by char). I can then compare each pair of pieces under the current culture using string.Equals and return the index at which the first differing piece starts:
static int FirstDiff(string str1, string str2)
{
    var si1 = StringInfo.GetTextElementEnumerator(str1);
    var si2 = StringInfo.GetTextElementEnumerator(str2);
    bool more1, more2;
    // Single & so that both enumerators always advance (no short-circuiting).
    while ((more1 = si1.MoveNext()) & (more2 = si2.MoveNext()))
        if (!string.Equals(si1.GetTextElement(), si2.GetTextElement(), StringComparison.CurrentCultureIgnoreCase))
            return si1.ElementIndex; // char index in str1 where the differing element starts
    if (more1)
        return si1.ElementIndex; // str2 ran out first
    if (more2)
        return str1.Length; // str1 ran out first, so its enumerator is already past the end
    return -1; // strings are equivalent
}
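To see what `ElementIndex` actually reports, the following sketch lists each text element together with its starting `char` index. `TextElementDemo` and `Elements` are hypothetical names; the enumerator calls are the same ones used above. The point is that the returned value is an offset into the string, not the ordinal number of the piece.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextElementDemo
{
    // Hypothetical helper: returns (start index, text element) for every element in s.
    public static List<(int Index, string Element)> Elements(string s)
    {
        var result = new List<(int Index, string Element)>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            result.Add((e.ElementIndex, e.GetTextElement()));
        return result;
    }
}
```

For `"a\U0001F600b"` this yields three elements starting at indices 0, 1 and 3, because the surrogate pair occupies two `char` positions. For `"e\u0301x"` (an `e` followed by a combining acute accent, then `x`) it yields two elements starting at 0 and 2.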