c# .net-core unicode icu

ß != ss for case insensitive comparison with ICU


The following C# code returns false under .NET 6, which uses the ICU library for its string comparisons:

Thread.CurrentThread.CurrentCulture = new CultureInfo("de-de");
Thread.CurrentThread.CurrentUICulture = new CultureInfo("de-de");
"ß".Equals("SS", StringComparison.CurrentCultureIgnoreCase);  // false with ICU

From my understanding this should be true based on the Unicode case folding rules (and also - somewhat more importantly - standard German orthographic rules):

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S (CaseFolding.txt)

When using the legacy Microsoft NLS implementation, the above code returns true.

So why does the ICU library used by .NET 6 differ from the Unicode standard? Or is my understanding of the standard incorrect here?

Original C# question that led to this.


Solution

  • Unicode is complicated.

    It turns out the behavior is intentional. See this GitHub issue on the matter, where tarekgh summarizes the problem:

    ICU collation work using what it is called collation strength. Strength can be Primary, Secondary, Tertiary, or Quaternary. We are trying to map as much as we can the .NET comparison options to one of these strength. which work fine except in such special cases. Unfortunately, ICU make ß equals only to ss if having the ICU strength is primary. We cannot switch to that strength by default in .NET because is going to break many other things.

    As far as I can tell from the code, .NET uses tertiary collation strength by default and secondary strength for case-insensitive comparisons.

    The workaround is to use StringComparer.Create(CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase). CompareOptions.IgnoreNonSpace forces primary collation strength, under which ß and ss compare equal.

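    That switch probably has a few unintended side effects, since CompareOptions.IgnoreNonSpace also ignores diacritics (so, for example, a and ä are no longer distinguished), but at least the German speakers will be happy.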
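The workaround can be sketched as a small console program. This is a minimal illustration, not a definitive implementation: the class name GermanComparison and the helper CreateComparer are hypothetical names introduced here, and the results noted in the comments are what the answer above describes for ICU-backed .NET, not verified output. Under NLS or invariant globalization mode the results may differ.

```csharp
using System;
using System.Globalization;

public static class GermanComparison
{
    // Hypothetical helper (not part of .NET): builds a comparer with the
    // options from the workaround. Per the answer above, IgnoreNonSpace is
    // what moves ICU to primary collation strength.
    public static StringComparer CreateComparer(CultureInfo culture) =>
        StringComparer.Create(
            culture,
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace);

    public static void Main()
    {
        CultureInfo german = CultureInfo.GetCultureInfo("de-DE");

        // Case-insensitive only: the question reports that "ß" and "SS"
        // are NOT equal here when ICU is in use.
        bool caseInsensitiveOnly =
            german.CompareInfo.Compare("ß", "SS", CompareOptions.IgnoreCase) == 0;
        Console.WriteLine($"IgnoreCase only:             {caseInsensitiveOnly}");

        // Workaround: the answer states that "ß" and "SS" compare equal here.
        StringComparer comparer = CreateComparer(german);
        Console.WriteLine($"IgnoreCase | IgnoreNonSpace: {comparer.Equals("ß", "SS")}");

        // Possible side effect: diacritics are ignored as well, so these two
        // different words are likely to be reported as equal.
        Console.WriteLine($"\"Bar\" vs \"Bär\":              {comparer.Equals("Bar", "Bär")}");
    }
}
```

Passing the culture explicitly instead of relying on CultureInfo.CurrentCulture keeps the comparison independent of thread state. The resulting comparer can also be handed to collections such as Dictionary<string, T> or HashSet<string> where German-aware key matching is wanted.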