
How does C#'s CompareTo method compare strings?


I want to know how C#'s CompareTo method compares two strings, so I ran this test:

string str1 = "0";
string str2 = "-";
Console.WriteLine(str1.CompareTo(str2)); // output : 1
string str3 = "01";
string str4 = "-1";
Console.WriteLine(str3.CompareTo(str4)); // output : -1

Why are the results different?


Solution

  • TLDR: The default string comparison is a culture-sensitive word sort, which treats - characters specially.

    The answer to this is that the default string comparison uses culture-sensitive sorting rules rather than a plain comparison of character codes.

    This means that some symbols, for example -, are treated specially.

    The documentation for CompareOptions states:

    The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

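    In your case, the default ordering is being used: word sort. CompareTo performs a culture-sensitive comparison using the current culture, and the hyphen carries so little weight that "-1" sorts almost as if it were "1", which places "01" before it. A lone "-", however, still sorts before "0".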

    You can see the different results by specifying the kind of comparison you want in string.Compare():

    string str3 = "01";
    string str4 = "-1";
    Console.WriteLine(Math.Sign(string.Compare(str3, str4, StringComparison.InvariantCulture))); // output : -1
    Console.WriteLine(Math.Sign(string.Compare(str3, str4, StringComparison.Ordinal)));          // output : 1
    

    This shows that the - is treated specially only when the comparison is not Ordinal.
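
    The three sort modes named in the documentation can also be requested explicitly through CompareOptions. The following is a minimal sketch, assuming the invariant culture; the class name SortModesDemo and the helper CompareSign are made up for illustration. The word-sort and string-sort results depend on the runtime's globalization library, so only the ordinal results are predictable enough to state.

```csharp
using System;
using System.Globalization;

public static class SortModesDemo
{
    // Hypothetical helper: compares with the invariant culture and
    // normalizes the result to -1, 0, or 1.
    public static int CompareSign(string a, string b, CompareOptions options)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        return Math.Sign(compareInfo.Compare(a, b, options));
    }

    public static void Main()
    {
        string[][] pairs =
        {
            new[] { "0", "-" },
            new[] { "01", "-1" },
            new[] { "coop", "co-op" },
        };

        foreach (string[] pair in pairs)
        {
            Console.WriteLine($"\"{pair[0]}\" vs \"{pair[1]}\":");
            // CompareOptions.None selects the default word sort.
            Console.WriteLine($"  word sort:    {CompareSign(pair[0], pair[1], CompareOptions.None)}");
            // StringSort drops the special weights for symbols such as the hyphen.
            Console.WriteLine($"  string sort:  {CompareSign(pair[0], pair[1], CompareOptions.StringSort)}");
            // Ordinal compares the raw UTF-16 values and must be used on its own.
            Console.WriteLine($"  ordinal sort: {CompareSign(pair[0], pair[1], CompareOptions.Ordinal)}");
        }
    }
}
```

    Running this on your own machine is the most reliable way to see which of the three modes produces the ordering you expect.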

    It really is the - character itself that is being treated specially; the comparison is not assuming it is a minus sign. For example, if you use + instead of -, you get:

    string str1 = "0";
    string str2 = "+";
    Console.WriteLine(Math.Sign(string.Compare(str1, str2, StringComparison.InvariantCulture))); // output : 1
    Console.WriteLine(Math.Sign(string.Compare(str1, str2, StringComparison.Ordinal)));          // output : 1
    string str3 = "01";
    string str4 = "+1";
    Console.WriteLine(Math.Sign(string.Compare(str3, str4, StringComparison.InvariantCulture))); // output : 1
    Console.WriteLine(Math.Sign(string.Compare(str3, str4, StringComparison.Ordinal)));          // output : 1
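
    The Ordinal results above are all 1 because an ordinal comparison only looks at the numeric character values, and '0' has a higher value than both '+' and '-'. A small sketch that prints those values (the class name OrdinalDemo and the helper CodePoint are made up for illustration):

```csharp
using System;

public static class OrdinalDemo
{
    // Hypothetical helper: formats a character's UTF-16 value as U+XXXX.
    public static string CodePoint(char c) => $"U+{(int)c:X4}";

    public static void Main()
    {
        foreach (char c in new[] { '+', '-', '0' })
        {
            Console.WriteLine($"'{c}' = {CodePoint(c)}");
        }
        // '+' = U+002B
        // '-' = U+002D
        // '0' = U+0030

        // '0' (0x30) is greater than '-' (0x2D) and '+' (0x2B),
        // so every ordinal comparison below is positive.
        Console.WriteLine(Math.Sign(string.CompareOrdinal("0", "-")));   // output : 1
        Console.WriteLine(Math.Sign(string.CompareOrdinal("01", "-1"))); // output : 1
        Console.WriteLine(Math.Sign(string.CompareOrdinal("01", "+1"))); // output : 1
    }
}
```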
    

    ASIDE

    Do not confuse a normal hyphen with a soft hyphen!

    • A normal hyphen has the Unicode value \u002D.
    • A soft hyphen has the Unicode value \u00AD.

    Note that the documentation for string.Compare() includes sample code that shows a soft hyphen being ignored. It states:

    Character sets include ignorable characters. The Compare(String, String, Boolean) method does not consider such characters when it performs a culture-sensitive comparison.

    A soft hyphen is one of the ignorable characters, but it is important to note that a soft hyphen is NOT THE SAME AS a normal hyphen. So this documentation DOES NOT APPLY to your sample code.
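
    To make the distinction concrete, here is a minimal sketch that builds "coop" with a soft hyphen and with a normal hyphen and compares each against the plain word. The class name HyphenDemo and its members are made up for illustration. Going by the documentation quoted above, the culture-sensitive comparison with the soft hyphen should report the strings as equal, while the normal hyphen is not ignored; the exact sign in the last line may vary by platform, so no output is given for it.

```csharp
using System;

public static class HyphenDemo
{
    public const char Hyphen = '\u002D';     // normal hyphen-minus
    public const char SoftHyphen = '\u00AD'; // soft hyphen (ignorable)

    // Hypothetical helper: inserts the given character into "coop" after "co".
    public static string Hyphenate(char separator) => "co" + separator + "op";

    public static void Main()
    {
        string plain = "coop";
        string withSoftHyphen = Hyphenate(SoftHyphen);
        string withHyphen = Hyphenate(Hyphen);

        // Culture-sensitive: the soft hyphen is documented as ignorable,
        // so this is expected to print 0.
        Console.WriteLine(string.Compare(plain, withSoftHyphen, StringComparison.InvariantCulture));

        // Ordinal: nothing is ignored; 'o' (0x6F) is less than U+00AD.
        Console.WriteLine(Math.Sign(string.Compare(plain, withSoftHyphen, StringComparison.Ordinal))); // output : -1

        // Culture-sensitive with a normal hyphen: the hyphen is weighted, not ignored.
        Console.WriteLine(Math.Sign(string.Compare(plain, withHyphen, StringComparison.InvariantCulture)));
    }
}
```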

    The actual reason for the normal hyphen behaving differently is given above.

    (If you want a complete list of all ignorable characters in Unicode, go to http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt and search for Default_Ignorable_Code_Point - and note that this list does not in fact include the normal hyphen.)