Search code examples
sql-servert-sqlsql-order-byvarchar

T-SQL ORDER BY ignores " '-' + ... " but not " '+' + ... "


So i recently encountered a wierd bug when comparing two values.

My values was a range from -1 to 2. Sometimes it thought that -1 was bigger than 0, the solution was easy. Apparently was the column set to varchar(50) instead of int.

But this made me think why this happened. Because even if the column was set to varchar(50) the '-' should have a lower char value than '0' (charvalue for '-' is 45 and charvalue for '0' should be 48)

I made some tests and it turns out, what i can find, that '-' is the only character that ORDER BY doesn't care about.

Example:

SELECT
    A.x
FROM
    (
        VALUES
            ('-5'), ('-4'), ('-3'), ('-2'), ('-1'),
            ('0'), ('1'), ('2'), ('3'), ('4'), ('5')
    ) A(x)
ORDER BY
    A.x;

SELECT
    B.x
FROM
    (
        VALUES
            ('+5'), ('+4'), ('+3'), ('+2'), ('+1'),
            ('0'), ('1'), ('2'), ('3'), ('4'), ('5')
    ) B(x)
ORDER BY
    B.x

Result:

Result of A
0
1
-1
2
-2
3
-3
4
-4
5
-5

Result of B
+1
+2
+3
+4
+5
0
1
2
3
4
5

(+ has a charvalue of 43)

The '+' order by feels right but the '-' seems... wrong
Anyone knows why it is like this?

Additional info

Server version: 12.0.4213
Collation: Finnish_Swedish_CI_AS

No clue what else could skew the result. Ask if you need more information.


Solution

  • Found out why.

    TLDR: Non-unicode and unicode collation sorts '-' differently.

    "A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules. Because the comparison rules for non-Unicode and Unicode data are different, when you use a SQL collation you might see different results for comparisons of the same characters, depending on the underlying data type. For example, if you are using the SQL collation "SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less than the string 'ab' because the hyphen ("-") is sorted as a separate character that comes before "b". However, if you convert these strings to Unicode and you perform the same comparison, the Unicode string N'a-c' is considered to be greater than N'ab' because the Unicode sorting rules use a "word sort" that ignores the hyphen."

    Source: https://support.microsoft.com/en-us/kb/322112