Search code examples
sqlsql-serverdistinct

Why does SELECT Distinct for a SQL column behave unexpectedly for the integer value of 0?


I ran this on SQL Server Express:

DECLARE @Logs table(Id INT, Num INT)
-- For VALUES (1,1),(2,x), (3,x),(4,x), version 1 and version 2 return the same value. EXCEPT IF x=0...Why????
INSERT INTO @Logs (Id, Num) VALUES (1,1),(2,0), (3,0),(4,0) --LINE 2

---------Version 1-------------
;WITH table2(Num,rn) as (
    SELECT Num, ROW_NUMBER() OVER(ORDER BY (SELECT 1)) as rn from @Logs
)
SELECT DISTINCT(t3.Num) from table2 as t3 JOIN ( --DISTINCT on this line is only difference
    SELECT t1.Num, t2.rn as last_rn, t1.rn as second_rn from
    table2 as t1 JOIN table2 as t2 ON t1.rn = t2.rn-1 AND t1.Num = t2.Num
) as t4 ON t4.second_rn - 1 = t3.rn AND t3.Num = t4.Num

-----------Version 2 ------------
;WITH table2(Num,rn) as (
    SELECT Num, ROW_NUMBER() OVER(ORDER BY (SELECT 1)) as rn from @Logs
)

SELECT t3.Num as ConsecutiveNums from table2 as t3 JOIN (
    SELECT t1.Num, t2.rn as last_rn, t1.rn as second_rn from
    table2 as t1 JOIN table2 as t2 ON t1.rn = t2.rn-1 AND t1.Num = t2.Num
) as t4 ON t4.second_rn - 1 = t3.rn AND t3.Num = t4.Num

Version 1 returns 1 single row with the value 0. Version 2 returns no rows. However, the only difference between the two is the keyword DISTINCT at the beginning of the SELECT statement.

However, when I replace the 0's in INSERT INTO @Logs (Line 2), to any number other than 0, Version 1 and Version 2 return the same single row.

Why??? I checked that a simple SELECT DISTINCT does not have this same bug on the number zero ( SELECT DISTINCT(n) from (VALUES (1),(2),(0),(3),(2)) as t(n) behaves as expected)


Solution

  • There are a couple of things going on here.

    First, SQL tables and result sets represent unordered sets. There is no ordering, unless there is an explicit ORDER BY.

    So, as Jeroen points out in a comment, the CTE has a non-deterministic ordering.

    Second, SQL Server does not materialize subqueries. It runs the subquery each time it is referenced. Now, combine this with the first point, and you realize that the CTE can return different results each time it is referenced in a query, depending on the vagaries of the optimizer and the underlying execution engine.

    So, that is what is happening. SQL Server chooses different execution plans for the CTE each time it is referenced, resulting in incompatible data -- and different results. I will note that this is true in SQL Server as well as SQL Server Express (see here).

    Changing the type of join to a left join does not fix the problem. It might happen to generate the intended execution plan, but that would be hiding the problem.

    The correct change is to fix the order by, presumably in this case by using order by id:

    WITH table2(Num,rn) as (
          SELECT Num, ROW_NUMBER() OVER (ORDER BY (SELECT id)) as rn 
          FROM @Logs
         )
    

    Note that these types of bugs caused by "ties" in ORDER BY (often but not always with window functions) can be very hard to find and debug. I would urge you to pay attention to ordering and always be thinking "is the key unique, should I be adding a primary key as the last order by key".