Search code examples
sql-servert-sqlcommon-table-expressionwindow-functionsrecursive-query

Unexpected data at typical recursion


It's hard for me to use words to describe this, so here's the sample:

select *
into t
from (values (10, 'A'),
             (25, 'B'),
             (30, 'C'),
             (45, 'D'),
             (52, 'E'),
             (61, 'F'),
             (61, 'G'),
             (61, 'H'),
             (79, 'I'),
             (82, 'J')
) v(userid, name)

Notice how F,G and H have the same userid.

Now, consider the following recursive query:

with tn as 
(
    select t.userId,t.name, row_number() over (order by userid,newid()) as seqnum
    from t
),
cte as 
(
        select userId, name,  seqnum as seqnum
        from tn 
        where seqnum = 1
    union all
        select tn.userId, tn.name,tn.seqnum
        from 
            cte 
            inner join tn on tn.seqnum = cte.seqnum + 1
)
select *
from cte

The first cte,tn, creates a row_number which includes a randomization component that will cause F/G/H to appear in random order. However, they will all still appear once each(easily checkable by modifying the last and outermost from to be from tn)

The second cte, cte, recursively scans tn. There's nothing too complicated in it because it comes from a minimized example. However, it is evident that the anchor member is manually set to be tn's first line, and then the recursion scans all the rest lines.

However, the final result-set does not have F/G/H appear once each! They all appear in 3 lines total, but in any combination. FGH is possible, but so is FFH, and even FFF! Here's an FHH example:

+--------+------+--------+
| userId | name | seqnum |
+--------+------+--------+
|     10 | A    |      1 |
|     25 | B    |      2 |
|     30 | C    |      3 |
|     45 | D    |      4 |
|     52 | E    |      5 |
|     61 | F    |      6 |
|     61 | H    |      7 |
|     61 | H    |      8 |
|     79 | I    |      9 |
|     82 | J    |     10 |
+--------+------+--------+

Why is that?

I don't think analytic function behaviour within ctes has anything to do with it, because tn which includes the row_number is not recursive.

For the record, I got this question due to the following sequence of events: Someone asked a question that was answered by an excellent contributor. The same OP asked a follow-up question, which I though I could answer with a bit of tampering on the original. However, after digging a bit, I found a behaviour I cannot understand. I made the example minimal as much as i could.


Solution

  • CTE's aren't spooled. Every reference to tn may result in re-running the query and re-randomizing the results.

    To avoid this, run the randomizing query once and load a temp table. eg

    select t.userId,t.name, row_number() over (order by userid,newid()) as seqnum
    into #tn
    from t
    

    and reference that in your subsequent query

    with tn as 
    (
        select * from #tn
    ),
    cte as 
    (
            select userId, name,  seqnum as seqnum
            from tn
            where seqnum = 1
        union all
            select tn.userId, tn.name,tn.seqnum
            from 
                cte 
                inner join tn on tn.seqnum = cte.seqnum + 1
    )
    select *
    from cte