sql sql-server sql-server-2016 estimation

Selectivity estimation error on a simple query

Let us have a simple table tt created like this

WITH x AS (SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)), t1 AS
(
  SELECT ones.n + 10 * tens.n + 100 * hundreds.n + 1000 * thousands.n + 10000 * tenthousands.n as id  
  FROM x ones,     x tens,      x hundreds,       x thousands,       x tenthousands,       x hundredthousands
)
SELECT  id,
        id % 100 groupby,
        row_number() over (partition by id % 100 order by id) orderby,
        row_number() over (partition by id % 100 order by id) / (id % 100 + 1) local_search
INTO tt
FROM t1

I have a simple query Q1:

select distinct g1.groupby,
        (select count(*) from tt g2 
         where local_search = 1 and g1.groupby = g2.groupby) as orderby
from tt g1
option(maxdop 1)

I would like to know why the SQL Server estimates the result size so badly for Q1 (see the printscreen). Most of the operators in the query plan are estimated precisely, however, at the the root Hash Match operator introduce completely insane guess.

To make it more interesting I have tried different rewritings of the Q1. If I apply decorrelation of the subquery I obtain an equivalent query Q2:

select main.groupby, 
       coalesce(sub1.orderby,0) orderby
from
(
    select distinct g1.groupby
    from tt g1
) main
left join
(
    select groupby, count(*) orderby
    from tt g2 
    where local_search = 1
    group by groupby
) sub1 on sub1.groupby = main.groupby
option(maxdop 1)

This query is interesting in two aspects: (1) the estimation is accurate (see printscreen), (2) it has also different query plan, which is more efficient that the query plan of Q1.

So the question is: why the estimation of Q1 is inccorect, whereas the estimation of Q2 is precise? Please do not post other rewritings of this SQL (I know that that is can be written even without subqueries), I'm interested only about the explanation of the selectivity estimator behaviour. Thanks.

Solution

It doesn't recognize the orderby value will be the same for all rows with the same groupby so it thinks distinct groupby, orderby will have more combinations than just distinct groupby.

It multiplies the estimate for DISTINCT orderby (for me this is 35.0367) and the estimate for DISTINCT groupby (for me this is 100) as if they were uncorrelated.

I get an estimate for 3503.67 for the root node in Q1

This rewrite avoids it as it now only groups by the single groupby column.

SELECT groupby,
       max(orderby) AS orderby
FROM   (SELECT g1.groupby,
               (SELECT count(*)
                FROM   tt g2
                WHERE  local_search = 1
                       AND g1.groupby = g2.groupby) AS orderby
        FROM   tt g1) d
GROUP  BY groupby
OPTION(maxdop 1)

This is not the optimal approach to this query though as shown by your Q2 and the comment @GarethD makes about the inefficiencies of running the correlated sub query multiple times and discarding the duplicates.