Search code examples
sqlsql-servergroup-byquery-optimization

SQL Group By Count Filtering Optimization


I'm wondering if group by count with filter works optimally, meaning that let's assume we have following query:

SELECT TenantId FROM SomeTable GROUP BY TenantId HAVING Count(*) >= 2

Are all elements counted for each group and then groups get filtered or there will be an optimization so as soon as some of the groups has more than 2 elements remaining elements will not be counted and group will be included in the final result?

And if there is no optimization would following query work better assuming that each group has a lot of rows:

SELECT [t1].TenantId FROM (
  SELECT [t0].TenantId FROM SomeTable AS [t0]
  GROUP BY [t0].TenantId) AS [t1]
WHERE (
  SELECT COUNT(*)
  FROM (
  SELECT TOP(2) NULL AS [EMPTY]
  FROM SomeTable AS [t2]
  WHERE [t1].TenantId = [t2].TenantId
  ) AS [t3]
) >= 2

Solution

  • To be able to do what you ask, you would need skip-scanning. In other words, the server needs to read from each group just the necessary number of rows, then skip and seek to the next group.

    Unfortunately, SQL Server does not implement skip-scanning. This is partly to do with the fact that it is not actually as useful as you might imagine, because constantly skipping across the index is often less efficient than just scanning the whole thiing in the first place.


    As you say, it can depend on the number of rows in each group. If, and only if, the size of the groups is very large relative to the HAVING COUNT(*) clause, then it may be worthwhile implementing a skip-scan yourself in order to get distinct values, as show by Paul White in this great article. It can be done with difficulty with a recursive CTE, but I'm not sure how it will optimize given you are grouping as well.

    So I will show you the iterative table-variable solution, which should be just as fast. All credits to Paul White for this technique

    DECLARE @Results TABLE (TenantId int NOT NULL, cnt int NOT NULL);
    DECLARE @next int, @cnt int; 
    
    SELECT TOP (1)
           @tenantId = TenantId, @cnt = COUNT(*)
        FROM SomeTable
        GROUP BY TenantId
        ORDER BY TenantId;
    
    WHILE (@cnt > 0)
    BEGIN
        INSERT @Results (TenantId, cnt) VALUES (@TenantId, @cnt);
        
        SELECT TOP (1)
               @tenantId = TenantId, @cnt = COUNT(*)
            FROM SomeTable
            WHERE TenantId > @tenantId
            GROUP BY TenantId
            ORDER BY TenantId;
    END;
    
    SELECT TenantId
    FROM @Results
    WHERE cnt >= 2;