Search code examples
sqlsql-serverquery-optimization

How to make this query faster: TOP N over TOP 1 in past + all in future


I have a table that contains some data. There are some attributes that are used to 'group' the rows and there is a date column. I'm trying to improve the performance of a query that basically gets TOP X rows from this table, where it returns only the latest row if the data is in the past + all future rows.

My data:

Id GroupingId Date Whatever
1 1 2023-01-01 Value1
2 1 2023-01-02 Value2
3 2 2023-01-03 Value3
4 1 2040-01-01 Value1

My query:

    WITH cte AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY GroupingId
                ORDER BY Date DESC) 
            as rnk
        FROM myData
        WHERE Date <= SYSUTCDATETIME()
        UNION
        SELECT *, 1 as rnk
        FROM myData
        WHERE Date > SYSUTCDATETIME()
    )
    SELECT *
    FROM cte
    WHERE rnk = 1
    ORDER BY GroupingId
    OFFSET 0 ROWS Fetch NEXT 100 ROWS ONLY

My expected output:

Id GroupingId Date Whatever
2 1 2023-01-02 Value2
4 1 2040-01-01 Value1
3 2 2023-01-03 Value3

Note that the above is huge simplification of the actual data structure. But This seems to be the culprit. Basically, due to the order by (and possibly filtering conditions that might show up), sql server needs to 'materialize' (aka read from disk) the whole first part of the cte. I'm looking for a way to modify the query/index strategy/indexed view so that the query can be executed without loading everything into memory and just reading top X rows from index/whatever. In essence, this query needs to execute instantly, even if the table has many milions of rows.

EDIT 1:

Execution plan for the above here

Index that is already there:

CREATE NONCLUSTERED INDEX [idx_groupingId_date_test] ON [dbo].[myData]
(
    [GroupingId] ASC,
    [Date] DESC
)

Some other useful info:

Yes, the data is distributed so that there is much more data to be expected in the past. E.g. for every GroupingId there might be couple of rows in the future but dozens/hundreds in the past.

Removing the union makes the query super fast. This is actually the recent change that caused the performance problem that I'm trying to fix.


Solution

  • Ok, some comments got me thinking and on a right track. I figured, I basically need a way to get rid of the Union/merge join from the query plan and have nested loops, so the query can be executed sequentially all the way.

    SELECT TOP 100 * 
    FROM (
        SELECT *
        FROM mydata o
        WHERE (o.date > SYSUTCDATETIME() OR (
            o.id = (SELECT TOP 1 id FROM mydata s WHERE s.groupingid = o.groupingid AND s.date < SYSUTCDATETIME() ORDER BY DATE DESC)))
    ) r
    ORDER BY r.groupingid
    

    This gets me from 1 minute in the original query (or the other answers so far) to below 100ms. New query plan: https://www.brentozar.com/pastetheplan/?id=r1sQdhM0s