I have a table that contains some data. There are some attributes that are used to 'group' the rows and there is a date column. I'm trying to improve the performance of a query that basically gets TOP X rows from this table, where it returns only the latest row if the data is in the past + all future rows.
My data:
Id | GroupingId | Date | Whatever |
---|---|---|---|
1 | 1 | 2023-01-01 | Value1 |
2 | 1 | 2023-01-02 | Value2 |
3 | 2 | 2023-01-03 | Value3 |
4 | 1 | 2040-01-01 | Value1 |
My query:
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY GroupingId
ORDER BY Date DESC)
as rnk
FROM myData
WHERE Date <= SYSUTCDATETIME()
UNION
SELECT *, 1 as rnk
FROM myData
WHERE Date > SYSUTCDATETIME()
)
SELECT *
FROM cte
WHERE rnk = 1
ORDER BY GroupingId
OFFSET 0 ROWS Fetch NEXT 100 ROWS ONLY
My expected output:
Id | GroupingId | Date | Whatever |
---|---|---|---|
2 | 1 | 2023-01-02 | Value2 |
4 | 1 | 2040-01-01 | Value1 |
3 | 2 | 2023-01-03 | Value3 |
Note that the above is huge simplification of the actual data structure. But This seems to be the culprit. Basically, due to the order by (and possibly filtering conditions that might show up), sql server needs to 'materialize' (aka read from disk) the whole first part of the cte. I'm looking for a way to modify the query/index strategy/indexed view so that the query can be executed without loading everything into memory and just reading top X rows from index/whatever. In essence, this query needs to execute instantly, even if the table has many milions of rows.
EDIT 1:
Execution plan for the above here
Index that is already there:
CREATE NONCLUSTERED INDEX [idx_groupingId_date_test] ON [dbo].[myData]
(
[GroupingId] ASC,
[Date] DESC
)
Some other useful info:
Yes, the data is distributed so that there is much more data to be expected in the past. E.g. for every GroupingId
there might be couple of rows in the future but dozens/hundreds in the past.
Removing the union makes the query super fast. This is actually the recent change that caused the performance problem that I'm trying to fix.
Ok, some comments got me thinking and on a right track. I figured, I basically need a way to get rid of the Union/merge join from the query plan and have nested loops, so the query can be executed sequentially all the way.
SELECT TOP 100 *
FROM (
SELECT *
FROM mydata o
WHERE (o.date > SYSUTCDATETIME() OR (
o.id = (SELECT TOP 1 id FROM mydata s WHERE s.groupingid = o.groupingid AND s.date < SYSUTCDATETIME() ORDER BY DATE DESC)))
) r
ORDER BY r.groupingid
This gets me from 1 minute in the original query (or the other answers so far) to below 100ms. New query plan: https://www.brentozar.com/pastetheplan/?id=r1sQdhM0s