I have a Very Large/slow table (35gb data, 70gb indexes) on an Azure SQL MI that's roughly this:
create table a (id int identity, mydate datetime, fielda varchar(10), fieldb money, fieldc char(50), fieldd datetime)
create index ncidx__a__mydate on a (mydate)
I want to get a list of which hours have 0 rows, though which hours have any rows is close enough (bang it against a numbers table and you can get a list). I was trying to think of the bare minimum needed.
I could do something like this, but would need 30*24 of them for each month. Seems like something a function with a cross-apply could do.
SELECT * FROM (SELECT '20240703 1am test' res) a WHERE EXISTS (SELECT * FROM dbo.mytable WHERE mydate >='20240701 01:00:00' AND mydate <'20240701 02:00:00')
I came up with this (based off an Itzik Ben-Gan query), which... works, and is good enough for me for now. (3 minutes for a month, 200m rows, 34gb data, 70gb of indexes including the one you see. It does 715992 logical reads).
SELECT orderday, orderhour, MIN(mydate) AS mindate
FROM mytable
CROSS APPLY (VALUES(day(mydate), datepart(HOUR,mydate)))
AS D(orderday, orderhour)
WHERE mydate >='20240501' AND mydate <'20240601'
GROUP BY orderday, orderhour
But is there a faster way? I don't need a min date, I just need to know if there are ANY rows in there. It seems like it could be almost instant, but the logic of doing that is beyond me. Thanks.
You are on the right track with the mydate
index. Most solutions would likely involve either a loop join and an index seek (probe) into that index, or a range scan of the relevant portion of the index (better than a table scan) that is fed to logic that looks for gaps.
The most obvious and possibly the most efficient solution would be to generate a range of data/time values that is then filtered by a where not exists(...)
clause. Something like:
declare @FromDate datetime = '2024-07-01'
declare @ToDate datetime = dateadd(month, 1, @FromDate) -- Exclusive
declare @DiffHours int = datediff(hour, @FromDate, @ToDate)
-- Using not exists
-- Loop join and index seek (probe)
select h.Hour
from generate_series(0, @DiffHours - 1) s
cross apply (select dateadd(hour, s.value, @FromDate) as Hour) h
where not exists (
select *
from a
where a.mydate >= h.Hour
and a.mydate < dateadd(hour, 1, h.Hour)
)
order by h.Hour
For a one month range, the index would be probed by the exists
clause about 750 times.
Another approach would be to scan the date range, reduce it to a set of distinct hours values and then subtract those values from a generated list of all hours. The subtract operation could be performed using except
or with a left join ... where right-side is null
construct.
declare @FromDate datetime = '2024-07-01'
declare @ToDate datetime = dateadd(month, 1, @FromDate) -- Exclusive
declare @DiffHours int = datediff(hour, @FromDate, @ToDate)
-- Using except
-- Index seek (range scan) and merge join (or spool + loop join)
select h.Hour
from generate_series(0, @DiffHours - 1) s
cross apply (select dateadd(hour, s.value, @FromDate) as Hour) h
except
select distinct datetrunc(hour, a.mydate) as Hour
from a
where a.mydate >= @FromDate
and a.mydate < @ToDate
order by Hour
-- Using left join
-- Index seek (range scan) and merge join (or spool + loop join)
-- (Slighly different from the "except" plan)
select h.Hour
from generate_series(0, @DiffHours - 1) s
cross apply (select dateadd(hour, s.value, @FromDate) as Hour) h
left join (
select distinct datetrunc(hour, a.mydate) as Hour
from a
where a.mydate >= @FromDate
and a.mydate < @ToDate
) h2
on h2.Hour = h.Hour
where h2.Hour is null
order by H.Hour
Finally, an ordered scan of the date range can be performed, checking for gaps using the LAG()
window function to compare the current value to the prior. A UNION ALL
is included to ensure that we have the end-cases (first and last hour) covered.
There might be a slight performance gain if the UNION
operations are factoring out to then next outer query level, since data is already sorted, but the cost is two more index seeks. Both versions follow.
declare @FromDate datetime = '2024-07-01'
declare @ToDate datetime = dateadd(month, 1, @FromDate) -- Exclusive
-- Index seek (range scan), using lag() to check for gaps.
-- A union-all is used to ensure we have the end-cases covered.
select
dd.MissingHours,
dateadd(hour, 1, dp.PriorHour) as FromHour,
dateadd(hour, -1, Hour) as ThruHour
from (
select
d.Hour,
lag(d.Hour) over(order by d.Hour) as PriorHour
from (
select dateadd(hour, -1, @FromDate) as Hour
union all
select @ToDate as Hour
union all
select datetrunc(hour, a.mydate) as Hour
from a
where a.mydate >= @FromDate
and a.mydate < @ToDate
) d
) dp
cross apply (select datediff(hour, dp.PriorHour, dp.Hour) - 1 as MissingHours) dd
where dd.MissingHours >= 1
order by dp.PriorHour
-- Index seek (range scan), using lag() to check for gaps.
-- Factoring out this union /might/ yield a slight performance gain,
-- since data is already sorted, but the cist is two more index seeks.
select
dd.MissingHours,
dateadd(hour, 1, dp.PriorHour) as FromHour,
dateadd(hour, -1, Hour) as ThruHour
from (
select
d.Hour,
lag(d.Hour) over(order by d.Hour) as PriorHour
from (
select datetrunc(hour, a.mydate) as Hour
from a
where a.mydate >= @FromDate
and a.mydate < @ToDate
) d
union all
select
datetrunc(hour, min(a.mydate)) as Hour,
dateadd(hour, -1, @FromDate) as PriorHour
from a
where a.mydate >= @FromDate
and a.mydate < @ToDate
union all
select
@ToDate as Hour,
datetrunc(hour, max(a.mydate)) as PriorHour
from a
where a.mydate >= @FromDate
and a.mydate < @ToDate
) dp
cross apply (select datediff(hour, dp.PriorHour, dp.Hour) - 1 as MissingHours) dd
where dd.MissingHours >= 1
order by dp.PriorHour
The above yields ranges instead of individual rows for each missing hour. This might be desirable. If not, the above can be modified with a GENERATE_SERIES()
and DATEADD()
to expand each resulting range.
select dateadd(hour, s.value, dp.PriorHour) as Hour
from (
...
) dp
cross apply (select datediff(hour, dp.PriorHour, dp.Hour) - 1 as MissingHours) dd
cross apply generate_series(1, dd.MissingHours) s
where dd.MissingHours >= 1
order by Hour
See this db<>fiddle for a demo of each of the above with execution plans using some generated test data having 1-, 3-, and 5-hour gaps.
You can run performance tests against your live data to see what works best for you.
Sample results (discrete hours and hour ranges):
Hour |
---|
2024-07-05 04:00 |
2024-07-11 10:00 |
2024-07-11 11:00 |
2024-07-11 12:00 |
2024-07-11 13:00 |
2024-07-22 21:00 |
2024-07-22 22:00 |
2024-07-22 23:00 |
2024-07-23 00:00 |
2024-07-23 01:00 |
2024-07-23 02:00 |
MissingHours | FromHour | ThruHour |
---|---|---|
1 | 2024-07-05 04:00 | 2024-07-05 04:00 |
3 | 2024-07-11 10:00 | 2024-07-11 12:00 |
5 | 2024-07-22 21:00 | 2024-07-23 01:00 |
OLDER SQL SERVER VERSIONS
The above SQL solutions use the DATE_TRUNC()
and/or GENERATE_SERIES()
functions available in SQL Server 2022 and later.
For earlier versions of SQL Server, the DATE_TRUNC(hour, xxx)
function can be replaced with the calculation dateadd(hour, datediff(hour, 0, xxx), 0)
. The GENERATE_SERIES()
function can be replaced with a number generator wrapped up in a CTE or cross apply
. There are many techniques out there, but for this application, you can use a variation of the following:
cross apply (
select top (n) -- 1..n
row_number() over(order by (select null)) as value
from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) n1(n) -- Up to 10
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) n2(n) -- Up to 100
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) n3(n) -- Up to 1000
) s
The row_number()
function produces numbers starting at 1. Subtracting 1 and adjusting the top (n)
value is needed for zero-based or other-based ranges. Add more cross joins if more than 1000 values are needed.
See this db<>fiddle for a demo that works for earlier SQL Server versions back to at least 2014.