Search code examples
sqlsql-serversql-server-2016

SQL Server sampling large volume of data per hour


Im using SQL Server 2016, and have a very large table containing millions of rows of data from different sources at irregular intervals over several years. The table cannot be altered, typical data looks like this -

Reading_ID    Source    Date                  Reading
==========    ======    ====                  =======
1             1         2023/01/01 00:04:00   7
2             1         2023/01/01 00:10:00   3
3             2         2023/01/01 00:15:00   8
4             1         2023/01/01 01:00:00   2
5             2         2023/01/01 01:03:00   15

The table has CONSTRAINT [PK_DATA_READINGS] PRIMARY KEY CLUSTERED ([Source] ASC, [Date] ASC). The SOURCE can be any number, its not fixed or known in advance. New sources can start at any time.

What I want to do is specify a date range and an interval in hours, then just get 1 reading from each source every X hours. i.e. in the above row 2 wouldn't be returned as its too close to row 1

I've tried something like the following -

DECLARE @Start_Date DATETIME = '2023/01/01 00:00:00',
        @End_Date DATETIME = '2023/02/01 00:00:00',
        @Interval_Hours = 4

;WITH HOURLY_DATA AS (
     SELECT d.Source, 
            d.Date, 
            d.Reading,
            ROW_NUMBER() OVER (PARTITION BY d.Source, DATEDIFF(HOUR, @Start_Date, d.DATE) / @Interval_Hours ORDER BY d.SOURCE, d.DATE) AS SOURCE_HOUR_ROW
     FROM data_readings d
     WHERE d.DATE BETWEEN @Start_Date AND @End_Date
)
SELECT h.Source,
       h.Date,
       h.Reading
FROM HOURLY_DATA h
WHERE h.SOURCE_HOUR_ROW = 1

But its still very slow to execute, sometimes taking 5 minutes or more to complete. I would like a faster way to get this data. I've looked at the Explain Plan, but cant see an obvious solution.

Thank you for looking.


Solution

  • The slowness is caused by the volume of data in the CTE. I found this solution which works faster How to sample records by time