When looking up a row on a table called reason:
machine_id reason start_time
001234 moving 10:00:00
001234 parked 10:10:00
001234 moving 10:15:00
001234 NULL 10:20:00
001234 NULL 10:25:00
001234 dumping 10:30:00
009876 parked 10:00:00
009876 NULL 10:10:00
009876 NULL 10:15:00
009876 moving 10:20:00
009876 dumping 10:25:00
For any reason's with a NULL value I need to get the most recent value that wasn't NULL, so 001234 NULL values would both become 'moving', and 009876 NULL values would both become 'parked'.
I would normally solve this with a cross apply like:
SELECT
r1.machine_id
,ISNULL(r1.reason,r2.reason) AS reason
,r1.start_time
FROM #reason r1
CROSS APPLY(
SELECT TOP 1
r2.reason
FROM #reason r2
WHERE r2.machine_oid = r1.machine_oid
AND r2.start_time < r1.start_time
AND r2.reason IS NOT NULL
ORDER BY start_time DESC
) r2
But this table I'm querying is hundreds of thousands of rows (can't modify the source database) and the complexity of the query seems close to n^2.
In C++ I would solve by using a priority queue and discarding items from the list of objects to check that didn't meet the criteria so the complexity is closer to nlogn.
I tried understanding the post about using tables as queues here: http://rusanu.com/2010/03/26/using-tables-as-queues but it was beyond my skill level.
Because it's a very common requirement for my datasets, I was hoping there is an elegant solution that can be applied?
Something like this:
DECLARE @DataSource TABLE
(
[machine_id] VARCHAR(6)
,[reason] VARCHAR(12)
,[start_time] TIME
);
INSERT INTO @DataSource([machine_id], [reason], [start_time])
VALUES ('001234', 'moving', '10:00:00')
,('001234', 'parked', '10:10:00')
,('001234', 'moving', '10:15:00')
,('001234', NULL, '10:20:00')
,('001234', NULL, '10:25:00')
,('001234', 'dumping', '10:30:00')
,('009876', 'parked', '10:00:00')
,('009876', NULL, '10:10:00')
,('009876', NULL, '10:15:00')
,('009876', 'moving', '10:20:00')
,('009876', 'dumping', '10:25:00');
SELECT [machine_id]
,[reason] AS [reason_old]
,ISNULL([reason], MAX([Reason]) OVER (PARTITION BY [machine_id], [RowID])) AS [reason]
,[start_time]
FROM
(
SELECT *
,SUM(IIF([reason] IS NULL, 0, 1)) OVER (PARTITION BY [machine_id] ORDER BY [start_time] ASC) AS [RowID]
FROM @DataSource
) DS
ORDER BY [machine_id]
,[start_time];
The idea is to use SUM
to group the records with NULL
value with the the first record with NOT NULL
value.
SELECT *
,SUM(IIF([reason] IS NULL, 0, 1)) OVER (PARTITION BY [machine_id] ORDER BY [start_time] ASC) AS [RowID]
FROM @DataSource;
Then, we can simply get the MAX/MIN
value for such group as these aggregates ignore NULL
s and will return the NOT NULL
value.