Search code examples
sqlsql-servert-sqlsql-server-2017

Lookup previous values on condition in SQL Server 2017 using a table as a queue


When looking up a row on a table called reason:

machine_id  reason   start_time
001234      moving   10:00:00
001234      parked   10:10:00
001234      moving   10:15:00
001234      NULL     10:20:00
001234      NULL     10:25:00
001234      dumping  10:30:00

009876      parked   10:00:00
009876      NULL     10:10:00
009876      NULL     10:15:00
009876      moving   10:20:00
009876      dumping  10:25:00

For any reason's with a NULL value I need to get the most recent value that wasn't NULL, so 001234 NULL values would both become 'moving', and 009876 NULL values would both become 'parked'.

I would normally solve this with a cross apply like:

SELECT 
    r1.machine_id    
    ,ISNULL(r1.reason,r2.reason) AS reason
    ,r1.start_time
FROM #reason r1
CROSS APPLY(
    SELECT TOP 1
            r2.reason
    FROM #reason r2
    WHERE r2.machine_oid = r1.machine_oid 
    AND r2.start_time < r1.start_time
    AND r2.reason IS NOT NULL
    ORDER BY start_time DESC
) r2

But this table I'm querying is hundreds of thousands of rows (can't modify the source database) and the complexity of the query seems close to n^2.

In C++ I would solve by using a priority queue and discarding items from the list of objects to check that didn't meet the criteria so the complexity is closer to nlogn.

I tried understanding the post about using tables as queues here: http://rusanu.com/2010/03/26/using-tables-as-queues but it was beyond my skill level.

Because it's a very common requirement for my datasets, I was hoping there is an elegant solution that can be applied?


Solution

  • Something like this:

    DECLARE @DataSource TABLE
    (
        [machine_id] VARCHAR(6)
       ,[reason] VARCHAR(12)
       ,[start_time] TIME
    );
    
    INSERT INTO @DataSource([machine_id], [reason], [start_time])
    VALUES ('001234', 'moving', '10:00:00')
          ,('001234', 'parked', '10:10:00')
          ,('001234', 'moving', '10:15:00')
          ,('001234', NULL, '10:20:00')
          ,('001234', NULL, '10:25:00')
          ,('001234', 'dumping', '10:30:00')
          ,('009876', 'parked', '10:00:00')
          ,('009876', NULL, '10:10:00')
          ,('009876', NULL, '10:15:00')
          ,('009876', 'moving', '10:20:00')
          ,('009876', 'dumping',  '10:25:00');
    
    
    SELECT [machine_id]
          ,[reason] AS [reason_old]
          ,ISNULL([reason], MAX([Reason]) OVER (PARTITION BY [machine_id], [RowID])) AS [reason]
          ,[start_time]
    FROM 
    (
        SELECT *
              ,SUM(IIF([reason] IS NULL, 0, 1)) OVER (PARTITION BY [machine_id] ORDER BY [start_time] ASC) AS [RowID]
        FROM @DataSource 
    ) DS
    ORDER BY [machine_id]
            ,[start_time];
    

    enter image description here

    The idea is to use SUM to group the records with NULL value with the the first record with NOT NULL value.

    SELECT *
          ,SUM(IIF([reason] IS NULL, 0, 1)) OVER (PARTITION BY [machine_id] ORDER BY [start_time] ASC) AS [RowID]
    FROM @DataSource;
    

    enter image description here

    Then, we can simply get the MAX/MIN value for such group as these aggregates ignore NULLs and will return the NOT NULL value.