Search code examples
sql-servert-sqlsql-server-2014

Join most recent event into daily splits


I currently have a table of events with a time stamp and a "state" description of that event. The distribution of these events are rather random--I may have a hundred events in a day, followed by multiple days of inactivity. These events are thrown by individual machines, which can be identified by a machine ID.

The view exposing these events is defined thus:

create view V_Machine_State_Window as

    select LOCAL_DATETIME,
        MACHINE,
        STATE,
        datediff(second, LOCAL_DATETIME, lead(LOCAL_DATETIME,1) over(order by MACHINE, LOCAL_DATETIME)) as DURATION,
        isnull(left(STATE_TEXT, 60), '') as STATE_TEXT,
        cast(LOCAL_DATETIME as date) as STATE_DATE,
        cast(LOCAL_DATETIME as time) as STATE_TIME,
        lag(STATE,1) over(order by MACHINE, LOCAL_DATETIME) as PREV_STATE,
        isnull(left(lag(STATE_TEXT,1) over(order by MACHINE, LOCAL_DATETIME), 60), '') as PREV_STATE_TEXT,
        lag(LOCAL_DATETIME,1) over(order by MACHINE, LOCAL_DATETIME) as PREV_STATE_DATESTAMP,
        lead(STATE,1) over(order by MACHINE, LOCAL_DATETIME) as NEXT_STATE,
        isnull(left(lead(STATE_TEXT,1) over(order by MACHINE, LOCAL_DATETIME), 60), '') as NEXT_STATE_TEXT,
        lead(LOCAL_DATETIME,1) over(order by MACHINE, LOCAL_DATETIME) as NEXT_STATE_DATESTAMP
    from MACHINE_STATE_TABLE
    where LOCAL_DATETIME > DATEADD(day, -400, getdate())

For purposes of reporting, it's important that we have the duration of the state, the previous state it was in if there was one, and the next state it changed to if there is one.

As a requirement for analysis of machine behavior, I need to segregate this data by individual days, including adding a record for the beginning each day with the "state" of the most recent event thrown by that machine. With the table Day being just a list of days so I can cross apply, I'm using the following to add a daily record to every machine:

with m_in_scope
as (
    select distinct MACHINE
    from V_Machine_State_Window)

select d.Day as LOCAL_DATETIME,
    m_in_scope.MACHINE as MACHINE,
    NULL as STATE,
    NULL as DURATION,
    NULL as STATE_TEXT,
    d.Day as STATE_DATE,
    cast('00:00' as time) as STATE_TIME,
    NULL as PREV_STATE,
    NULL as PREV_STATE_TEXT,
    NULL as PREV_STATE_DATESTAMP,
    NULL as NEXT_STATE,
    NULL as NEXT_STATE_TEXT,
    NULL as NEXT_STATE_DATESTAMP
from Day d
cross join m_in_scope
where d.Day > DATEADD(day, -400, getdate())
    and d.Day < getdate()

The thing I'm struggling to do is figure out how to fill in those nulls. The state needs to be the most recent state before the start of the day, and the duration needs to be the number of seconds between the start of the day (00:00) and the datestamp of the next event or the end of the day, whichever is sooner. I'm comfortable doing all the maths here as long as I can get the previous and next records in my statement.

I've tried this as a CTE, but I can't find a rational way to partition.

I could just insert the nulls and run an update using a window expression, but as I specified I might have many of these daily records in a row, so the lag and lead of a window expression would only correctly update the next record after each "real" event, and not the ones after it. Doing this in a cursor to loop over it until there are no nulls seems like a really bad idea for processing time, as I'm looking at a table of approximately 50 million records split over a few thousand machines.

Is there a rational way to craft either a CTE or window expression (or some other alternative I haven't yet thought of) that will give me the previous and next record with a populated state from what I've described above?

NOTE: The server doing this processing is on SQL 2014. Please let me know if the right solution for this problem is version-limited.


Solution

  • Taking what you put in the fiddle, and turning it into DDL/DML:

    DECLARE @V_Machine_State_Window TABLE (Local_Datetime DATETIME, Machine NVARCHAR(128), State BIT, Duration INT, State_Text NVARCHAR(20), State_Date DATE, State_time TIME, Prev_State BIT, Prev_State_Text NVARCHAR(20),
                                           Prev_State_Datestamp DATETIME, Next_State BIT, Next_State_Text NVARCHAR(20), Next_State_Datestamp DATETIME);
    
    INSERT INTO @V_Machine_State_Window (Local_Datetime, Machine, State, Duration, State_Text, State_Date, State_Time, Prev_State, Prev_State_Text, Prev_State_Datestamp, Next_State, Next_State_Text, Next_State_Datestamp) VALUES
    ('2023-06-01 00:00:00.000', 'PC1', 0, 300   , 'Idle', '2023-06-01', '00:00:00.0000000', null, null, null, 1, 'Active', '2023-06-01 00:05:00.000'),
    ('2023-06-01 00:05:00.000', 'PC1', 1, 600   , 'Active', '2023-06-01', '00:05:00.0000000',   0, 'Idle', '2023-06-01 00:00:00.000', 0, 'Idle', '2023-06-01 00:15:00.000'),
    ('2023-06-01 00:15:00.000', 'PC1', 0, 373500, 'Idle', '2023-06-01', '00:15:00.0000000', 1, 'Active', '2023-06-01 00:05:00.000', 1, 'Active', '2023-06-05 08:00:00.000'),
    ('2023-06-05 08:00:00.000', 'PC1', 1, 600   , 'Active', '2023-06-05', '08:00:00.0000000', 0, 'Idle', '2023-06-01 00:15:00.000', 0, 'Idle', '2023-06-05 08:10:00.000'),
    ('2023-06-01 21:32:00.000', 'PC2', 1, 600, 'Spin', '2023-06-04', '21:32:00.00000', NULL, NULL, NULL, NULL, NULL, NULL),
    ('2023-06-04 21:32:00.000', 'PC2', 1, 600, 'Spin', '2023-06-04', '21:32:00.00000', NULL, NULL, NULL, NULL, NULL, NULL);
    

    (I added a couple of rows, for a second machine name). The Prev/Next states are superfluous, as we'll need to calculate them again after the addition of your daily start rows, as we will with the duration column, which seems to be based on the difference in seconds between this rows Local_Datetime and the next one.

    You mention you have a Days table, This example generates that data using a rCTE to fill in the series between the min and max dates in the table. This is then CROSS APPLY'd to the unique machine names, so we end up with a day, per machine.

    Finally, a result set based on the MachineDays is created by using an OUTER APPLY on the table itself to find the most recent actual row to pair with midnight on each day. This is UNION'd with the generated rows, and we can then perform the LAG and LEAD functions on it to fill in the previous, next and duration values.

    ;WITH Days AS (
    SELECT MIN(State_Date) AS Datetime, MAX(State_Date) AS mxDatetime
      FROM @V_Machine_State_Window
    UNION ALL
    SELECT DATEADD(DAY,1,Datetime), mxDatetime
      FROM Days
     WHERE Datetime < mxDatetime
    ), Machines AS (
    SELECT Machine
      FROM @V_Machine_State_Window
     GROUP BY Machine
    ), MachineDays AS (
    SELECT CAST(Datetime AS DATETIME) AS Datetime, Machine
      FROM Days  
        CROSS APPLY Machines
    )
    
    SELECT Local_Datetime, Machine, State, 
           DATEDIFF(SECOND,Local_Datetime,LEAD(Local_DateTime, 1) OVER (PARTITION BY Machine ORDER BY Local_Datetime)) AS Duration, 
           State_Text, State_Date, State_Time, 
           LAG(State,1) OVER (PARTITION BY Machine ORDER BY Local_Datetime) AS Prev_State, LAG(State_Text,1) OVER (PARTITION BY Machine ORDER BY Local_Datetime) AS Prev_State_Text, 
           LAG(Local_Datetime,1) OVER (PARTITION BY Machine ORDER BY Local_Datetime) AS Prev_State_Datestamp,
           LEAD(State,1) OVER (PARTITION BY Machine ORDER BY Local_Datetime) AS Next_State, LEAD(State_Text,1) OVER (PARTITION BY Machine ORDER BY Local_Datetime) AS Next_State_Text, 
           LEAD(Local_Datetime,1) OVER (PARTITION BY Machine ORDER BY Local_Datetime) AS Next_State_Datestamp
      FROM (
            SELECT md.Datetime AS Local_Datetime, a.Machine, a.State, a.Duration, a.State_Text, CAST(md.Datetime AS DATE) AS State_Date, '00:00:00' AS State_Time
              FROM MachineDays md
                OUTER APPLY (SELECT TOP 1 * FROM @V_Machine_State_Window WHERE Machine = md.Machine AND Local_Datetime < md.Datetime ORDER BY Local_Datetime DESC) a
             WHERE a.Machine IS NOT NULL
            UNION
            SELECT a.Local_Datetime, a.Machine, a.State, a.Duration, a.State_Text, a.State_Date, a.State_Time
              FROM @V_Machine_State_Window a
           ) a
    
    Local_Datetime Machine State Duration State_Text State_Date State_Time Prev_State Prev_State_Text Prev_State_Datestamp Next_State Next_State_Text Next_State_Datestamp
    2023-06-01 00:00:00.000 PC1 0 300 Idle 2023-06-01 00:00:00.0000000 1 Active 2023-06-01 00:05:00.000
    2023-06-01 00:05:00.000 PC1 1 600 Active 2023-06-01 00:05:00.0000000 0 Idle 2023-06-01 00:00:00.000 0 Idle 2023-06-01 00:15:00.000
    2023-06-01 00:15:00.000 PC1 0 85500 Idle 2023-06-01 00:15:00.0000000 1 Active 2023-06-01 00:05:00.000 0 Idle 2023-06-02 00:00:00.000
    2023-06-02 00:00:00.000 PC1 0 86400 Idle 2023-06-02 00:00:00.0000000 0 Idle 2023-06-01 00:15:00.000 0 Idle 2023-06-03 00:00:00.000
    2023-06-03 00:00:00.000 PC1 0 86400 Idle 2023-06-03 00:00:00.0000000 0 Idle 2023-06-02 00:00:00.000 0 Idle 2023-06-04 00:00:00.000
    2023-06-04 00:00:00.000 PC1 0 86400 Idle 2023-06-04 00:00:00.0000000 0 Idle 2023-06-03 00:00:00.000 0 Idle 2023-06-05 00:00:00.000
    2023-06-05 00:00:00.000 PC1 0 28800 Idle 2023-06-05 00:00:00.0000000 0 Idle 2023-06-04 00:00:00.000 1 Active 2023-06-05 08:00:00.000
    2023-06-05 08:00:00.000 PC1 1 Active 2023-06-05 08:00:00.0000000 0 Idle 2023-06-05 00:00:00.000
    2023-06-01 21:32:00.000 PC2 1 8880 Spin 2023-06-04 21:32:00.0000000 1 Spin 2023-06-02 00:00:00.000
    2023-06-02 00:00:00.000 PC2 1 86400 Spin 2023-06-02 00:00:00.0000000 1 Spin 2023-06-01 21:32:00.000 1 Spin 2023-06-03 00:00:00.000
    2023-06-03 00:00:00.000 PC2 1 86400 Spin 2023-06-03 00:00:00.0000000 1 Spin 2023-06-02 00:00:00.000 1 Spin 2023-06-04 00:00:00.000
    2023-06-04 00:00:00.000 PC2 1 77520 Spin 2023-06-04 00:00:00.0000000 1 Spin 2023-06-03 00:00:00.000 1 Spin 2023-06-04 21:32:00.000
    2023-06-04 21:32:00.000 PC2 1 8880 Spin 2023-06-04 21:32:00.0000000 1 Spin 2023-06-04 00:00:00.000 1 Spin 2023-06-05 00:00:00.000
    2023-06-05 00:00:00.000 PC2 1 Spin 2023-06-05 00:00:00.0000000 1 Spin 2023-06-04 21:32:00.000