Search code examples
mysqlsql-serverdata-warehouse

Efficient query to get step duration from an event logs table in sql into an accumulating snapshot fact


This example was built in SQL Server 2016 but it should also apply to MySQL 8.X.

I have event logs data stored in a table fact_user_event_activity with the following sample data:

event_date_key  user_key    step_key    session_id  event_timestamp
20140411        123         1           1000        2014-04-11 08:00:00.000
20140411        123         2           1000        2014-04-11 08:10:00.000
20140411        123         3           1000        2014-04-11 08:20:00.000
20140411        123         4           1000        2014-04-11 08:30:00.000
20140411        125         1           1001        2014-04-11 09:10:00.000
20140411        123         5           1000        2014-04-11 08:31:00.000
20140411        125         2           1001        2014-04-11 09:30:00.000
20140411        125         3           1001        2014-04-11 09:50:00.000  <-- 
20140411        125         3           1001        2014-04-11 09:51:00.000  <--
20140411        125         4           1001        2014-04-11 09:52:00.000

Assumptions:

  • All incoming records by user_key are ordered by date. However records are not ordered by user_key. For example, take a look of user_key 125 on 2014-04-11 09:10:00.000.
  • Steps are predictable. This process will always contain 5 steps where the last step means EXIT
  • Steps on the same session can be logged multiple times at different dates

Expected

What would be the most efficient way to query the following?

user_key     session_id    step_1_duration_mins    step_2_duration_mins     step_3_duration_mins    step_4_duration_mins
123             1000           10                         10                       10                    1
125             1001           20                         20                        2                 NULL

This will be used as an ETL query for an accumulating snapshot

Setup

DROP TABLE IF EXISTS  [fact_user_event_activity]
;
CREATE TABLE [fact_user_event_activity] (
  [event_date_key] INT DEFAULT NULL,
  [user_key] BIGINT NOT NULL,
  [step_key] BIGINT NOT NULL,
  [session_id] BIGINT NOT NULL,
  [event_timestamp] datetime NOT NULL
)
;
INSERT INTO [fact_user_event_activity]
VALUES (20140411, 123, 1, 1000, N'2014-04-11 08:00:00'),
(20140411, 123, 2, 1000, N'2014-04-11 08:10:00'),
(20140411, 123, 3, 1000, N'2014-04-11 08:20:00'),
(20140411, 123, 4, 1000, N'2014-04-11 08:30:00'),
(20140411, 125, 1, 1001, N'2014-04-11 09:10:00'),
(20140411, 123, 5, 1000, N'2014-04-11 08:31:00'),
(20140411, 125, 2, 1001, N'2014-04-11 09:30:00'),
(20140411, 125, 3, 1001, N'2014-04-11 09:50:00'),
(20140411, 125, 3, 1001, N'2014-04-11 09:51:00'),
(20140411, 125, 4, 1001, N'2014-04-11 09:52:00'),
(20140411, 129, 1, 1005, N'2014-04-11 09:08:00'),
(20140411, 129, 2, 1005, N'2014-04-11 09:10:00'),
(20140411, 129, 3, 1005, N'2014-04-11 09:12:00'),
(20140411, 129, 3, 1005, N'2014-04-11 09:13:00'),
(20140411, 129, 4, 1005, N'2014-04-11 09:14:00'),
(20140411, 129, 5, 1005, N'2014-04-11 09:18:00')
;

My attempt

To easily understand the code I approached this in two steps:

  1. Get every step's duration from the start (start of session)
  2. Calculate the difference between every step's duration_from_start

This returns what I'm expecting but I'm sure that I might be over-complicating things and this will run against ~ 500 M records, so I was wondering if there is a better approach or if I'm missing something.

-- Step 1
-- to improve performance, use temp table instead of CTE
-- Use TIMESTAMPDIFF in MySQL instead of DATEDIFF
WITH durations_from_start_tmp AS
    (
    SELECT session_id, user_key, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp) first_login,
    DENSE_RANK() OVER(PARTITION BY user_key, step_key, fuea.session_id ORDER BY fuea.event_timestamp) AS rnk,
    CASE WHEN step_key = 2 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_1_duration_from_start,
    CASE WHEN step_key = 3 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_2_duration_from_start,
    CASE WHEN step_key = 4 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_3_duration_from_start,
    CASE WHEN step_key = 5 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_4_duration_from_start
    FROM [fact_user_event_activity] fuea
    --WHERE event_timestamp > watermark --for incremental load
    )

-- Step 2
SELECT user_key, session_id, SUM(step_1_duration_from_start) AS step_1_duration_mins,
 SUM(step_2_duration_from_start) - SUM(step_1_duration_from_start) AS step_2_duration_mins ,
 SUM(step_3_duration_from_start) - SUM(step_2_duration_from_start) AS step_3_duration_mins ,
 SUM(step_4_duration_from_start) - SUM(step_3_duration_from_start) AS step_4_duration_mins
 FROM durations_from_start_tmp
 -- deals with repeated steps
 WHERE rnk = 1
 GROUP BY  user_key, session_id

References

This might not be relevant to get the answer but just in case you're not familiar with Data Modeling concepts

Accumulating Snapshots Definition


Solution

  • So one approach you might take is adding an index (assuming you can add one) such as:

    CREATE INDEX [SomeIndexName] ON [fact_user_event_activity] (user_key, session_id, step_key, event_timestamp);

    (Or you could just do an include on step_key, event_timestamp if you're concerned about the size of the index on 500m rows.)

    Then skip using the window functions with a query such as the following:

    SELECT user_key,
           session_id,
           step_1_duration = DATEDIFF(MINUTE, step_1_timestamp, step_2_timestamp),
           step_2_duration = DATEDIFF(MINUTE, step_2_timestamp, step_3_timestamp),
           step_3_duration = DATEDIFF(MINUTE, step_3_timestamp, step_4_timestamp),
           step_4_duration = DATEDIFF(MINUTE, step_4_timestamp, step_5_timestamp)
    FROM 
    (
        SELECT user_key, session_id,
               step_1_timestamp = MIN(CASE WHEN step_key = 1 THEN event_timestamp END),
               step_2_timestamp = MIN(CASE WHEN step_key = 2 THEN event_timestamp END),
               step_3_timestamp = MIN(CASE WHEN step_key = 3 THEN event_timestamp END),
               step_4_timestamp = MIN(CASE WHEN step_key = 4 THEN event_timestamp END),
               step_5_timestamp = MIN(CASE WHEN step_key = 5 THEN event_timestamp END)
        FROM fact_user_event_activity
        GROUP BY user_key, session_id
    ) AS T;
    

    (Which would theoretically do just an index scan without requiring any sorts.)