This example was built in SQL Server 2016 but it should also apply to MySQL 8.X.
I have event logs data stored in a table fact_user_event_activity
with the following sample data:
event_date_key user_key step_key session_id event_timestamp
20140411 123 1 1000 2014-04-11 08:00:00.000
20140411 123 2 1000 2014-04-11 08:10:00.000
20140411 123 3 1000 2014-04-11 08:20:00.000
20140411 123 4 1000 2014-04-11 08:30:00.000
20140411 125 1 1001 2014-04-11 09:10:00.000
20140411 123 5 1000 2014-04-11 08:31:00.000
20140411 125 2 1001 2014-04-11 09:30:00.000
20140411 125 3 1001 2014-04-11 09:50:00.000 <--
20140411 125 3 1001 2014-04-11 09:51:00.000 <--
20140411 125 4 1001 2014-04-11 09:52:00.000
Assumptions:
125
on 2014-04-11 09:10:00.000
.Expected
What would be the most efficient way to query the following?
user_key session_id step_1_duration_mins step_2_duration_mins step_3_duration_mins step_4_duration_mins
123 1000 10 10 10 1
125 1001 20 20 2 NULL
This will be used as an ETL query for an accumulating snapshot
Setup
DROP TABLE IF EXISTS [fact_user_event_activity]
;
CREATE TABLE [fact_user_event_activity] (
[event_date_key] INT DEFAULT NULL,
[user_key] BIGINT NOT NULL,
[step_key] BIGINT NOT NULL,
[session_id] BIGINT NOT NULL,
[event_timestamp] datetime NOT NULL
)
;
INSERT INTO [fact_user_event_activity]
VALUES (20140411, 123, 1, 1000, N'2014-04-11 08:00:00'),
(20140411, 123, 2, 1000, N'2014-04-11 08:10:00'),
(20140411, 123, 3, 1000, N'2014-04-11 08:20:00'),
(20140411, 123, 4, 1000, N'2014-04-11 08:30:00'),
(20140411, 125, 1, 1001, N'2014-04-11 09:10:00'),
(20140411, 123, 5, 1000, N'2014-04-11 08:31:00'),
(20140411, 125, 2, 1001, N'2014-04-11 09:30:00'),
(20140411, 125, 3, 1001, N'2014-04-11 09:50:00'),
(20140411, 125, 3, 1001, N'2014-04-11 09:51:00'),
(20140411, 125, 4, 1001, N'2014-04-11 09:52:00'),
(20140411, 129, 1, 1005, N'2014-04-11 09:08:00'),
(20140411, 129, 2, 1005, N'2014-04-11 09:10:00'),
(20140411, 129, 3, 1005, N'2014-04-11 09:12:00'),
(20140411, 129, 3, 1005, N'2014-04-11 09:13:00'),
(20140411, 129, 4, 1005, N'2014-04-11 09:14:00'),
(20140411, 129, 5, 1005, N'2014-04-11 09:18:00')
;
My attempt
To easily understand the code I approached this in two steps:
This returns what I'm expecting but I'm sure that I might be over-complicating things and this will run against ~ 500 M records, so I was wondering if there is a better approach or if I'm missing something.
-- Step 1
-- to improve performance, use temp table instead of CTE
-- Use TIMESTAMPDIFF in MySQL instead of DATEDIFF
WITH durations_from_start_tmp AS
(
SELECT session_id, user_key, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp) first_login,
DENSE_RANK() OVER(PARTITION BY user_key, step_key, fuea.session_id ORDER BY fuea.event_timestamp) AS rnk,
CASE WHEN step_key = 2 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_1_duration_from_start,
CASE WHEN step_key = 3 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_2_duration_from_start,
CASE WHEN step_key = 4 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_3_duration_from_start,
CASE WHEN step_key = 5 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_4_duration_from_start
FROM [fact_user_event_activity] fuea
--WHERE event_timestamp > watermark --for incremental load
)
-- Step 2
SELECT user_key, session_id, SUM(step_1_duration_from_start) AS step_1_duration_mins,
SUM(step_2_duration_from_start) - SUM(step_1_duration_from_start) AS step_2_duration_mins ,
SUM(step_3_duration_from_start) - SUM(step_2_duration_from_start) AS step_3_duration_mins ,
SUM(step_4_duration_from_start) - SUM(step_3_duration_from_start) AS step_4_duration_mins
FROM durations_from_start_tmp
-- deals with repeated steps
WHERE rnk = 1
GROUP BY user_key, session_id
References
This might not be relevant to get the answer but just in case you're not familiar with Data Modeling concepts
So one approach you might take is adding an index (assuming you can add one) such as:
CREATE INDEX [SomeIndexName] ON [fact_user_event_activity] (user_key, session_id, step_key, event_timestamp);
(Or you could just do an include on step_key, event_timestamp if you're concerned about the size of the index on 500m rows.)
Then skip using the window functions with a query such as the following:
SELECT user_key,
session_id,
step_1_duration = DATEDIFF(MINUTE, step_1_timestamp, step_2_timestamp),
step_2_duration = DATEDIFF(MINUTE, step_2_timestamp, step_3_timestamp),
step_3_duration = DATEDIFF(MINUTE, step_3_timestamp, step_4_timestamp),
step_4_duration = DATEDIFF(MINUTE, step_4_timestamp, step_5_timestamp)
FROM
(
SELECT user_key, session_id,
step_1_timestamp = MIN(CASE WHEN step_key = 1 THEN event_timestamp END),
step_2_timestamp = MIN(CASE WHEN step_key = 2 THEN event_timestamp END),
step_3_timestamp = MIN(CASE WHEN step_key = 3 THEN event_timestamp END),
step_4_timestamp = MIN(CASE WHEN step_key = 4 THEN event_timestamp END),
step_5_timestamp = MIN(CASE WHEN step_key = 5 THEN event_timestamp END)
FROM fact_user_event_activity
GROUP BY user_key, session_id
) AS T;
(Which would theoretically do just an index scan without requiring any sorts.)