Assume I have data structured as so in BigQuery:
WITH session_log AS (
SELECT 'ABC' as site_id, 1234 user_id, 12 session_id, '2020-02-10 00:29:59.376000 UTC' start_time, '2020-02-10 01:13:02.817000 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 13 session_id, '2020-02-10 02:41:56.330000 UTC' start_time, '2020-02-10 02:41:56.389999 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 14 session_id, '2020-02-10 04:24:46.649999 UTC' start_time, '2020-02-10 05:14:08.243000 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 1234 user_id, 15 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time
SELECT 'ABC' as site_id, 6789 user_id, 25 session_id, '2020-02-10 02:15:38.560000 UTC' start_time, '2020-02-10 02:56:38.784500 UTC' end_time UNION ALL
SELECT 'ABC' as site_id, 6789 user_id, 26 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time
)
SELECT site_id, user_id, session_id, start_time, end_time FROM session_log
I would like to query the session_log
so that a session is flagged as "concurrent" if there is another record with the same site_id
and user_id
whose timeframe overlaps another session. (The meaning here is that a user is visiting the same site on more than one device simultaneously.)
Ideally, I need a query that would produce the following, because at least a portion of sessions 14 and 15 overlap for that specific user. Session 26 overlaps sessions 14 & 15 but is not concurrent because it is a different user_id
.
site_id | user_id | session_id | start_time | end_time | concurrent_session |
---|---|---|---|---|---|
ABC | 1234 | 12 | 2020-02-10 00:29:59.376000 UTC | 2020-02-10 01:13:02.817000 UTC | false |
ABC | 1234 | 13 | 2020-02-10 02:41:56.330000 UTC | 2020-02-10 02:41:56.389999 UTC | false |
ABC | 1234 | 14 | 2020-02-10 04:24:46.649999 UTC | 2020-02-10 05:14:08.243000 UTC | true |
ABC | 1234 | 15 | 2020-02-10 04:59:21.356999 UTC | 2020-02-10 05:57:11.501000 UTC | true |
ABC | 6789 | 25 | 2020-02-10 02:15:38.560000 UTC | 2020-02-10 02:56:38.784500 UTC | false |
ABC | 6789 | 26 | 2020-02-10 04:44:21.356999 UTC | 2020-02-10 06:57:11.501000 UTC | false |
I tried to create a user defined function that would search the table for sessions with that same site_id
and user_id
but not the session_id
with overlapping times but it's failing miserably. I'm almost embarrassed to put this here, but...what the heck.
CREATE TEMPORARY FUNCTION getConcurrentSessions(_site_id STRING, _user_id INT64, _session_id INT64, _start_time TIMESTAMP, _end_time TIMESTAMP)
AS
(
(
SELECT count(session_id)
FROM `session_log`
WHERE site_id = _site_id
AND user_id = _user_id
AND session_id != _session_id
AND (
(_start_time BETWEEN start_time AND end_time)
OR
(_end_time BETWEEN start_time AND end_time)
)
)
);
SELECT site_id, user_id, session_id, start_time, end_time,
IF (
getConcurrentSessions(site_id, user_id, session_id, start_time, end_time) > 0,
TRUE,
FALSE
) AS concurrent_sessions
FROM session_log
Any and all advice appreciated. Thank you.
Consider below approach
select *,
ifnull(start_time <= lag(end_time) over win or
end_time >= lead(start_time) over win, false) as concurrent_session
from your_table
window win as (partition by site_id, user_id order by start_time)
if applied to sample data in y our question - output is