Tags: sql, google-bigquery, bigquery-udf

Find overlapping time periods in BigQuery


Assume I have data structured as so in BigQuery:

WITH session_log AS (
  SELECT 'ABC' as site_id, 1234 user_id, 12 session_id, '2020-02-10 00:29:59.376000 UTC' start_time, '2020-02-10 01:13:02.817000 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 13 session_id, '2020-02-10 02:41:56.330000 UTC' start_time, '2020-02-10 02:41:56.389999 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 14 session_id, '2020-02-10 04:24:46.649999 UTC' start_time, '2020-02-10 05:14:08.243000 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 1234 user_id, 15 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 6789 user_id, 25 session_id, '2020-02-10 02:15:38.560000 UTC' start_time, '2020-02-10 02:56:38.784500 UTC' end_time UNION ALL
  SELECT 'ABC' as site_id, 6789 user_id, 26 session_id, '2020-02-10 04:59:21.356999 UTC' start_time, '2020-02-10 15:57:11.501000 UTC' end_time
)
SELECT site_id, user_id, session_id, start_time, end_time FROM session_log

I would like to query the session_log so that a session is flagged as "concurrent" if there is another record with the same site_id and user_id whose timeframe overlaps it. (The meaning here is that a user is visiting the same site on more than one device simultaneously.)

Ideally, I need a query that would produce the following, because at least a portion of sessions 14 and 15 overlap for that specific user. Session 26 overlaps sessions 14 & 15 but is not concurrent because it is a different user_id.

| site_id | user_id | session_id | start_time | end_time | concurrent_session |
|---------|---------|------------|------------|----------|--------------------|
| ABC | 1234 | 12 | 2020-02-10 00:29:59.376000 UTC | 2020-02-10 01:13:02.817000 UTC | false |
| ABC | 1234 | 13 | 2020-02-10 02:41:56.330000 UTC | 2020-02-10 02:41:56.389999 UTC | false |
| ABC | 1234 | 14 | 2020-02-10 04:24:46.649999 UTC | 2020-02-10 05:14:08.243000 UTC | true |
| ABC | 1234 | 15 | 2020-02-10 04:59:21.356999 UTC | 2020-02-10 05:57:11.501000 UTC | true |
| ABC | 6789 | 25 | 2020-02-10 02:15:38.560000 UTC | 2020-02-10 02:56:38.784500 UTC | false |
| ABC | 6789 | 26 | 2020-02-10 04:44:21.356999 UTC | 2020-02-10 06:57:11.501000 UTC | false |

I tried to create a user-defined function that would search the table for other sessions (same site_id and user_id, different session_id) with overlapping times, but it's failing miserably. I'm almost embarrassed to put this here, but... what the heck.

CREATE TEMPORARY FUNCTION getConcurrentSessions(_site_id STRING, _user_id INT64, _session_id INT64, _start_time TIMESTAMP, _end_time TIMESTAMP)
AS 
(
    (
        SELECT count(session_id)
        FROM `session_log`
        WHERE site_id = _site_id
        AND user_id = _user_id
        AND session_id != _session_id
        AND (
            (_start_time BETWEEN start_time AND end_time)
            OR
            (_end_time BETWEEN start_time AND end_time)
        )
    )
);
SELECT site_id, user_id, session_id, start_time, end_time,
IF (
  getConcurrentSessions(site_id, user_id, session_id, start_time, end_time) > 0,
  TRUE,
  FALSE
) AS concurrent_sessions 
FROM session_log

Any and all advice appreciated. Thank you.


Solution

  • Consider the approach below:

    select *, 
      ifnull(start_time <= lag(end_time) over win or 
      end_time >= lead(start_time) over win, false) as concurrent_session 
    from your_table
    window win as (partition by site_id, user_id order by start_time)         
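
    To try this without a real table, the sample rows from the question can be inlined as a CTE. The following is only a sketch: it assumes the times should be TIMESTAMP values rather than strings (hence the `timestamp` literals), and the final `order by` is added purely for readable output.

    ```sql
    -- Sample rows copied from the question; times written as TIMESTAMP literals (an assumption).
    with session_log as (
      select 'ABC' as site_id, 1234 as user_id, 12 as session_id,
             timestamp '2020-02-10 00:29:59.376000 UTC' as start_time,
             timestamp '2020-02-10 01:13:02.817000 UTC' as end_time union all
      select 'ABC', 1234, 13, timestamp '2020-02-10 02:41:56.330000 UTC', timestamp '2020-02-10 02:41:56.389999 UTC' union all
      select 'ABC', 1234, 14, timestamp '2020-02-10 04:24:46.649999 UTC', timestamp '2020-02-10 05:14:08.243000 UTC' union all
      select 'ABC', 1234, 15, timestamp '2020-02-10 04:59:21.356999 UTC', timestamp '2020-02-10 15:57:11.501000 UTC' union all
      select 'ABC', 6789, 25, timestamp '2020-02-10 02:15:38.560000 UTC', timestamp '2020-02-10 02:56:38.784500 UTC' union all
      select 'ABC', 6789, 26, timestamp '2020-02-10 04:59:21.356999 UTC', timestamp '2020-02-10 15:57:11.501000 UTC'
    )
    select *,
      -- Compare each session with its neighbors (by start_time) for the same site and user.
      -- The first/last row of a partition has no lag/lead, so NULL results are mapped to false.
      ifnull(start_time <= lag(end_time) over win or
      end_time >= lead(start_time) over win, false) as concurrent_session
    from session_log
    window win as (partition by site_id, user_id order by start_time)
    order by user_id, start_time
    ```

    With these rows, sessions 14 and 15 should come out as true and all other sessions as false. Session 26 stays false because the window is partitioned by user_id.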
    

    If applied to the sample data in your question, the output is:

    [image: screenshot of the query output]
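
    One caveat that may be worth checking against your data: lag and lead only look at the immediately neighboring sessions. If a long session spans several short ones, a later short session is compared only with the short session just before it, so its overlap with the long one could be missed. A possible variant, sketched below with hypothetical rows, replaces lag(end_time) with the running maximum of end_time over all earlier rows in the window. The lead side does not need the same change, because the next row already has the earliest start_time among later rows.

    ```sql
    -- Hypothetical rows: session 1 spans sessions 2 and 3, which do not touch each other.
    with your_table as (
      select 'ABC' as site_id, 1 as user_id, 1 as session_id,
             timestamp '2020-02-10 01:00:00 UTC' as start_time,
             timestamp '2020-02-10 05:00:00 UTC' as end_time union all
      select 'ABC', 1, 2, timestamp '2020-02-10 02:00:00 UTC', timestamp '2020-02-10 02:30:00 UTC' union all
      select 'ABC', 1, 3, timestamp '2020-02-10 04:00:00 UTC', timestamp '2020-02-10 04:30:00 UTC'
    )
    select *,
      ifnull(
        -- Latest end_time among all earlier-starting sessions, not just the previous one.
        start_time <= max(end_time) over (win rows between unbounded preceding and 1 preceding)
        -- The next session by start_time is enough to detect an overlap with any later session.
        or end_time >= lead(start_time) over win,
        false
      ) as concurrent_session
    from your_table
    window win as (partition by site_id, user_id order by start_time)
    order by start_time
    ```

    Here all three sessions should be flagged true. With the plain lag version, session 3 would be compared only with session 2 and come out false.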