Search code examples
sqlpostgresqlrankgaps-and-islands

how to get 1st column value with 30 minutes rule and sessionize the activity data


I have an activity table called myTable, the data looks like this:

CREATE TABLE myTable
(
    userid      text,
    webid         text,
    "ts" timestamp
);

INSERT INTO myTable
    ("userid", "webid", "ts")
VALUES ('A', '34', '2023-01-31 16:34:49.000'),
       ('A', '97', '2023-01-31 16:34:58.000'),
       ('A', '17', '2023-01-31 17:35:02.000'),
       ('A', '17', '2023-01-31 17:35:07.000'),
       ('A', '17', '2023-01-31 17:35:18.000'),
       ('A', '1', '2023-01-31 17:35:37.000'),
       ('A', '1', '2023-01-31 17:35:38.000'),
       ('A', '77', '2023-01-31 17:35:41.000'),
       ('A', '77', '2023-01-31 17:35:42.000'),
       ('A', '1', '2023-01-31 17:37:10.000'),
       ('A', '1', '2023-01-31 17:37:12.000'),
       ('A', '77', '2023-01-31 17:37:14.000'),
       ('A', '77', '2023-01-31 17:52:14.000'),
       ('A', '77', '2023-01-31 18:12:14.000'),
       ('A', '77', '2023-01-31 18:45:14.000'),
       ('A', '77', '2023-01-31 18:55:15.000'),
       ('B', '33', '2023-01-31 06:37:15.000'),
       ('B', '56', '2023-01-31 06:40:15.000')
  ;

userid webid ts
A 34 2023-01-31 16:34:49
A 97 2023-01-31 17:34:58
A 17 2023-01-31 17:35:02
A 17 2023-01-31 17:35:07
A 17 2023-01-31 17:35:18
A 1 2023-01-31 17:35:37
A 1 2023-01-31 17:35:38
A 77 2023-01-31 17:35:41
A 77 2023-01-31 17:35:42
A 1 2023-01-31 17:37:10
A 1 2023-01-31 17:37:12
A 77 2023-01-31 17:37:14
A 77 2023-01-31 17:52:14
A 77 2023-01-31 18:12:14
A 77 2023-01-31 18:45:14
A 77 2023-01-31 18:55:15
B 33 2023-01-31 06:37:15
B 56 2023-01-31 06:40:15

The output I want to return as following:

userid webid ts first_ts session_id
A 34 2023-01-31 16:34:49 2023-01-31 16:34:49 1
A 97 2023-01-31 17:34:58 2023-01-31 17:34:58 2
A 17 2023-01-31 17:35:02 2023-01-31 17:35:02 3
A 17 2023-01-31 17:35:07 2023-01-31 17:35:02 3
A 17 2023-01-31 17:35:18 2023-01-31 17:35:02 3
A 1 2023-01-31 17:35:37 2023-01-31 17:35:37 4
A 1 2023-01-31 17:35:38 2023-01-31 17:35:37 4
A 77 2023-01-31 17:35:41 2023-01-31 17:35:41 5
A 77 2023-01-31 17:35:42 2023-01-31 17:35:41 5
A 1 2023-01-31 17:37:10 2023-01-31 17:37:10 6
A 1 2023-01-31 17:37:12 2023-01-31 17:37:10 6
A 77 2023-01-31 17:37:14 2023-01-31 17:37:14 7
A 77 2023-01-31 17:40:14 2023-01-31 17:52:14 7
A 77 2023-01-31 18:15:14 2023-01-31 18:12:14 7
A 77 2023-01-31 18:37:15 2023-01-31 18:45:14 8
A 77 2023-01-31 18:37:15 2023-01-31 18:55:15 8
B 33 2023-01-31 06:37:15 2023-01-31 06:37:15 1
B 56 2023-01-31 06:40:15 2023-01-31 06:40:15 2

first_ts means first timestamp. The logic for first_ts is If a userid visits same webid at consecutive timestamp and each timestamp's interval falls within 30 minutes, these will be recorded as a session. For example, row 3, 4, and 5. Each time interval are less than 30 minutes, their first_ts is the timestamp of the first event, which is 2023-01-31 17:35:02.

If a userid visits a webid and jump to another webid and return back to the first webid, the first_ts will be refreshed to current timestamp. For example, row 6, 7, 8, 9, 10, 11. userid A visits webid=1 at first, but jump to webid=77, after return back to webid=1, then row 10 and row 11 should be recorded as a new session with new first_ts.

If a userid visits a webid at consecutive timestamps, and one timestamp interval is greater than 30 minutes, it will also break the session and the first_ts will be refreshed as well. For example, row 12, 13, 14, 15, 16. Between row 14 and row 15, their time interval are 33 minutes (>30), so for row 15 and row 16, their first_ts refreshed to current timestamp and they are to be a new session.

I share my script here, the script haven't included the session_id because I found out it doesn't return correct first_ts for me to do the dense_rank on session_id. extract(epoch()) function is same as datediff() function in sql. Since I am testing the data using Postgresql, it doesn't have datediff() function. so I use exract(epoch()) to calculate time difference.

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY ts) rn1,
              ROW_NUMBER() OVER (PARTITION BY userid, webid ORDER BY ts) rn2
    FROM myTable
)

SELECT userid, webid, ts,
       lag(ts,1) over(partition by userid order by ts) as previous_ts,
       case when extract(epoch from (ts - lag(ts,1) over(partition by userid order by ts asc))) <= 1800 then MIN(ts) OVER (PARTITION BY userid, webid, rn1 - rn2) 
            when extract(epoch from (ts - lag(ts,1) over(partition by userid order by ts asc))) > 1800 then ts
            when lag(ts,1) over(partition by userid order by ts) is NULL then ts
       end as first_ts
FROM cte
ORDER BY userid, ts;

Anyone has a better solution to do the task? Thanks!

script reference: https://dbfiddle.uk/SL7EtQSy


Solution

  • This reads like a gaps-and-islands problem, where islands are consecutive rows that belong to the same user and webid with a lesser gap than 30 minutes.

    Here is one way to do it:

    select t.*, min(ts) over(partition by userid, session_id order by ts) first_ts
    from (
        select t.*,
            1 + count(*) 
                filter(where webid != lag_webid or ts > lag_ts + interval '30' minute) 
                over(partition by userid order by ts) session_id
        from (
            select t.*, 
                lag(webid, 1, webid) over(partition by userid order by ts) lag_webid,
                lag(ts, 1, ts)       over(partition by userid order by ts) lag_ts
            from mytable t
        ) t
    ) t
    

    The most inner subquery retrieves the "last" webid and timestamp for each user. With this information at hand, we can identify each group of records with a conditional count of gaps (column session_id). The last step is to recover the first timestamp of each island, that is first_ts.

    In your DB Fiddle, this produces:

    userid webid ts lag_webid lag_ts session_id first_ts
    A 34 2023-01-31 16:34:49 34 2023-01-31 16:34:49 1 2023-01-31 16:34:49
    A 97 2023-01-31 16:34:58 34 2023-01-31 16:34:49 2 2023-01-31 16:34:58
    A 17 2023-01-31 17:35:02 97 2023-01-31 16:34:58 3 2023-01-31 17:35:02
    A 17 2023-01-31 17:35:07 17 2023-01-31 17:35:02 3 2023-01-31 17:35:02
    A 17 2023-01-31 17:35:18 17 2023-01-31 17:35:07 3 2023-01-31 17:35:02
    A 1 2023-01-31 17:35:37 17 2023-01-31 17:35:18 4 2023-01-31 17:35:37
    A 1 2023-01-31 17:35:38 1 2023-01-31 17:35:37 4 2023-01-31 17:35:37
    A 77 2023-01-31 17:35:41 1 2023-01-31 17:35:38 5 2023-01-31 17:35:41
    A 77 2023-01-31 17:35:42 77 2023-01-31 17:35:41 5 2023-01-31 17:35:41
    A 1 2023-01-31 17:37:10 77 2023-01-31 17:35:42 6 2023-01-31 17:37:10
    A 1 2023-01-31 17:37:12 1 2023-01-31 17:37:10 6 2023-01-31 17:37:10
    A 77 2023-01-31 17:37:14 1 2023-01-31 17:37:12 7 2023-01-31 17:37:14
    A 77 2023-01-31 17:52:14 77 2023-01-31 17:37:14 7 2023-01-31 17:37:14
    A 77 2023-01-31 18:12:14 77 2023-01-31 17:52:14 7 2023-01-31 17:37:14
    A 77 2023-01-31 18:45:14 77 2023-01-31 18:12:14 8 2023-01-31 18:45:14
    A 77 2023-01-31 18:55:15 77 2023-01-31 18:45:14 8 2023-01-31 18:45:14
    B 33 2023-01-31 06:37:15 33 2023-01-31 06:37:15 1 2023-01-31 06:37:15
    B 56 2023-01-31 06:40:15 33 2023-01-31 06:37:15 2 2023-01-31 06:40:15

    Side note: the conditional count is somehow Postgres specific (few databases support the filter clause to aggregate/window functions). A more generic expression would be to replace this:

    count(*) 
        filter(where webid != lag_webid or ts > lag_ts + interval '30' minute) 
        over(partition by userid order by ts) session_id
    

    With:

    sum(case when webid != lag_webid or ts > lag_ts + interval '30' minute then 1 else 0 end) 
        over(partition by userid order by ts) session_id