how to get 1st column value with 30 minutes rule and sessionize the activity data

I have an activity table called myTable, the data looks like this:

CREATE TABLE myTable
(
    userid      text,
    webid         text,
    "ts" timestamp
);

INSERT INTO myTable
    ("userid", "webid", "ts")
VALUES ('A', '34', '2023-01-31 16:34:49.000'),
       ('A', '97', '2023-01-31 16:34:58.000'),
       ('A', '17', '2023-01-31 17:35:02.000'),
       ('A', '17', '2023-01-31 17:35:07.000'),
       ('A', '17', '2023-01-31 17:35:18.000'),
       ('A', '1', '2023-01-31 17:35:37.000'),
       ('A', '1', '2023-01-31 17:35:38.000'),
       ('A', '77', '2023-01-31 17:35:41.000'),
       ('A', '77', '2023-01-31 17:35:42.000'),
       ('A', '1', '2023-01-31 17:37:10.000'),
       ('A', '1', '2023-01-31 17:37:12.000'),
       ('A', '77', '2023-01-31 17:37:14.000'),
       ('A', '77', '2023-01-31 17:52:14.000'),
       ('A', '77', '2023-01-31 18:12:14.000'),
       ('A', '77', '2023-01-31 18:45:14.000'),
       ('A', '77', '2023-01-31 18:55:15.000'),
       ('B', '33', '2023-01-31 06:37:15.000'),
       ('B', '56', '2023-01-31 06:40:15.000')
  ;

userid	webid	ts
A	34	2023-01-31 16:34:49
A	97	2023-01-31 17:34:58
A	17	2023-01-31 17:35:02
A	17	2023-01-31 17:35:07
A	17	2023-01-31 17:35:18
A	1	2023-01-31 17:35:37
A	1	2023-01-31 17:35:38
A	77	2023-01-31 17:35:41
A	77	2023-01-31 17:35:42
A	1	2023-01-31 17:37:10
A	1	2023-01-31 17:37:12
A	77	2023-01-31 17:37:14
A	77	2023-01-31 17:52:14
A	77	2023-01-31 18:12:14
A	77	2023-01-31 18:45:14
A	77	2023-01-31 18:55:15
B	33	2023-01-31 06:37:15
B	56	2023-01-31 06:40:15

The output I want to return as following:

userid	webid	ts	first_ts	session_id
A	34	2023-01-31 16:34:49	2023-01-31 16:34:49	1
A	97	2023-01-31 17:34:58	2023-01-31 17:34:58	2
A	17	2023-01-31 17:35:02	2023-01-31 17:35:02	3
A	17	2023-01-31 17:35:07	2023-01-31 17:35:02	3
A	17	2023-01-31 17:35:18	2023-01-31 17:35:02	3
A	1	2023-01-31 17:35:37	2023-01-31 17:35:37	4
A	1	2023-01-31 17:35:38	2023-01-31 17:35:37	4
A	77	2023-01-31 17:35:41	2023-01-31 17:35:41	5
A	77	2023-01-31 17:35:42	2023-01-31 17:35:41	5
A	1	2023-01-31 17:37:10	2023-01-31 17:37:10	6
A	1	2023-01-31 17:37:12	2023-01-31 17:37:10	6
A	77	2023-01-31 17:37:14	2023-01-31 17:37:14	7
A	77	2023-01-31 17:40:14	2023-01-31 17:52:14	7
A	77	2023-01-31 18:15:14	2023-01-31 18:12:14	7
A	77	2023-01-31 18:37:15	2023-01-31 18:45:14	8
A	77	2023-01-31 18:37:15	2023-01-31 18:55:15	8
B	33	2023-01-31 06:37:15	2023-01-31 06:37:15	1
B	56	2023-01-31 06:40:15	2023-01-31 06:40:15	2

first_ts means first timestamp. The logic for first_ts is If a userid visits same webid at consecutive timestamp and each timestamp's interval falls within 30 minutes, these will be recorded as a session. For example, row 3, 4, and 5. Each time interval are less than 30 minutes, their first_ts is the timestamp of the first event, which is 2023-01-31 17:35:02.

If a userid visits a webid and jump to another webid and return back to the first webid, the first_ts will be refreshed to current timestamp. For example, row 6, 7, 8, 9, 10, 11. userid A visits webid=1 at first, but jump to webid=77, after return back to webid=1, then row 10 and row 11 should be recorded as a new session with new first_ts.

If a userid visits a webid at consecutive timestamps, and one timestamp interval is greater than 30 minutes, it will also break the session and the first_ts will be refreshed as well. For example, row 12, 13, 14, 15, 16. Between row 14 and row 15, their time interval are 33 minutes (>30), so for row 15 and row 16, their first_ts refreshed to current timestamp and they are to be a new session.

I share my script here, the script haven't included the session_id because I found out it doesn't return correct first_ts for me to do the dense_rank on session_id. extract(epoch()) function is same as datediff() function in sql. Since I am testing the data using Postgresql, it doesn't have datediff() function. so I use exract(epoch()) to calculate time difference.

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY ts) rn1,
              ROW_NUMBER() OVER (PARTITION BY userid, webid ORDER BY ts) rn2
    FROM myTable
)

SELECT userid, webid, ts,
       lag(ts,1) over(partition by userid order by ts) as previous_ts,
       case when extract(epoch from (ts - lag(ts,1) over(partition by userid order by ts asc))) <= 1800 then MIN(ts) OVER (PARTITION BY userid, webid, rn1 - rn2) 
            when extract(epoch from (ts - lag(ts,1) over(partition by userid order by ts asc))) > 1800 then ts
            when lag(ts,1) over(partition by userid order by ts) is NULL then ts
       end as first_ts
FROM cte
ORDER BY userid, ts;

Anyone has a better solution to do the task? Thanks!

script reference: https://dbfiddle.uk/SL7EtQSy

Solution

This reads like a gaps-and-islands problem, where islands are consecutive rows that belong to the same user and webid with a lesser gap than 30 minutes.

Here is one way to do it:

select t.*, min(ts) over(partition by userid, session_id order by ts) first_ts
from (
    select t.*,
        1 + count(*) 
            filter(where webid != lag_webid or ts > lag_ts + interval '30' minute) 
            over(partition by userid order by ts) session_id
    from (
        select t.*, 
            lag(webid, 1, webid) over(partition by userid order by ts) lag_webid,
            lag(ts, 1, ts)       over(partition by userid order by ts) lag_ts
        from mytable t
    ) t
) t

The most inner subquery retrieves the "last" webid and timestamp for each user. With this information at hand, we can identify each group of records with a conditional count of gaps (column session_id). The last step is to recover the first timestamp of each island, that is first_ts.

In your DB Fiddle, this produces:

userid	webid	ts	lag_webid	lag_ts	session_id	first_ts
A	34	2023-01-31 16:34:49	34	2023-01-31 16:34:49	1	2023-01-31 16:34:49
A	97	2023-01-31 16:34:58	34	2023-01-31 16:34:49	2	2023-01-31 16:34:58
A	17	2023-01-31 17:35:02	97	2023-01-31 16:34:58	3	2023-01-31 17:35:02
A	17	2023-01-31 17:35:07	17	2023-01-31 17:35:02	3	2023-01-31 17:35:02
A	17	2023-01-31 17:35:18	17	2023-01-31 17:35:07	3	2023-01-31 17:35:02
A	1	2023-01-31 17:35:37	17	2023-01-31 17:35:18	4	2023-01-31 17:35:37
A	1	2023-01-31 17:35:38	1	2023-01-31 17:35:37	4	2023-01-31 17:35:37
A	77	2023-01-31 17:35:41	1	2023-01-31 17:35:38	5	2023-01-31 17:35:41
A	77	2023-01-31 17:35:42	77	2023-01-31 17:35:41	5	2023-01-31 17:35:41
A	1	2023-01-31 17:37:10	77	2023-01-31 17:35:42	6	2023-01-31 17:37:10
A	1	2023-01-31 17:37:12	1	2023-01-31 17:37:10	6	2023-01-31 17:37:10
A	77	2023-01-31 17:37:14	1	2023-01-31 17:37:12	7	2023-01-31 17:37:14
A	77	2023-01-31 17:52:14	77	2023-01-31 17:37:14	7	2023-01-31 17:37:14
A	77	2023-01-31 18:12:14	77	2023-01-31 17:52:14	7	2023-01-31 17:37:14
A	77	2023-01-31 18:45:14	77	2023-01-31 18:12:14	8	2023-01-31 18:45:14
A	77	2023-01-31 18:55:15	77	2023-01-31 18:45:14	8	2023-01-31 18:45:14
B	33	2023-01-31 06:37:15	33	2023-01-31 06:37:15	1	2023-01-31 06:37:15
B	56	2023-01-31 06:40:15	33	2023-01-31 06:37:15	2	2023-01-31 06:40:15

Side note: the conditional count is somehow Postgres specific (few databases support the filter clause to aggregate/window functions). A more generic expression would be to replace this:

count(*) 
    filter(where webid != lag_webid or ts > lag_ts + interval '30' minute) 
    over(partition by userid order by ts) session_id

With:

sum(case when webid != lag_webid or ts > lag_ts + interval '30' minute then 1 else 0 end) 
    over(partition by userid order by ts) session_id