I have an activity table called myTable; the data looks like this:
CREATE TABLE myTable
(
userid text,
webid text,
"ts" timestamp
);
INSERT INTO myTable
("userid", "webid", "ts")
VALUES ('A', '34', '2023-01-31 16:34:49.000'),
('A', '97', '2023-01-31 16:34:58.000'),
('A', '17', '2023-01-31 17:35:02.000'),
('A', '17', '2023-01-31 17:35:07.000'),
('A', '17', '2023-01-31 17:35:18.000'),
('A', '1', '2023-01-31 17:35:37.000'),
('A', '1', '2023-01-31 17:35:38.000'),
('A', '77', '2023-01-31 17:35:41.000'),
('A', '77', '2023-01-31 17:35:42.000'),
('A', '1', '2023-01-31 17:37:10.000'),
('A', '1', '2023-01-31 17:37:12.000'),
('A', '77', '2023-01-31 17:37:14.000'),
('A', '77', '2023-01-31 17:52:14.000'),
('A', '77', '2023-01-31 18:12:14.000'),
('A', '77', '2023-01-31 18:45:14.000'),
('A', '77', '2023-01-31 18:55:15.000'),
('B', '33', '2023-01-31 06:37:15.000'),
('B', '56', '2023-01-31 06:40:15.000')
;
userid | webid | ts |
---|---|---|
A | 34 | 2023-01-31 16:34:49 |
A | 97 | 2023-01-31 16:34:58 |
A | 17 | 2023-01-31 17:35:02 |
A | 17 | 2023-01-31 17:35:07 |
A | 17 | 2023-01-31 17:35:18 |
A | 1 | 2023-01-31 17:35:37 |
A | 1 | 2023-01-31 17:35:38 |
A | 77 | 2023-01-31 17:35:41 |
A | 77 | 2023-01-31 17:35:42 |
A | 1 | 2023-01-31 17:37:10 |
A | 1 | 2023-01-31 17:37:12 |
A | 77 | 2023-01-31 17:37:14 |
A | 77 | 2023-01-31 17:52:14 |
A | 77 | 2023-01-31 18:12:14 |
A | 77 | 2023-01-31 18:45:14 |
A | 77 | 2023-01-31 18:55:15 |
B | 33 | 2023-01-31 06:37:15 |
B | 56 | 2023-01-31 06:40:15 |
The output I want to return is as follows:
userid | webid | ts | first_ts | session_id |
---|---|---|---|---|
A | 34 | 2023-01-31 16:34:49 | 2023-01-31 16:34:49 | 1 |
A | 97 | 2023-01-31 16:34:58 | 2023-01-31 16:34:58 | 2 |
A | 17 | 2023-01-31 17:35:02 | 2023-01-31 17:35:02 | 3 |
A | 17 | 2023-01-31 17:35:07 | 2023-01-31 17:35:02 | 3 |
A | 17 | 2023-01-31 17:35:18 | 2023-01-31 17:35:02 | 3 |
A | 1 | 2023-01-31 17:35:37 | 2023-01-31 17:35:37 | 4 |
A | 1 | 2023-01-31 17:35:38 | 2023-01-31 17:35:37 | 4 |
A | 77 | 2023-01-31 17:35:41 | 2023-01-31 17:35:41 | 5 |
A | 77 | 2023-01-31 17:35:42 | 2023-01-31 17:35:41 | 5 |
A | 1 | 2023-01-31 17:37:10 | 2023-01-31 17:37:10 | 6 |
A | 1 | 2023-01-31 17:37:12 | 2023-01-31 17:37:10 | 6 |
A | 77 | 2023-01-31 17:37:14 | 2023-01-31 17:37:14 | 7 |
A | 77 | 2023-01-31 17:52:14 | 2023-01-31 17:37:14 | 7 |
A | 77 | 2023-01-31 18:12:14 | 2023-01-31 17:37:14 | 7 |
A | 77 | 2023-01-31 18:45:14 | 2023-01-31 18:45:14 | 8 |
A | 77 | 2023-01-31 18:55:15 | 2023-01-31 18:45:14 | 8 |
B | 33 | 2023-01-31 06:37:15 | 2023-01-31 06:37:15 | 1 |
B | 56 | 2023-01-31 06:40:15 | 2023-01-31 06:40:15 | 2 |
first_ts means first timestamp. The logic for first_ts is: if a userid visits the same webid at consecutive timestamps, and each interval between them falls within 30 minutes, those rows are recorded as one session. For example, rows 3, 4, and 5: each interval is less than 30 minutes, so their first_ts is the timestamp of the first event, 2023-01-31 17:35:02.
If a userid visits a webid, jumps to another webid, and then returns to the first one, first_ts is refreshed to the current timestamp. For example, rows 6 through 11: userid A visits webid = 1 first, jumps to webid = 77, and then returns to webid = 1, so rows 10 and 11 should be recorded as a new session with a new first_ts.
If a userid visits a webid at consecutive timestamps and one interval is greater than 30 minutes, that also breaks the session and refreshes first_ts. For example, rows 12 through 16: the interval between rows 14 and 15 is 33 minutes (> 30), so rows 15 and 16 get a refreshed first_ts and become a new session.
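In other words, the two break conditions (a change of webid, or a gap of more than 30 minutes) can be flagged per row with lag(); a minimal sketch against the table above:

```sql
SELECT *,
       (webid IS DISTINCT FROM LAG(webid) OVER w          -- webid changed (or first row)
        OR ts - LAG(ts) OVER w > interval '30 minutes')   -- gap longer than 30 minutes
       AS new_session
FROM myTable
WINDOW w AS (PARTITION BY userid ORDER BY ts);
```

Every row where new_session is true starts a new session.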
I share my script here. It doesn't include session_id yet, because I found it doesn't return the correct first_ts for me to apply dense_rank on. extract(epoch from ...) serves the same purpose as datediff() in other SQL dialects; since I am testing the data in PostgreSQL, which has no datediff() function, I use extract(epoch from ...) to calculate the time difference.
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY userid ORDER BY ts) rn1,
           ROW_NUMBER() OVER (PARTITION BY userid, webid ORDER BY ts) rn2
    FROM myTable
)
SELECT userid, webid, ts,
       LAG(ts, 1) OVER (PARTITION BY userid ORDER BY ts) AS previous_ts,
       CASE WHEN EXTRACT(EPOCH FROM (ts - LAG(ts, 1) OVER (PARTITION BY userid ORDER BY ts))) <= 1800
                THEN MIN(ts) OVER (PARTITION BY userid, webid, rn1 - rn2)
            WHEN EXTRACT(EPOCH FROM (ts - LAG(ts, 1) OVER (PARTITION BY userid ORDER BY ts))) > 1800
                THEN ts
            WHEN LAG(ts, 1) OVER (PARTITION BY userid ORDER BY ts) IS NULL
                THEN ts
       END AS first_ts
FROM cte
ORDER BY userid, ts;
Does anyone have a better solution for this task? Thanks!
script reference: https://dbfiddle.uk/SL7EtQSy
This reads like a gaps-and-islands problem, where the islands are consecutive rows for the same user and webid separated by gaps of less than 30 minutes.
Here is one way to do it:
select t.*, min(ts) over(partition by userid, session_id order by ts) first_ts
from (
select t.*,
1 + count(*)
filter(where webid != lag_webid or ts > lag_ts + interval '30' minute)
over(partition by userid order by ts) session_id
from (
select t.*,
lag(webid, 1, webid) over(partition by userid order by ts) lag_webid,
lag(ts, 1, ts) over(partition by userid order by ts) lag_ts
from mytable t
) t
) t
The innermost subquery retrieves the previous webid and timestamp for each user. With this information at hand, we can identify each group of records with a conditional count of gaps (column session_id). The last step is to recover the first timestamp of each island, which is first_ts.
In your DB Fiddle, this produces:
userid | webid | ts | lag_webid | lag_ts | session_id | first_ts |
---|---|---|---|---|---|---|
A | 34 | 2023-01-31 16:34:49 | 34 | 2023-01-31 16:34:49 | 1 | 2023-01-31 16:34:49 |
A | 97 | 2023-01-31 16:34:58 | 34 | 2023-01-31 16:34:49 | 2 | 2023-01-31 16:34:58 |
A | 17 | 2023-01-31 17:35:02 | 97 | 2023-01-31 16:34:58 | 3 | 2023-01-31 17:35:02 |
A | 17 | 2023-01-31 17:35:07 | 17 | 2023-01-31 17:35:02 | 3 | 2023-01-31 17:35:02 |
A | 17 | 2023-01-31 17:35:18 | 17 | 2023-01-31 17:35:07 | 3 | 2023-01-31 17:35:02 |
A | 1 | 2023-01-31 17:35:37 | 17 | 2023-01-31 17:35:18 | 4 | 2023-01-31 17:35:37 |
A | 1 | 2023-01-31 17:35:38 | 1 | 2023-01-31 17:35:37 | 4 | 2023-01-31 17:35:37 |
A | 77 | 2023-01-31 17:35:41 | 1 | 2023-01-31 17:35:38 | 5 | 2023-01-31 17:35:41 |
A | 77 | 2023-01-31 17:35:42 | 77 | 2023-01-31 17:35:41 | 5 | 2023-01-31 17:35:41 |
A | 1 | 2023-01-31 17:37:10 | 77 | 2023-01-31 17:35:42 | 6 | 2023-01-31 17:37:10 |
A | 1 | 2023-01-31 17:37:12 | 1 | 2023-01-31 17:37:10 | 6 | 2023-01-31 17:37:10 |
A | 77 | 2023-01-31 17:37:14 | 1 | 2023-01-31 17:37:12 | 7 | 2023-01-31 17:37:14 |
A | 77 | 2023-01-31 17:52:14 | 77 | 2023-01-31 17:37:14 | 7 | 2023-01-31 17:37:14 |
A | 77 | 2023-01-31 18:12:14 | 77 | 2023-01-31 17:52:14 | 7 | 2023-01-31 17:37:14 |
A | 77 | 2023-01-31 18:45:14 | 77 | 2023-01-31 18:12:14 | 8 | 2023-01-31 18:45:14 |
A | 77 | 2023-01-31 18:55:15 | 77 | 2023-01-31 18:45:14 | 8 | 2023-01-31 18:45:14 |
B | 33 | 2023-01-31 06:37:15 | 33 | 2023-01-31 06:37:15 | 1 | 2023-01-31 06:37:15 |
B | 56 | 2023-01-31 06:40:15 | 33 | 2023-01-31 06:37:15 | 2 | 2023-01-31 06:40:15 |
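If you prefer, the nested subqueries can be unrolled into chained CTEs; this is purely a stylistic choice, the logic is identical:

```sql
with lagged as (
    select t.*,
           lag(webid, 1, webid) over(partition by userid order by ts) lag_webid,
           lag(ts, 1, ts) over(partition by userid order by ts) lag_ts
    from mytable t
),
flagged as (
    select l.*,
           1 + count(*)
               filter(where webid != lag_webid or ts > lag_ts + interval '30' minute)
               over(partition by userid order by ts) session_id
    from lagged l
)
select f.*, min(ts) over(partition by userid, session_id order by ts) first_ts
from flagged f
order by userid, ts;
```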
Side note: the conditional count is somewhat Postgres-specific (few databases support the filter clause on aggregate/window functions). A more generic expression would replace this:
count(*)
filter(where webid != lag_webid or ts > lag_ts + interval '30' minute)
over(partition by userid order by ts) session_id
With:
sum(case when webid != lag_webid or ts > lag_ts + interval '30' minute then 1 else 0 end)
over(partition by userid order by ts) session_id
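Putting it together, the portable version of the whole query would look like this (only the session_id expression changes):

```sql
select t.*, min(ts) over(partition by userid, session_id order by ts) first_ts
from (
    select t.*,
           1 + sum(case when webid != lag_webid
                          or ts > lag_ts + interval '30' minute
                    then 1 else 0 end)
               over(partition by userid order by ts) session_id
    from (
        select t.*,
               lag(webid, 1, webid) over(partition by userid order by ts) lag_webid,
               lag(ts, 1, ts) over(partition by userid order by ts) lag_ts
        from mytable t
    ) t
) t
order by userid, ts;
```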