I need some help. I have a table in PostgreSQL with logs of user sessions. The table has 3 fields: id, action, and action_date in the timestamp format. The action field has 2 possible values: "page_load" or "page_exit". The action_date field contains the start time and end time of each session, respectively. I need to calculate the average session length for each user who visited the website. The problem is that for some sessions (probably due to errors in data recording), the action field contains a different number of "page_load" or "page_exit" values for the same user (id). Additionally, there can be multiple sessions for the same user within one day. There are also cases where two consecutive "page_load" actions occur with a small time difference, followed by a "page_exit" action, indicating the end of the second "page_load" session:
8658 page_load 2023-02-01 21:58:44.000
8658 page_load 2023-02-02 08:31:32.000
8658 page_exit 2017-01-28 08:42:18.000
Due to the varying number of "page_load" or "page_exit" actions for each user, I'm having trouble performing the calculation. I've tried creating separate datasets by dividing them based on the session type and then performing a join, but it's incorrect due to sessions with different counts of "page_load" and "page_exit" actions. I also attempted to use the lead() function, but I get incorrect calculations due to cases where "page_load" or "page_exit" actions can occur consecutively. I've tried filtering out the rows where the count of "page_load" or "page_exit" actions doesn't match, but I end up with a shift in the data and incorrect calculations as well. Is there any way to eliminate these extra rows and perform calculations only with matching pairs? I've run out of ideas, and I'm hoping for possible assistance from the community. Here is one of the approaches I'm trying to eliminate rows where "page_load" and "page_exit" don't match, in order to obtain two separate columns that I can later use to calculate the average length. However, this query is incorrect because I'm getting a significant difference between "page_load" and "page_exit" that doesn't exist in the original data.
WITH ses_events AS (
SELECT id,
action,
action_date,
ROW_NUMBER() OVER (PARTITION BY id, action ORDER BY action_date) AS event_seq
FROM users_act ),
arr AS (
SELECT id,
array_agg(CASE WHEN action = 'page_load' THEN action_date END) AS ses_start,
array_agg(CASE WHEN action = 'page_exit' THEN action_date END) AS ses_end
FROM ses_events
GROUP BY id, event_seq
HAVING COUNT(*) FILTER (WHERE action IN ('page_load', 'page_exit')) % 2 = 0 ),
dr_arr AS (
SELECT user_id,
unnest(array_remove(ses_start, NULL))::timestamp AS ses_start,
unnest(array_remove(ses_end, NULL))::timestamp AS ses_end FROM arr)
SELECT * FROM dr_arr ORDER BY ses_start, ses_end;
Another attempt to calculate using offsets, but it also returns incorrect data due to rows that lack either "page_load" or "page_exit":
WITH ld AS ( SELECT id,
action,
action_date,
LEAD(action_date) over(PARTITION BY id ORDER BY action_date) AS end_s
FROM users_act ), l AS ( SELECT id,
(end_s - action_date) AS length_ses FROM ld WHERE event = 'page_load' ) SELECT id,
avg(length_ses) AS avg_us_ses FROM l GROUP BY id;
I have tried several different approaches, but none of them have yielded the correct result. I believe that the only thing that can help solve this problem is to pivot the rows and create two separate fields, start_ses (timestamp) and end_ses (timestamp). However, my attempts to write such a solution have not produced the correct result so far. I hope someone can provide guidance or point me towards the right solution for this problem.
You typically want to instoduce a cleaning view in such case containing still the detailed (not aggregated) data, but providing some additional information helping to classify the data.
An example that classifies the terminating records in OK
if a previous record is the opening record and calculates the session duration.
with session_view as (
select
id, "action", action_date,
lag(action_date) over (partition by id order by action_date) action_date_lag,
lag("action") over (partition by id order by action_date) action_lag
from dt)
select
id, "action", action_date,
case
when "action" = 'page_load' then null /* session start */
when "action" = 'page_exit' and action_lag = 'page_load' then 'OK' else 'illegal session' end as session_status,
case when "action" = 'page_exit' then action_date - action_date_lag end as session_duration
from session_view
order by id, action_date
id |action |action_date |session_status |session_duration|
----+---------+-------------------+---------------+----------------+
8658|page_exit|2017-01-28 08:42:18|illegal session| |
8658|page_load|2023-02-01 21:58:44| | |
8658|page_exit|2023-02-01 22:58:44|OK | 01:00:00|
8658|page_load|2023-02-02 08:31:32| | |
8658|page_exit|2023-02-02 08:41:32|OK | 00:10:00|
You will consider only the sessions with the status OK
in the average calculation.
Additionaly you may check the duration and discard outliers using soem threshold for too long or too short sessions.
Alternatively use median which is less error dependent on the outliers.
You may want to use EXTRACT(epoch
to calculate the duration in seconds or other units.