I have a table that stores stream data from a website. Within it, all records between a particular starting combination of column values and a particular ending combination are considered a session. However, the sessions are not flagged in any way.
In the attached picture, I'm trying to generate the "New Column". Basically, a row with Click, Landing Page, Main Page is the start of a session, and a row with Time Out, Landing Page is the end of the session.
Is it possible to generate this in SQL? Any ideas or pointers are much appreciated. Thanks!!
Edit: I'm not having much luck with SQL. My last resort is to attempt with Python. Any experts in Python to help?
First, because you don't have a user ID for the events in your table, you can't determine which "session" a particular event belongs to, except by tracking the start of the "session" when the "Click - Landing Page - Main Page" event occurs.
Second, consider your answer to this question:
> What if the row with "Time Out, Landing Page" never happens before the next "Click, Landing Page, Main Page"? What if there are actions recorded after the timeout but before the next session?
>
> They do. But it's very rare and the business doesn't mind losing those sessions. The main focus is to identify what happens within the session, so such outliers are fine.

Given that, tracking the "Time Out - Landing Page" event to detect the end of a session does not make sense, since it can be written after the start of a new session or not written at all.
From these two points it follows that only the beginning of a session can be determined reliably: a session ends where the next one begins, and every event belongs to the current session until a new one starts.
If all of the above is correct, you can use a simple query like this:
SELECT
    'Session ' ||
    -- Running count of session-start rows, in timestamp order
    SUM(CASE WHEN event = 'Click'
              AND currentpage = 'Landing Page'
              AND destinationpage = 'Main Page' THEN 1
        END) OVER (ORDER BY timestamp) AS sessionid,
    *
FROM events
ORDER BY timestamp;
The idea is to compute a running sum of a flag that is set to 1 on every row that starts a new session. All events within a session then share the same running-sum value, which serves as the session number.
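Since the question also asks about Python, the same running-sum idea can be sketched without a database. This is only an illustration under assumptions: the `events` list, its tuple layout, and the helper names `is_session_start` and `label_sessions` are hypothetical, and the rows are assumed to be already sorted by timestamp.

```python
from itertools import accumulate

# Hypothetical sample rows: (timestamp, event, currentpage, destinationpage).
# Assumed to be already sorted by timestamp.
events = [
    ("2022-03-08 14:08:07", "Click", "Landing Page", "Main Page"),
    ("2022-03-08 14:08:07", "Redirect", "Main Page", "Videos"),
    ("2022-03-08 14:09:57", "Time Out", "Page 5", "Landing Page"),
    ("2022-09-08 15:28:02", "Click", "Landing Page", "Main Page"),
    ("2022-09-08 15:28:02", "Redirect", "Main Page", "Videos"),
]


def is_session_start(row):
    """Return 1 for the 'Click, Landing Page, Main Page' event, else 0."""
    _, event, currentpage, destinationpage = row
    return int((event, currentpage, destinationpage)
               == ("Click", "Landing Page", "Main Page"))


def label_sessions(rows):
    """Pair each row with a label built from a running count of session starts."""
    running = accumulate(is_session_start(r) for r in rows)
    # A count of 0 means no session start has been seen yet, so no label.
    return [(f"Session {n}" if n else None, *r) for n, r in zip(running, rows)]


labeled = label_sessions(events)
for row in labeled:
    print(row)
```

One difference from the SQL version: a window ordered by timestamp treats rows with equal timestamps as peers, while this sketch relies purely on the order of the list to break ties.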
Query output
sessionid | timestamp | event | currentpage | destinationpage | details |
---|---|---|---|---|---|
Session 1 | 2022-03-08 14:08:07 | Click | Landing Page | Main Page | |
Session 1 | 2022-03-08 14:08:07 | Redirect | Main Page | Videos | |
Session 1 | 2022-03-08 14:08:09 | Click | Videos | Video 1 Playing | |
Session 1 | 2022-03-08 14:08:09 | Click | Videos | Video 2 Playing | |
Session 1 | 2022-03-08 14:08:30 | Click | Videos | Content | |
Session 1 | 2022-03-08 14:08:36 | Random System Msg | Content | | |
Session 1 | 2022-03-08 14:08:36 | Random System Msg | Content | | |
Session 1 | 2022-03-08 14:08:37 | Click | Content | Page 5 | |
Session 1 | 2022-03-08 14:09:56 | Random System Msg | Page 5 | | |
Session 1 | 2022-03-08 14:09:57 | Time Out | Page 5 | Landing Page | |
Session 2 | 2022-09-08 15:28:02 | Click | Landing Page | Main Page | |
Session 2 | 2022-09-08 15:28:02 | Redirect | Main Page | Videos | |
Session 2 | 2022-09-08 15:28:02 | Click | Videos | Video 1 Playing | |
Session 2 | 2022-09-08 15:28:02 | Click | Videos | Video 2 Playing | |
Session 2 | 2022-09-08 15:28:02 | Time Out | Page 5 | Landing Page | |
I used PostgreSQL, with the SUM() aggregate function applied as a window function via OVER(), to demonstrate my approach, but Amazon Athena should support it as well.
You can check a working demo here
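If neither PostgreSQL nor Athena is at hand, the query can also be tried locally from Python. The sketch below swaps in SQLite through the standard `sqlite3` module purely for convenience, and assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The sample rows are a reduced, hypothetical subset of the data, and the final `SELECT` lists only a few columns to keep the output short.

```python
import sqlite3

# In-memory SQLite database; window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (timestamp TEXT, event TEXT, "
    "currentpage TEXT, destinationpage TEXT)"
)
# Hypothetical sample rows, reduced from the table in the answer.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2022-03-08 14:08:07", "Click", "Landing Page", "Main Page"),
        ("2022-03-08 14:08:09", "Click", "Videos", "Video 1 Playing"),
        ("2022-03-08 14:09:57", "Time Out", "Page 5", "Landing Page"),
        ("2022-09-08 15:28:02", "Click", "Landing Page", "Main Page"),
        ("2022-09-08 15:28:03", "Redirect", "Main Page", "Videos"),
    ],
)

# Same running-sum technique as the PostgreSQL query.
query = """
SELECT
    'Session ' ||
    SUM(CASE WHEN event = 'Click'
              AND currentpage = 'Landing Page'
              AND destinationpage = 'Main Page' THEN 1
        END) OVER (ORDER BY timestamp) AS sessionid,
    timestamp,
    event
FROM events
ORDER BY timestamp
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Rows that appear before the first session start would get a NULL `sessionid`, because the running sum has nothing to add up yet and concatenating with NULL yields NULL.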