python sql data-manipulation amazon-athena

How to generate an identifier for a set of rows using SQL?


I have a table that stores stream data from the website. Within it, all records between a particular start and end combination of column values are considered a session. However, the session is not flagged anywhere in the data.

In the attached picture, I'm trying to generate the "New Column". Basically, a row with Click, Landing Page, Main Page is the start of a session, and a row with Time Out, Landing Page is the end of the session.

Is it possible to generate this in SQL? Any ideas or pointers are much appreciated. Thanks!!

Edit: I'm not having much luck with SQL. My last resort is to attempt with Python. Any experts in Python to help?


Solution

  • First, because the events in your table have no user ID, you can't determine which "session" a particular event belongs to, except by tracking where a "session" starts, i.e. where the "Click - Landing Page - Main Page" event occurs.

    Second, according to your answer to this question:

    What if row with "Time Out, Landing Page" never happens before next "Click, Landing Page, Main Page"? What if there are actions recorded after timeout but before next session?

    They do.. But it's very rare and the business doesn't mind losing those sessions. But the main concentration is to identify what happens within the session. So it's fine with such outliers

    tracking the "Time Out - Landing Page" event to detect the end of a session does not make sense (it can be written after a new session has already started, or not written at all).

    From these two points it follows that only the beginning of a session can be determined: a session ends where the next one begins, and every event belongs to the current session until a new one starts.

    If all of the above is correct, you can use a simple query like this

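    SELECT 
        -- Running count of session-start rows, ordered by time
        'Session ' || CAST(
            SUM(CASE WHEN event = 'Click' AND currentpage = 'Landing Page' AND destinationpage = 'Main Page' THEN 1
                ELSE 0  -- without ELSE, rows before the first start would get a NULL label
                END) OVER (ORDER BY timestamp)
            AS varchar) AS sessionid,  -- explicit cast so || concatenates two strings
        *
    FROM events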
    

    The idea is to compute a running sum of a flag that is set to 1 on every row that starts a new session. All events within a session then share the same sum value.

    Query output

    sessionid timestamp event currentpage destinationpage details
    Session 1 2022-03-08 14:08:07 Click Landing Page Main Page
    Session 1 2022-03-08 14:08:07 Redirect Main Page Videos
    Session 1 2022-03-08 14:08:09 Click Videos Video 1 Playing
    Session 1 2022-03-08 14:08:09 Click Videos Video 2 Playing
    Session 1 2022-03-08 14:08:30 Click Videos Content
    Session 1 2022-03-08 14:08:36 Random System Msg Content
    Session 1 2022-03-08 14:08:36 Random System Msg Content
    Session 1 2022-03-08 14:08:37 Click Content Page 5
    Session 1 2022-03-08 14:09:56 Random System Msg Page 5
    Session 1 2022-03-08 14:09:57 Time Out Page 5 Landing Page
    Session 2 2022-09-08 15:28:02 Click Landing Page Main Page
    Session 2 2022-09-08 15:28:02 Redirect Main Page Videos
    Session 2 2022-09-08 15:28:02 Click Videos Video 1 Playing
    Session 2 2022-09-08 15:28:02 Click Videos Video 2 Playing
    Session 2 2022-09-08 15:28:02 Time Out Page 5 Landing Page

    I used PostgreSQL to demonstrate the approach, with SUM() applied as a window function through the OVER clause. Amazon Athena supports window functions as well, so the same approach should work there.

    You can check a working demo here
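
Since the question's edit asks about Python, the same running-sum idea can be sketched there too. This is a minimal sketch, not a tested Athena pipeline: it assumes the rows are already loaded into memory and sorted by timestamp, and the names `events`, `SESSION_START` and `label_sessions` are made up for illustration. The sample rows are a few of the rows from the query output above.

```python
from itertools import accumulate

# A few rows from the query output above, as
# (timestamp, event, currentpage, destinationpage), already sorted by timestamp.
events = [
    ("2022-03-08 14:08:07", "Click", "Landing Page", "Main Page"),
    ("2022-03-08 14:08:07", "Redirect", "Main Page", "Videos"),
    ("2022-03-08 14:09:57", "Time Out", "Page 5", "Landing Page"),
    ("2022-09-08 15:28:02", "Click", "Landing Page", "Main Page"),
    ("2022-09-08 15:28:02", "Redirect", "Main Page", "Videos"),
]

# Hypothetical constant: the column combination that opens a session.
SESSION_START = ("Click", "Landing Page", "Main Page")


def label_sessions(rows):
    """Return one 'Session N' label per row, in input order."""
    # 1 where a row opens a new session, 0 otherwise.
    flags = [1 if row[1:4] == SESSION_START else 0 for row in rows]
    # Running total of the flags: each row gets the number of starts seen so far.
    return [f"Session {n}" for n in accumulate(flags)]


labels = label_sessions(events)
for label, row in zip(labels, events):
    print(label, *row)
```

Unlike the SQL window, this walks the rows strictly in list order, so rows that share a timestamp are labeled according to where they sit in the input. Rows that appear before the first session start would be labeled "Session 0".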