Search code examples
sqlamazon-athenaprestotrino

Grouping data based on a value in Athena


I am trying to create new id based on logon page name. Whenever logon page_name is encountered, it will be considered as one session until it finds next logon page name. The new id will be generated based on session count for that id.

Sample data set:

id page_name visited_time
ABC-123 logon 2023-02-23 04:04:40.000
ABC-123 smscode 2023-02-23 04:20:40.000
ABC-123 acct balance 2023-02-23 04:21:40.000
ABC-123 logon 2023-02-23 04:54:40.000
ABC-123 transfer 2023-02-23 04:54:40.000
CDE-123 logon 2023-02-23 04:58:40.000

SQL used:

with session_cnt
as
(
select id, page_name, visited_time, 
row_number() over (partition by id, page_name order by visited_time asc) as session_count, 
visited_time
from table
)
select id, page_name, visited_time, session_count, array_join(array_agg(session_count), '-', 'null)) as new_id
from session_cnt

Above SQL seems to return new id based on other page names too. I would like to group the session count based on id and page_name=logon.

Expected result:

id page_name visited_time session_count new_id
ABC-123 logon 2023-02-23 04:04:40.000 1 ABC-123-1
ABC-123 smscode 2023-02-23 04:20:40.000 1 ABC-123-1
ABC-123 acct balance 2023-02-23 04:21:40.000 1 ABC-123-1
ABC-123 logon 2023-02-23 04:54:40.000 2 ABC-123-2
ABC-123 transfer 2023-02-23 04:54:40.000 2 ABC-123-2
CDE-123 logon 2023-02-23 04:58:40.000 1 CDE-123-1

Please advise how I can achieve this in Athena


Solution

  • Use cumulative sum (using window functions) over page_name equal to 'logon':

    -- sample data
    with dataset (id, page_name, visited_time) as (
        values ('ABC-123',  'logon',    '2023-02-23 04:04:40.000'),
        ('ABC-123', 'smscode',  '2023-02-23 04:20:40.000'),
        ('ABC-123', 'acct balance', '2023-02-23 04:21:40.000'),
        ('ABC-123', 'logon',    '2023-02-23 04:54:40.000'),
        ('ABC-123', 'transfer', '2023-02-23 04:54:40.000'),
        ('CDE-123', 'logon',    '2023-02-23 04:58:40.000')
    )
    
    -- query
    select *,
      sum(if(page_name = 'logon', 1)) over(partition by id order by visited_time) 
          as session_id 
    from dataset;
    

    Output:

    id page_name visited_time session_id
    CDE-123 logon 2023-02-23 04:58:40.000 1
    ABC-123 logon 2023-02-23 04:04:40.000 1
    ABC-123 smscode 2023-02-23 04:20:40.000 1
    ABC-123 acct balance 2023-02-23 04:21:40.000 1
    ABC-123 logon 2023-02-23 04:54:40.000 2
    ABC-123 transfer 2023-02-23 04:54:40.000 2