I am trying to create new id based on logon page name. Whenever logon page_name is encountered, it will be considered as one session until it finds next logon page name. The new id will be generated based on session count for that id.
Sample data set:
id | page_name | visited_time |
---|---|---|
ABC-123 | logon | 2023-02-23 04:04:40.000 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 |
ABC-123 | logon | 2023-02-23 04:54:40.000 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 |
CDE-123 | logon | 2023-02-23 04:58:40.000 |
SQL used:
with session_cnt
as
(
select id, page_name, visited_time,
row_number() over (partition by id, page_name order by visited_time asc) as session_count,
visited_time
from table
)
select id, page_name, visited_time, session_count, array_join(array_agg(session_count), '-', 'null)) as new_id
from session_cnt
Above SQL seems to return new id based on other page names too. I would like to group the session count based on id and page_name=logon.
Expected result:
id | page_name | visited_time | session_count | new_id |
---|---|---|---|---|
ABC-123 | logon | 2023-02-23 04:04:40.000 | 1 | ABC-123-1 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 | 1 | ABC-123-1 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 | 1 | ABC-123-1 |
ABC-123 | logon | 2023-02-23 04:54:40.000 | 2 | ABC-123-2 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 | 2 | ABC-123-2 |
CDE-123 | logon | 2023-02-23 04:58:40.000 | 1 | CDE-123-1 |
Please advise how I can achieve this in Athena
Use cumulative sum (using window functions) over page_name
equal to 'logon'
:
-- sample data
with dataset (id, page_name, visited_time) as (
values ('ABC-123', 'logon', '2023-02-23 04:04:40.000'),
('ABC-123', 'smscode', '2023-02-23 04:20:40.000'),
('ABC-123', 'acct balance', '2023-02-23 04:21:40.000'),
('ABC-123', 'logon', '2023-02-23 04:54:40.000'),
('ABC-123', 'transfer', '2023-02-23 04:54:40.000'),
('CDE-123', 'logon', '2023-02-23 04:58:40.000')
)
-- query
select *,
sum(if(page_name = 'logon', 1)) over(partition by id order by visited_time)
as session_id
from dataset;
Output:
id | page_name | visited_time | session_id |
---|---|---|---|
CDE-123 | logon | 2023-02-23 04:58:40.000 | 1 |
ABC-123 | logon | 2023-02-23 04:04:40.000 | 1 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 | 1 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 | 1 |
ABC-123 | logon | 2023-02-23 04:54:40.000 | 2 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 | 2 |