Search code examples
sqldatehivehiveqlwindow-functions

Create Missing Data Hive SQL


I have a table that has an activity date of when things change such as

2020-08-13  123 Upgrade
2020-08-17  123 Downgrade
2020-08-21  123 Upgrade

Basically this in relation to a line there are 3 activities happening on this account. They have a basic account then they downgrade but then they upgrade again

I would like to have these happen in steps such as

2020-08-13  123 Upgrade1
2020-08-14  123 Upgrade1
2020-08-15  123 Upgrade1
2020-08-16  123 Upgrade1
2020-08-17  123 Downgrade1
2020-08-18  123 Downgrade1
2020-08-19  123 Downgrade1
2020-08-20  123 Downgrade1
2020-08-21  123 Upgrade2
.
.
.
2020-09-09 123 Upgrade2

Then I would like to partition them by their activity and see this in the end results so I can calculate how many users stayed in their downgraded state for more than 30 days to see their behavior compared to a upgraded change.

2020-08-13  123 Upgrade1. 1
2020-08-14  123 Upgrade1. 2
2020-08-15  123 Upgrade1. 3
2020-08-16  123 Upgrade1. 4
2020-08-17  123 Downgrade1. 1
2020-08-18  123 Downgrade1. 2
2020-08-19  123 Downgrade1. 3
2020-08-20  123 Downgrade1. 4
2020-08-21  123 Upgrade2.   1
.
.
.
2020-09-09 123 Upgrade2.  18

I have tried doing Coalesce then row_num but I can't wrap my head around how to partition out each activity based on when they changed their account status.


Solution

  • Generate rows using posexplode(split(space(datediff(next_date,activity_date)-1),' ')). Calculate new_group flag when previous activity<>current activity. Use analytic sum() to calculate group(partition) number. See comments in the code:

    with mydata as (
    select stack(3,
    '2020-08-13', 123, 'Upgrade',
    '2020-08-17', 123, 'Downgrade',
    '2020-08-21', 123, 'Upgrade'
    ) as (activity_date, account, activity)
    )
    
    --calculate row_number in account, activity
    select activity_date, account, activity, activity_partition,
           row_number() over(partition by account, activity_partition order by activity_date ) activity_partition_rn,
           count(*) over(partition by account, activity_partition )     days_on_activity
    from
    (--Calculate partition
    select activity_date, account, activity,
          concat(activity,
                 sum(new_group_flag) over(partition by account,  activity order by activity_date rows between unbounded preceding and current row)
                ) activity_partition
    from
    (--Calculate new group flag
    select activity_date, account, activity, 
            case when lag(activity) over (partition by account order by activity_date) = activity then 0 else 1 end as new_group_flag
    from
    ( --generate Date range
    select  date_add(activity_date,i) as activity_date, account, activity
    from
    ( --Get next_date to generate date range
    select activity_date, account, activity,
           lead(activity_date,1, activity_date) over (partition by account order by activity_date) next_date    
      from mydata d  
    ) s lateral view posexplode(split(space(datediff(next_date,activity_date)-1),' ')) e as i,x --generate rows
    )s
    )s
    )s
    order by activity_date;
    

    Result:

    activity_date    account    activity    activity_partition    activity_partition_rn    days_on_activity
    2020-08-13            123    Upgrade      Upgrade1                1                         4
    2020-08-14            123    Upgrade      Upgrade1                2                         4
    2020-08-15            123    Upgrade      Upgrade1                3                         4
    2020-08-16            123    Upgrade      Upgrade1                4                         4
    2020-08-17            123    Downgrade    Downgrade1              1                         4
    2020-08-18            123    Downgrade    Downgrade1              2                         4
    2020-08-19            123    Downgrade    Downgrade1              3                         4
    2020-08-20            123    Downgrade    Downgrade1              4                         4
    2020-08-21            123    Upgrade      Upgrade2                1                         1