Search code examples
sqlhivecounthiveqlwindow-functions

Unique cumulative customers by each day


Task: Get the total number of unique cumulative customers by each decline reason and by each day.

    Input data sample:
    +---------+--------------+------------+------+
    | Cust_Id |  Decline_Dt  |  Reason    | Days |
    +---------+--------------+------------+------+
    |   A     |  08-09-2020  |  Reason_1  |   0  |
    |   A     |  08-09-2020  |  Reason_1  |   1  | 
    |   A     |  08-09-2020  |  Reason_1  |   2  |
    |   A     |  08-09-2020  |  Reason_1  |   4  |
    |   B     |  08-09-2020  |  Reason_1  |   0  |
    |   B     |  08-09-2020  |  Reason_1  |   2  |
    |   B     |  08-09-2020  |  Reason_1  |   3  |
    |   C     |  08-09-2020  |  Reason_1  |   1  |
    +---------+--------------+------------+------+   
1) Decline_dt - The date on which the payment was declined. (Ignore it for this task)
2) Days - Indicates the # of days after the payment decline happened, the customer interacted with IVR channel. 
3) Reason - Indicates the payment decline reason 
    
    --Expected Output:
    +---------------+-----------+---------------+----------------------------+
    |   Reason      |   Days    | Unique_mtns   | total_cumulative_customers |
    +---------------+-----------+---------------+----------------------------+
    |   Reason_1    |   0       |   2           |           2                |              
    |   Reason_1    |   1       |   2           |           3                |
    |   Reason_1    |   2       |   2           |           3                | 
    |   Reason_1    |   3       |   1           |           3                | 
    |   Reason_1    |   4       |   1           |           3                | 
    +------------------------------------------------------------------------+

My Hive query:

select a.Reason
        , a.days
        -- , count(distinct a.cust_id) as unique_mtns
        , count(distinct a.cust_id) over (partition by Reason 
                                                order by a.days rows between unbounded preceding and current row) 
                                        as total_cumulative_customers
from table as a  
group by a.reason
        , a.days

Output (Incorrect):

+---------------+-----------+----------------------------+
|   Reason      |   Days    | total_cumulative_customers |
+---------------+-----------+----------------------------+
|   Reason_1    |   0       |               2            |              
|   Reason_1    |   1       |               2            |
|   Reason_1    |   2       |               2            | 
|   Reason_1    |   3       |               1            | 
|   Reason_1    |   4       |               1            | 
+--------------------------------------------------------+

Ideally, I would expect the window function to be executed without group by. However, I get an error without group by. When I use group by, I don't get the cumulative customers.


Solution

  • If I follow you correctly, you can use a subquery to compute the first day per customer/reason tuple, and then do conditional aggregation:

    select reason, days, 
        count(distinct cust_id) as unique_mtns,
        sum(sum(case when days = min_days then 1 else 0 end)) 
            over(partition by reason order by days) as total_cumulative_customers
    from (
        select reason, cust_id, 
            min(days) over(partition by reason, cust_id) as min_days
        from mytable
    ) t
    group by reason, days