Search code examples
sqlcaseamazon-athenaprestodate-arithmetic

Athena - Add a Row with 0 for All Columns if a Value is Not Present


I am trying to write an Athena query that will create pre and post data that can be used for analysis after a marketing e-mail went out.

After the code is ran, I am getting the following results because I have no idea how to tell Athena that if a member doesn't have a transaction during a particular time period, add that row in anyway with their age and just put zero for "net_amount" and "total_transactions":

customer_ID age net_amount total_transactions Period
ABCDRF 25 85.0 2 Post
HDWSAZ 37 23.98 1 Pre
HDWSAZ 37 56.34 2 Post
OAYDI 42 8.2 1 Pre
OAYDI 42 98.29 4 Post
IUSTRA 31 88.62 1 Pre
MAHRYS 56 34.16 2 Pre

As you can see, this table above does not have a pre period row for "ABCDRF" and does not have a post period row for either "IUSTRA" and "MAHRYS" but I want to add that in with zeros for the "net_amount" and "total_transactions" and the age inserted with the right value for the Period that is missing (pre or post).

The code I am using is:

select distinct(customer_ID) as customer_ID, age, sum(net_amount) as net_amount, sum(lineitem) as total_visits, (case 
        when trans_date >= (current_timestamp - interval '18' month) AND trans_date < (current_timestamp - interval '6' month) then 'Post'
        when trans_date >= (current_timestamp - interval '30' month) AND trans_date < (current_timestamp - interval '18' month) then 'Pre'
        else 'nope'
        end) as Period
from trans_history
where trans_date >= (current_timestamp - interval '30' month) AND trans_date < (current_timestamp - interval '6' month)
group by 1, 5
order by 1 desc;

What I want:

customer_ID age net_amount total_transactions Period
ABCDRF 25 0 0 Pre
ABCDRF 25 85.0 2 Post
HDWSAZ 37 23.98 1 Pre
HDWSAZ 37 56.34 2 Post
OAYDI 42 8.2 1 Pre
OAYDI 42 98.29 4 Post
IUSTRA 31 88.62 1 Pre
IUSTRA 31 0 0 Post
MAHRYS 56 34.16 2 Pre
MAHRYS 56 0 0 Post

Any idea how I can get these rows added in with this code in Athena? I feel like it is just something I need to add in the "case when" statement but I can't figure it out.

Thanks! Mark


Solution

  • The case expression is not sufficient here, since some customers do not have data for both periods.

    Instead, one approach is to cross join the list of customers with the fixed, predefined periods in order to generate all possible combinations. Then, we can bring the table with a left join, then aggregate:

    select c.customer_id, c.age, p.period
        coalesce(sum(net_amount), 0) as net_amount, 
        coalesce(sum(lineitem)  , 0) as total_visits
    from (select distinct customer_id, age from trans_history) c
    cross join (
        select 'Post' as period
            current_timestamp - interval '18' month as start_date, 
            current_timestamp - interval '6'  month as end_date
        union all
        select 'Pre', 
            current_timestamp - interval '30' month, 
            current_timestamp - interval '18' month
    ) p
    left join trans_history t
        on  t.customer_id = c.customer_id
        and t.trans_date >= p.start_date
        and t.trans_date <  p.end_date
    group by c.customer_id, c.age, p.period