Search code examples
sqlamazon-athenaprestotrino

Athena calculate count per month from a date to current timestamp


I have the following table:

child_id child_dob parent_id parent_dob
1 2021-01-04 1 2021-01-01
2 2021-01-30 1 2021-01-01
3 2021-03-10 2 2021-01-15
4 2021-04-13 2 2021-01-15

I'm trying to get for every month since the parents DOB how many children were born in that month as well as date when the first child was born

so the final output should be :

month parent_id count_of_children_in_month date_first_child
2021-01-01 1 2 2021-01-04
2021-02-01 1 0 2021-01-04
2021-03-01 1 0 2021-01-04
. . . .
. . . .
. . . .
. . . .
2021-01-01 2 0 2021-03-10
2021-02-01 2 0 2021-03-10
2021-03-01 2 1 2021-03-10
2021-04-01 2 1 2021-03-10
. . . .

what i have so far is a partition by the parents_dob with date_trunc but not really finding any good way to do a continuous month added until curr_timestamp. Drawing a complete blank as on how to proceed with getting the window to compute a month at a time and how to increment this till current_timestamp

select count(dd.child_id) over w as count_of_children_in_month,
    parent_dob,
    min(dd.child_dob) over w as first_child_dob
from "awsdatacatalog"."stackoverflow"."desired_data" as dd
 window w as (
        partition by dd.parent_dob between date_trunc('month', dd.parent_dob) and current_timestamp
    )

I'm using Athena so i have all the trino functions available to me.


Solution

  • You can do something like the following:

    -- sample data
    WITH dataset (child_id, child_dob, parent_id, parent_dob) AS (
        values (1,  date '2021-01-04',  1,  date '2021-01-01'),
            (2, date '2021-01-30',  1,  date '2021-01-01'),
            (3, date '2021-03-10',  2,  date '2021-01-15'),
            (4, date '2021-04-13',  2,  date '2021-01-15')
    ),
    
    -- query
    -- generate all month in range, note that there is limit of 10k elements for sequence
    dates as (
        select *
        from unnest(sequence(date '2021-01-01',
            date '2021-05-01', -- current_date
            interval '1' month)) as t(dt)
    ),
    -- generate all month/parents pairs
    all_dates_parents as (
        select *
        from dates d
        cross join (select distinct parent_id from dataset) t
    )
    
    -- generate the result by join and aggregating
    select r.*
         , count(child_id) count_of_children_in_month
         , min(min(ds.child_dob)) over(partition by r.parent_id) date_first_child
    from all_dates_parents r
    left join dataset ds on r.parent_id = ds.parent_id and r.dt = date_trunc('month', child_dob)
    group by r.dt, r.parent_id
    order by r.parent_id, r.dt;
    

    Output:

    dt parent_id count_of_children_in_month date_first_child
    2021-01-01 1 2 2021-01-04
    2021-02-01 1 0 2021-01-04
    2021-03-01 1 0 2021-01-04
    2021-04-01 1 0 2021-01-04
    2021-05-01 1 0 2021-01-04
    2021-01-01 2 0 2021-03-10
    2021-02-01 2 0 2021-03-10
    2021-03-01 2 1 2021-03-10
    2021-04-01 2 1 2021-03-10
    2021-05-01 2 0 2021-03-10