I have the following table:
child_id | child_dob | parent_id | parent_dob |
---|---|---|---|
1 | 2021-01-04 | 1 | 2021-01-01 |
2 | 2021-01-30 | 1 | 2021-01-01 |
3 | 2021-03-10 | 2 | 2021-01-15 |
4 | 2021-04-13 | 2 | 2021-01-15 |
I'm trying to get for every month since the parents DOB how many children were born in that month as well as date when the first child was born
so the final output should be :
month | parent_id | count_of_children_in_month | date_first_child |
---|---|---|---|
2021-01-01 | 1 | 2 | 2021-01-04 |
2021-02-01 | 1 | 0 | 2021-01-04 |
2021-03-01 | 1 | 0 | 2021-01-04 |
. | . | . | . |
. | . | . | . |
. | . | . | . |
. | . | . | . |
2021-01-01 | 2 | 0 | 2021-03-10 |
2021-02-01 | 2 | 0 | 2021-03-10 |
2021-03-01 | 2 | 1 | 2021-03-10 |
2021-04-01 | 2 | 1 | 2021-03-10 |
. | . | . | . |
what i have so far is a partition by the parents_dob with date_trunc but not really finding any good way to do a continuous month added until curr_timestamp. Drawing a complete blank as on how to proceed with getting the window to compute a month at a time and how to increment this till current_timestamp
select count(dd.child_id) over w as count_of_children_in_month,
parent_dob,
min(dd.child_dob) over w as first_child_dob
from "awsdatacatalog"."stackoverflow"."desired_data" as dd
window w as (
partition by dd.parent_dob between date_trunc('month', dd.parent_dob) and current_timestamp
)
I'm using Athena so i have all the trino functions available to me.
You can do something like the following:
-- sample data
WITH dataset (child_id, child_dob, parent_id, parent_dob) AS (
values (1, date '2021-01-04', 1, date '2021-01-01'),
(2, date '2021-01-30', 1, date '2021-01-01'),
(3, date '2021-03-10', 2, date '2021-01-15'),
(4, date '2021-04-13', 2, date '2021-01-15')
),
-- query
-- generate all month in range, note that there is limit of 10k elements for sequence
dates as (
select *
from unnest(sequence(date '2021-01-01',
date '2021-05-01', -- current_date
interval '1' month)) as t(dt)
),
-- generate all month/parents pairs
all_dates_parents as (
select *
from dates d
cross join (select distinct parent_id from dataset) t
)
-- generate the result by join and aggregating
select r.*
, count(child_id) count_of_children_in_month
, min(min(ds.child_dob)) over(partition by r.parent_id) date_first_child
from all_dates_parents r
left join dataset ds on r.parent_id = ds.parent_id and r.dt = date_trunc('month', child_dob)
group by r.dt, r.parent_id
order by r.parent_id, r.dt;
Output:
dt | parent_id | count_of_children_in_month | date_first_child |
---|---|---|---|
2021-01-01 | 1 | 2 | 2021-01-04 |
2021-02-01 | 1 | 0 | 2021-01-04 |
2021-03-01 | 1 | 0 | 2021-01-04 |
2021-04-01 | 1 | 0 | 2021-01-04 |
2021-05-01 | 1 | 0 | 2021-01-04 |
2021-01-01 | 2 | 0 | 2021-03-10 |
2021-02-01 | 2 | 0 | 2021-03-10 |
2021-03-01 | 2 | 1 | 2021-03-10 |
2021-04-01 | 2 | 1 | 2021-03-10 |
2021-05-01 | 2 | 0 | 2021-03-10 |