Search code examples
sqlprestogaps-and-islandstrino

splitting start and end date to start_date / first day of month and end_date / last day of month


start_date end_date
2018-02-21 2018-03-21
2018-03-21 2018-04-21
2018-04-21 2018-05-21
2018-05-21 2018-06-21

and I would like to turn it into:

start_date end_date
2018-02-21 2018-02-28
2018-03-01 2018-03-31
2018-04-01 2018-04-30
2018-05-01 2018-05-31
2018-06-01 2018-06-21

Solution

  • You may use this approach:

    1. Collapse consequtive intervals into one. I prefer to use match_recognize for it.
    2. Generate as much rows as there are months between start and end date of each interval: use sequence function in the correlated subquery.
    3. For each row decide what is the start/end dates based on the interval.

    Below is a query:

    select
      grouped.id, /*Common key for group of rows*/
      grouped.grp, /*Sequence number for collapsed interval per ID*/
      cast(greatest(grouped.frame_from, mon.dt) as date) as date_start,
      least(grouped.frame_to, last_day_of_month(mon.dt)) as date_end
    from t
    match_recognize (
      partition by id /*Restart grouping at each ID*/
      order by start_date asc
      measures
        match_number() as grp,
        /*start_date of collapsed interval is the start_date of the first match*/
        first(start_date) as frame_from,
        /*end_date - end_date of the last match*/
        last(end_date) as frame_to
      /*Match group is any row followed by any number of consequtive intervals*/
      pattern (any_ conseq*)
      define
        /*Add row to a group if start_date is equal to end_date of the previous row*/
        conseq as start_date = prev(end_date)
    ) as grouped
      /*Generate rows for each grouped interval*/
      cross join unnest(sequence(
        cast(date_trunc('month', frame_from) as timestamp),
        cast(frame_to as timestamp),
        interval '1' month
      )) as mon(dt)
    order by 1,2 asc,3 asc
    

    which for this sample data

    with t(id, start_date, end_date) as (
      values
      (1, date '2018-02-21', date '2018-03-21'),
      (1, date '2018-03-21', date '2018-04-21'),
      (1, date '2018-04-21', date '2018-05-21'),
      (1, date '2018-05-21', date '2018-06-21'),
      (1, date '2018-06-25', date '2018-08-23'),
      (2, date '2022-01-15', date '2022-02-15'),
      (2, date '2022-02-15', date '2022-03-15')
    )
    

    returns this result in AWS Athena 3 (which is based on Trino):

    # id grp date_start date_end
    1 1 1 2018-02-21 2018-02-28
    2 1 1 2018-03-01 2018-03-31
    3 1 1 2018-04-01 2018-04-30
    4 1 1 2018-05-01 2018-05-31
    5 1 1 2018-06-01 2018-06-21
    6 1 2 2018-06-25 2018-06-30
    7 1 2 2018-07-01 2018-07-31
    8 1 2 2018-08-01 2018-08-23
    9 2 1 2022-01-15 2022-01-31
    10 2 1 2022-02-01 2022-02-28
    11 2 1 2022-03-01 2022-03-15