Search code examples
sqlpysparkgroup-bywindow-functionsdelta-lake

SQL Query group by to find EffectiveFrom and EffectiveTo Dates for repeating key value pair


I have a table with a snapshot of repeated data for a key and value pair with record_timestamps. The dataset can also contain late-arriving historical record_timestamps as well. I want to generate an output with an EffectiveFrom and EffectiveTo timestamp for each combination of key and value, where the validity closes when the value changes and a new record is created for a new combination. Note: The same values can reappear after an interval with a different record_timestamp.

Snapshot of repeated data

create table equipments_staging
(
    equipment string,
    item string,
    value string,
    timestamp   timestamp
)

insert into equipments_staging
values 
('M','X','1','2023-11-10 13:00')
,('M','X','2','2023-11-11 13:00')
,('M','X','2','2023-11-12 13:00')
,('M','X','1','2023-11-13 13:00')
,('M','X','1','2023-11-14 13:00')
,('M','X','1','2023-11-15 13:00')
,('M','X','2','2023-11-16 13:00')
,('M','X','2','2023-11-17 13:00')
,('M','X','1','2023-11-18 13:00')

enter image description here

Expected Outcome

enter image description here

I have tried the use lead and lag partitioned by the keys and ordered by the timestamp for each record to find the prev and next value and use this to find a way to group sets. However, this does not work as I am missing something since the same value can reappear after an interval, which should not be grouped together.

Select equipment,item,value,
LAG(value,1) OVER (PARTITION BY equipment,item ORDER BY timestamp) AS prev,
LEAD(value,1) OVER (PARTITION BY equipment,item ORDER BY timestamp) AS next,
coalesce(LEAD(timestamp,1),'9999-12-31') OVER (PARTITION BY equipment,item ORDER BY timestamp) AS next_timestamp,
timestamp
from repeated_data

Solution

  • Try this:

    with add_grp_id as (
    SELECT *
              , rank() OVER (PARTITION BY equipment, item ORDER BY timestamp asc) 
                - rank() OVER (PARTITION BY equipment, item, value ORDER BY timestamp) AS GroupId
           FROM equipments_staging
        )
        select equipment, item, value
        ,min(timestamp) as effective_from
        ,max(timestamp) as effective_to
        from add_grp_id
        group by equipment, item, value, GroupId
        order by equipment, item, effective_from
        ;