Search code examples
sqlprestooverlappingtrino

How to find overlapping records and select the latest record?


I am trying to get on record from a list of overlapping records in Presto(Trillo) DB. There are columns like loc_id, prod_id, line_id, start_dt, end_dt, update_dt.

The records are like this

( source records)

loc_id prod_id line_id qty start_dt end_dt upd_dt
200 10 1 500 30-1-2025:9:00 30-1-2025:15:00 30-1-2025:15:05
200 10 1 600 30-1-2025:11:00 30-1-2025:16:00 30-1-2025:16:05
200 10 1 550 30-1-2025:12:00 30-1-2025:18:00 30-1-2025:18:05
200 10 1 350 30-1-2025:18:00 30-1-2025:22:00 30-1-2025:22:05

There is no specific key column. In the above data, the first 3 records overlap based on start and end date timestamp. But the start date of 4th record is the end date of 3rd record.

As an output I want to select only the latest record from the first 3 records and the 4th record.
So select the latest record from the overlapping records.
If the records are not overlapping , select that records.

Expected output

loc_id prod_id line_id qty start_dt end_dt upd_dt
200 10 1 550 30-1-2025:12:00 30-1-2025:18:00 30-1-2025:18:05
200 10 1 350 30-1-2025:18:00 30-1-2025:22:00 30-1-2025:22:05

select 
loc_id
prod_id, 
line_id, 
start_dt,
end_dt, 
qty, 
upd_dt
from
(select *, 
rank() over (partition by a.loc_id , 
                          a.prod_id, 
                          a.line_id 
                          order by  cast(upd_dt as timestamp )desc) as rnk
from 
tbl a
) tt
--where tt.rnk = 1
order by tt.rnk

I tried using a rank function but it is difficult to rank it with time stamp. The overlapping records are an issue.

A similar sql as above. I cannot display exact SQL or data due to restrictions


Solution

  • If I understand the task correctly then you can rephrase it as "I want data split into groups in a way that for every row previous row ended later then the current one and from each group I need the latest record". Then you can use approach derived from gaps-and-islands:

    -- sample data
    WITH dataset(loc_id, prod_id, line_id, qty, start_dt, end_dt, upd_dt) as (
        values (200, 10, 1, 500, timestamp '2025-01-25 09:00', timestamp '2025-01-30 15:00', '30-1-2025:15:05'),
            (200, 10, 1, 600, timestamp '2025-01-25 11:00', timestamp '2025-01-30 16:00', '30-1-2025:16:05'),
            (200, 10, 1, 550, timestamp '2025-01-25 12:00', timestamp '2025-01-30 18:00', '30-1-2025:18:05'),
            (200, 10, 1, 350, timestamp '2025-01-25 18:00', timestamp '2025-01-30 22:00', '30-1-2025:22:05')
    ),
    
    -- query parts
    with_leads as (
        select *,
               lead(start_dt) over (partition by loc_id, prod_id, line_id order by start_dt) next_start -- determine when next row starts
        from dataset
    ),
    with_groups as (
        select *,
            sum(if(next_start < end_dt, 0, 1)) over (partition by loc_id, prod_id, line_id order by start_dt) grp -- form groups
        from with_leads
    ),
    with_rns as(
        select *,
           row_number() over(partition by loc_id, prod_id, line_id, grp order by start_dt desc) rn -- prepare to select latest started in the group
        from with_groups
    )
    
    select loc_id, prod_id, line_id, qty, start_dt, end_dt, upd_dt
    from with_rns
    where rn = 1;
    

    Output:

    loc_id prod_id line_id qty start_dt end_dt upd_dt
    200 10 1 550 2025-01-25 12:00:00 2025-01-30 18:00:00 30-1-2025:18:05
    200 10 1 350 2025-01-25 18:00:00 2025-01-30 22:00:00 30-1-2025:22:05

    Note that this will break if you have gaps in end data (i.e. if you change the end_dt for second row to '2025-01-25 11:30')