I am trying to get on record from a list of overlapping records in Presto(Trillo) DB. There are columns like loc_id, prod_id, line_id, start_dt, end_dt, update_dt.
The records are like this
( source records)
loc_id | prod_id | line_id | qty | start_dt | end_dt | upd_dt |
---|---|---|---|---|---|---|
200 | 10 | 1 | 500 | 30-1-2025:9:00 | 30-1-2025:15:00 | 30-1-2025:15:05 |
200 | 10 | 1 | 600 | 30-1-2025:11:00 | 30-1-2025:16:00 | 30-1-2025:16:05 |
200 | 10 | 1 | 550 | 30-1-2025:12:00 | 30-1-2025:18:00 | 30-1-2025:18:05 |
200 | 10 | 1 | 350 | 30-1-2025:18:00 | 30-1-2025:22:00 | 30-1-2025:22:05 |
There is no specific key column. In the above data, the first 3 records overlap based on start and end date timestamp. But the start date of 4th record is the end date of 3rd record.
As an output I want to select only the latest record from the first 3 records and the 4th record.
So select the latest record from the overlapping records.
If the records are not overlapping , select that records.
Expected output
loc_id | prod_id | line_id | qty | start_dt | end_dt | upd_dt |
---|---|---|---|---|---|---|
200 | 10 | 1 | 550 | 30-1-2025:12:00 | 30-1-2025:18:00 | 30-1-2025:18:05 |
200 | 10 | 1 | 350 | 30-1-2025:18:00 | 30-1-2025:22:00 | 30-1-2025:22:05 |
select
loc_id
prod_id,
line_id,
start_dt,
end_dt,
qty,
upd_dt
from
(select *,
rank() over (partition by a.loc_id ,
a.prod_id,
a.line_id
order by cast(upd_dt as timestamp )desc) as rnk
from
tbl a
) tt
--where tt.rnk = 1
order by tt.rnk
I tried using a rank function but it is difficult to rank it with time stamp. The overlapping records are an issue.
A similar sql as above. I cannot display exact SQL or data due to restrictions
If I understand the task correctly then you can rephrase it as "I want data split into groups in a way that for every row previous row ended later then the current one and from each group I need the latest record". Then you can use approach derived from gaps-and-islands:
-- sample data
WITH dataset(loc_id, prod_id, line_id, qty, start_dt, end_dt, upd_dt) as (
values (200, 10, 1, 500, timestamp '2025-01-25 09:00', timestamp '2025-01-30 15:00', '30-1-2025:15:05'),
(200, 10, 1, 600, timestamp '2025-01-25 11:00', timestamp '2025-01-30 16:00', '30-1-2025:16:05'),
(200, 10, 1, 550, timestamp '2025-01-25 12:00', timestamp '2025-01-30 18:00', '30-1-2025:18:05'),
(200, 10, 1, 350, timestamp '2025-01-25 18:00', timestamp '2025-01-30 22:00', '30-1-2025:22:05')
),
-- query parts
with_leads as (
select *,
lead(start_dt) over (partition by loc_id, prod_id, line_id order by start_dt) next_start -- determine when next row starts
from dataset
),
with_groups as (
select *,
sum(if(next_start < end_dt, 0, 1)) over (partition by loc_id, prod_id, line_id order by start_dt) grp -- form groups
from with_leads
),
with_rns as(
select *,
row_number() over(partition by loc_id, prod_id, line_id, grp order by start_dt desc) rn -- prepare to select latest started in the group
from with_groups
)
select loc_id, prod_id, line_id, qty, start_dt, end_dt, upd_dt
from with_rns
where rn = 1;
Output:
loc_id | prod_id | line_id | qty | start_dt | end_dt | upd_dt |
---|---|---|---|---|---|---|
200 | 10 | 1 | 550 | 2025-01-25 12:00:00 | 2025-01-30 18:00:00 | 30-1-2025:18:05 |
200 | 10 | 1 | 350 | 2025-01-25 18:00:00 | 2025-01-30 22:00:00 | 30-1-2025:22:05 |
Note that this will break if you have gaps in end data (i.e. if you change the end_dt
for second row to '2025-01-25 11:30'
)