sql subquery window-functions presto trino

Looking to get a single observation output

Looking for some assistance as I'm getting two observation in the output and I'm expecting one, think its due to the multiple repeating date stamps.

id did  rid date_pulled             s created_at              date                    time          
1  123  123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-11 07:48:18.692 
1  123  123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-19 02:26:26.836
2  123  123 2020-05-07 20:50:49.818 2 2020-05-07 20:50:49.818 2021-03-19 00:09:36.199 2023-04-19 02:26:26.836
1  123  123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-21 11:54:11.132
2  123  123 2020-05-07 20:50:49.818 2 2020-05-07 20:50:49.818 2021-03-19 00:09:36.199 2023-04-21 11:54:11.132

I'm expecting a single observation output with the following 3 columns, could not squeeze score column into the example data above. Working code below.

select  date_pulled, score, did
from (
    select *,  max(time) over (partition by did) max_time
    from grs_datalake_prod_distill_lexis_nexis_db.credit_report_detail
)
where max_time = time

Solution

If you will add time to select you will see that you have two rows with the same time:

-- sample data
WITH dataset(id, did, rid, date_pulled, s, created_at, date, time)     AS (
    values (1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-11 07:48:18.692'),
        (1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-19 02:26:26.836'),
        (2, 123, 123, '2020-05-07 20:50:49.818', 2, '2020-05-07 20:50:49.818', '2021-03-19 00:09:36.199', '2023-04-19 02:26:26.836'),
        (1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-21 11:54:11.132'),
        (2, 123, 123, '2020-05-07 20:50:49.818', 2, '2020-05-07 20:50:49.818', '2021-03-19 00:09:36.199', '2023-04-21 11:54:11.132')
)

-- query
select  date_pulled, s, did, time
from (
    select *,  max(time) over (partition by did) max_time
    from dataset
)
where max_time = time;

Output:

date_pulled	s	did	time
2020-05-07 20:50:49.155	1	123	2023-04-21 11:54:11.132
2020-05-07 20:50:49.818	2	123	2023-04-21 11:54:11.132

So you need some kind of tie-breaker here. You can use "random" (not truly random of course) following the row_number approach:

select  date_pulled, s, did, time
from (
    select *,
        -- can add extra sorting
        -- like date_pulled desc to order by 
        -- to serve as tiebreaker
        row_number() over (partition by did order by time desc) rn -- order by time desc, date_pulled desc
    from dataset
)
where rn = 1;

Output:

date_pulled	s	did	time
2020-05-07 20:50:49.155	1	123	2023-04-21 11:54:11.132

Or similar using group by + max_by:

select  did, max_by(date_pulled, time) date_pulled, max_by(s, time) s 
from dataset
group by did;

did	date_pulled	s
123	2020-05-07 20:50:49.155	1