Search code examples
sqlsubquerywindow-functionsprestotrino

Looking to get a single observation output


Looking for some assistance as I'm getting two observation in the output and I'm expecting one, think its due to the multiple repeating date stamps.

id did  rid date_pulled             s created_at              date                    time          
1  123  123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-11 07:48:18.692 
1  123  123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-19 02:26:26.836
2  123  123 2020-05-07 20:50:49.818 2 2020-05-07 20:50:49.818 2021-03-19 00:09:36.199 2023-04-19 02:26:26.836
1  123  123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-21 11:54:11.132
2  123  123 2020-05-07 20:50:49.818 2 2020-05-07 20:50:49.818 2021-03-19 00:09:36.199 2023-04-21 11:54:11.132

I'm expecting a single observation output with the following 3 columns, could not squeeze score column into the example data above. Working code below.

select  date_pulled, score, did
from (
    select *,  max(time) over (partition by did) max_time
    from grs_datalake_prod_distill_lexis_nexis_db.credit_report_detail
)
where max_time = time

Solution

  • If you will add time to select you will see that you have two rows with the same time:

    -- sample data
    WITH dataset(id, did, rid, date_pulled, s, created_at, date, time)     AS (
        values (1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-11 07:48:18.692'),
            (1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-19 02:26:26.836'),
            (2, 123, 123, '2020-05-07 20:50:49.818', 2, '2020-05-07 20:50:49.818', '2021-03-19 00:09:36.199', '2023-04-19 02:26:26.836'),
            (1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-21 11:54:11.132'),
            (2, 123, 123, '2020-05-07 20:50:49.818', 2, '2020-05-07 20:50:49.818', '2021-03-19 00:09:36.199', '2023-04-21 11:54:11.132')
    )
    
    -- query
    select  date_pulled, s, did, time
    from (
        select *,  max(time) over (partition by did) max_time
        from dataset
    )
    where max_time = time;
    

    Output:

    date_pulled s did time
    2020-05-07 20:50:49.155 1 123 2023-04-21 11:54:11.132
    2020-05-07 20:50:49.818 2 123 2023-04-21 11:54:11.132

    So you need some kind of tie-breaker here. You can use "random" (not truly random of course) following the row_number approach:

    select  date_pulled, s, did, time
    from (
        select *,
            -- can add extra sorting
            -- like date_pulled desc to order by 
            -- to serve as tiebreaker
            row_number() over (partition by did order by time desc) rn -- order by time desc, date_pulled desc
        from dataset
    )
    where rn = 1;
    

    Output:

    date_pulled s did time
    2020-05-07 20:50:49.155 1 123 2023-04-21 11:54:11.132

    Or similar using group by + max_by:

    select  did, max_by(date_pulled, time) date_pulled, max_by(s, time) s 
    from dataset
    group by did;
    
    did date_pulled s
    123 2020-05-07 20:50:49.155 1