Looking for some assistance as I'm getting two observation in the output and I'm expecting one, think its due to the multiple repeating date stamps.
id did rid date_pulled s created_at date time
1 123 123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-11 07:48:18.692
1 123 123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-19 02:26:26.836
2 123 123 2020-05-07 20:50:49.818 2 2020-05-07 20:50:49.818 2021-03-19 00:09:36.199 2023-04-19 02:26:26.836
1 123 123 2020-05-07 20:50:49.155 1 2020-05-07 20:50:49.155 2021-03-19 00:09:36.199 2023-04-21 11:54:11.132
2 123 123 2020-05-07 20:50:49.818 2 2020-05-07 20:50:49.818 2021-03-19 00:09:36.199 2023-04-21 11:54:11.132
I'm expecting a single observation output with the following 3 columns, could not squeeze score column into the example data above. Working code below.
select date_pulled, score, did
from (
select *, max(time) over (partition by did) max_time
from grs_datalake_prod_distill_lexis_nexis_db.credit_report_detail
)
where max_time = time
If you will add time
to select you will see that you have two rows with the same time
:
-- sample data
WITH dataset(id, did, rid, date_pulled, s, created_at, date, time) AS (
values (1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-11 07:48:18.692'),
(1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-19 02:26:26.836'),
(2, 123, 123, '2020-05-07 20:50:49.818', 2, '2020-05-07 20:50:49.818', '2021-03-19 00:09:36.199', '2023-04-19 02:26:26.836'),
(1, 123, 123, '2020-05-07 20:50:49.155', 1, '2020-05-07 20:50:49.155', '2021-03-19 00:09:36.199', '2023-04-21 11:54:11.132'),
(2, 123, 123, '2020-05-07 20:50:49.818', 2, '2020-05-07 20:50:49.818', '2021-03-19 00:09:36.199', '2023-04-21 11:54:11.132')
)
-- query
select date_pulled, s, did, time
from (
select *, max(time) over (partition by did) max_time
from dataset
)
where max_time = time;
Output:
date_pulled | s | did | time |
---|---|---|---|
2020-05-07 20:50:49.155 | 1 | 123 | 2023-04-21 11:54:11.132 |
2020-05-07 20:50:49.818 | 2 | 123 | 2023-04-21 11:54:11.132 |
So you need some kind of tie-breaker here. You can use "random" (not truly random of course) following the row_number
approach:
select date_pulled, s, did, time
from (
select *,
-- can add extra sorting
-- like date_pulled desc to order by
-- to serve as tiebreaker
row_number() over (partition by did order by time desc) rn -- order by time desc, date_pulled desc
from dataset
)
where rn = 1;
Output:
date_pulled | s | did | time |
---|---|---|---|
2020-05-07 20:50:49.155 | 1 | 123 | 2023-04-21 11:54:11.132 |
Or similar using group by
+ max_by
:
select did, max_by(date_pulled, time) date_pulled, max_by(s, time) s
from dataset
group by did;
did | date_pulled | s |
---|---|---|
123 | 2020-05-07 20:50:49.155 | 1 |