There have been a few questions like this, with no answer, like this one here.
I thought I would post another in hopes of getting one.
I have a hive table with duplicate rows. Consider the following example:
*ID Date value1 value2*
1001 20160101 alpha beta
1001 20160201 delta gamma
1001 20160115 rho omega
1002 20160101 able charlie
1002 20160101 able charlie
When complete, I only want two records. Specifically, these two:
*ID Date value1 value2*
1001 20160201 delta gamma
1002 20160101 able charlie
Why those two? For the ID=1001, I want the latest date and the data that is in that row with it. For the ID=1002, really the same answer, but the two records with that ID are complete duplicates, and I only want one.
So, any suggestions on how to do this? The simple "group by" using the ID and the 'max' date won't work, as that ignores the other columns. I cannot put 'max' on those, as it will pull the max columns from all the records (will pull 'rho' from an older record), which is not good.
I hope my explanation is clear, and I appreciate any insight.
Thank you
Try this:
WITH temp_cte AS (
SELECT mt.ID AS ID
, mt.Date AS Date
, mt.value1 AS value1
, mt.value2 AS value2
, ROW_NUMBER() OVER (PARTITION BY mt.ID ORDER BY mt.Date DESC) AS row_num
FROM my_table mt
)
SELECT tc.ID AS ID
, tc.Date AS Date
, tc.value1 AS value1
, tc.value2 AS value2
FROM temp_cte tc
WHERE tc.row_num = 1
;
Or you can do MAX() and join the table to itself where ID = ID and max_date = Date. HTH.
Edit March 2022: Since ROW_NUMBER numbers every row and the user only cares about 1 row with the max date there's a better way to do this I discovered.
WITH temp_cte AS (
SELECT mt.ID AS ID
, MAX(NAMED_STRUCT('Date', mt.Date, 'Value1', mt.value1, 'Value2', mt.Value2)) AS my_struct
FROM my_table mt
GROUP BY mt.ID
)
SELECT tt.ID AS ID
, tt.my_struct.Date AS Date
, tt.my_struct.Value1 AS Value1
, tt.my_struct.Value2 AS Value2
FROM temp_cte tt
;