Hive - Remove duplicates, keeping newest record - all of it

There have been a few questions like this, with no answer, like this one here.

I thought I would post another in hopes of getting one.

I have a hive table with duplicate rows. Consider the following example:

*ID             Date           value1         value2*
1001            20160101       alpha          beta
1001            20160201       delta          gamma
1001            20160115       rho            omega
1002            20160101       able           charlie
1002            20160101       able           charlie

When complete, I only want two records. Specifically, these two:

*ID             Date           value1         value2*
1001            20160201       delta          gamma
1002            20160101       able           charlie

Why those two? For the ID=1001, I want the latest date and the data that is in that row with it. For the ID=1002, really the same answer, but the two records with that ID are complete duplicates, and I only want one.

So, any suggestions on how to do this? The simple "group by" using the ID and the 'max' date won't work, as that ignores the other columns. I cannot put 'max' on those, as it will pull the max columns from all the records (will pull 'rho' from an older record), which is not good.

I hope my explanation is clear, and I appreciate any insight.

Thank you

Solution

Try this:

WITH temp_cte AS (
        SELECT      mt.ID                                                           AS ID 
                    , mt.Date                                                       AS Date 
                    , mt.value1                                                     AS value1 
                    , mt.value2                                                     AS value2
                    , ROW_NUMBER() OVER (PARTITION BY mt.ID ORDER BY mt.Date DESC)  AS row_num
        FROM        my_table mt
    )
    SELECT      tc.ID           AS ID 
                , tc.Date       AS Date 
                , tc.value1     AS value1
                , tc.value2     AS value2
    FROM        temp_cte tc 
    WHERE       tc.row_num = 1 
    ;

Or you can do MAX() and join the table to itself where ID = ID and max_date = Date. HTH.

Edit March 2022: Since ROW_NUMBER numbers every row and the user only cares about 1 row with the max date there's a better way to do this I discovered.

WITH temp_cte AS (
    SELECT      mt.ID                                                                               AS ID 
                , MAX(NAMED_STRUCT('Date', mt.Date, 'Value1', mt.value1, 'Value2', mt.Value2))      AS my_struct
    FROM        my_table mt
    GROUP BY    mt.ID 
)
SELECT      tt.ID                       AS ID 
            , tt.my_struct.Date         AS Date
            , tt.my_struct.Value1       AS Value1
            , tt.my_struct.Value2       AS Value2
FROM        temp_cte tt 
;