Search code examples
hadoopapache-sparkmapreducehive

Hive - Remove duplicates, keeping newest record - all of it


There have been a few questions like this, with no answer, like this one here.

I thought I would post another in hopes of getting one.

I have a hive table with duplicate rows. Consider the following example:

*ID             Date           value1         value2*
1001            20160101       alpha          beta
1001            20160201       delta          gamma
1001            20160115       rho            omega
1002            20160101       able           charlie
1002            20160101       able           charlie

When complete, I only want two records. Specifically, these two:

*ID             Date           value1         value2*
1001            20160201       delta          gamma
1002            20160101       able           charlie

Why those two? For the ID=1001, I want the latest date and the data that is in that row with it. For the ID=1002, really the same answer, but the two records with that ID are complete duplicates, and I only want one.

So, any suggestions on how to do this? The simple "group by" using the ID and the 'max' date won't work, as that ignores the other columns. I cannot put 'max' on those, as it will pull the max columns from all the records (will pull 'rho' from an older record), which is not good.

I hope my explanation is clear, and I appreciate any insight.

Thank you


Solution

  • Try this:

    WITH temp_cte AS (
            SELECT      mt.ID                                                           AS ID 
                        , mt.Date                                                       AS Date 
                        , mt.value1                                                     AS value1 
                        , mt.value2                                                     AS value2
                        , ROW_NUMBER() OVER (PARTITION BY mt.ID ORDER BY mt.Date DESC)  AS row_num
            FROM        my_table mt
        )
        SELECT      tc.ID           AS ID 
                    , tc.Date       AS Date 
                    , tc.value1     AS value1
                    , tc.value2     AS value2
        FROM        temp_cte tc 
        WHERE       tc.row_num = 1 
        ;
    

    Or you can do MAX() and join the table to itself where ID = ID and max_date = Date. HTH.

    Edit March 2022: Since ROW_NUMBER numbers every row and the user only cares about 1 row with the max date there's a better way to do this I discovered.

    WITH temp_cte AS (
        SELECT      mt.ID                                                                               AS ID 
                    , MAX(NAMED_STRUCT('Date', mt.Date, 'Value1', mt.value1, 'Value2', mt.Value2))      AS my_struct
        FROM        my_table mt
        GROUP BY    mt.ID 
    )
    SELECT      tt.ID                       AS ID 
                , tt.my_struct.Date         AS Date
                , tt.my_struct.Value1       AS Value1
                , tt.my_struct.Value2       AS Value2
    FROM        temp_cte tt 
    ;