Search code examples
sqlhiveduplicateshiveqlhive-partitions

Hive table deduplication across multiple partitions


I am trying to de duplicate a table that may have duplicates across partitions. For example

id   device_id  os   country unix_time app_id      dt
2       2       3a      UK     7       5       2019-12-22
1       2       3a      USA     4       5       2019-12-22
1       2       3a      USA     4       5       2019-12-23
1       2       3a      USA     4       5       2019-12-24

It can be seen that the table has similar column values apart from the 'dt' which is the partition column. I want to de duplicate such a table where similar records in older partitions will be dropped leaving the record in the newest partition. For example, the table above should look like the table below after deduplication.

id   device_id  os   country unix_time app_id      dt
2       2       3a      UK     7       5       2019-12-22
1       2       3a      USA     4       5       2019-12-24

Solution

  • Using row_number you can filter duplicates:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    
    INSERT OVERWRITE TABLE table_name_de_duplicated PARTITION (dt)    
    select id, device_id, os, country, unix_time, app_id, dt
      from 
         (select id, device_id, os, country, unix_time, app_id, dt,
                 row_number() over(partition by id, device_id, os, country, unix_time, app_id order by dt desc ) rn
            from table_name 
         ) s
      where rn=1
    distribute by dt --to reduce the number of created files
    ;
    

    Some partitions may disappear after de-duplication but insert overwrite will not remove them if you are rewriting the same table, so, if you want to save data in the same table, then better create table like initial one, insert deduplicated data in it, then remove initial table and rename de-duplicated one.