Search code examples
sqlduplicatesclouderaimpalahue

SQL(Impala)- Finding duplicates values including null values


I have one table named as baseTable containing many columns however I am using 3 columns named as Material_Type, Material_Desc (Have null values), Material_Number to find duplicates using row_num and partition by. Note: I need to filter duplicates based on 3 conditions.

  1. When Material_Type = Material_Type
  2. When Material_Desc = Material_Desc
  3. When Material_Number <> Material_Number

Sample Table :

Material_Type  Material_Desc  Material_Number 
 ABC                XYZ              1
 ABC                XYZ              1
 ABC                XYZ              2
 ABC                XYZ              3
 DEF                IMM              1
 LMN                NULL             1
 LMN                NULL             2

I only want to have duplicate values in newTable and wanted to remove distinct values.

Desired Output :

Material_Type  Material_Desc  Material_Number  new
 ABC                XYZ              1          1
 ABC                XYZ              1          2
 ABC                XYZ              2          3
 ABC                XYZ              3          4
 LMN                NULL             1          1
 LMN                NULL             2          2

I used below query but not getting expected output as it's not including null values from column Material_Desc and didn't make partition with Null and also creates unwanted duplicate records.

query used :

create table newTable as 
with mycte as
(
select
m.MATERIAL_NUMBER
,m.MATERIAL_TYPE
,m.Material_Desc,
row_number() over(partition BY d.MATERIAL_TYPE,d.Material_Desc order by d.MATERIAL_NUMBER) as new
from baseTable m
inner join
(
select MATERIAL_NUMBER,MATERIAL_TYPE,Material_Desc,count(*) from baseTable group by
MATERIAL_NUMBER,MATERIAL_TYPE,Material_Desc having count(*) > 1
) d on d.MATERIAL_NUMBER <> m.MATERIAL_NUMBER and d.MATERIAL_TYPE=m.MATERIAL_TYPE 
and d.Material_Desc= m.Material_Desc)
select * from mycte 

Any help would be appreciated.


Solution

  • Just use row_number() and count(*) as a window function:

    select Material_Type, Material_Desc, Material_Number,
           row_number() over (partition by Material_Type, Material_Desc  order by Material_Number) as new
    from (select t.*,
                 count(*) over (partition by Material_Type, Material_Desc) as cnt
          from t
         ) t
    where cnt > 1;
    

    This works for the data you have provided by simply counting the rows for each type and description. If you really do need for the material numbers to be different, one method is min() and max():

    select Material_Type, Material_Desc, Material_Number,
           row_number() over (partition by Material_Type, Material_Desc order by Material_Number) as new
    from (select t.*,
                 min(Material_Number) over (partition by Material_Type, Material_Desc) as min_Material_Number,
                 max(Material_Number) over (partition by Material_Type, Material_Desc) as max_Material_Number
          from t
         ) t
    where min_Material_Number <> max_Material_Number;