Search code examples
sqlduplicatesmin

SQL: removing duplicates based on different criteria, actually creates new records


I have a data base (dbo) with duplicates. In particular, one employee can work two roles (Role Number) in the same business (Business code) or work two / the same role within different business in the same or different area (Area Code), see below:

enter image description here

What I want is to remove duplicate records. Thus, I created this code:

Select 
   dbo.year,
   min(dbo.RoleNumber) AS Role,
   min(dbo.AreaCode) AS Area,
   min(dbo.BusinessCode) AS BCode,
   dbo.EmployeeNumber
From dbo
Group by dbo.year, dbo.EmployeeNumber

This code works well when an individual works the lowest role in a business with the lowest number and in the lowest area (e.g., row n* 3 and 4 in my example) or where the area code and business code are the same in the duplicate records (e.g., row n* 1 and 2).

However, I have some cases where an individual’s lowest role is associated with a higher Business code or/and area code. In this case, SQL creates new records combining these elements see examples below:

rows 5-10:  2018, 651, 5110, 3, 17;
rows 11-13: 2018, 649, 6215, 4, 20;
rows 14-15: 2018, 750, 5101, 5, 24.

This is not a problem per se, but it is problematic when I join tables to get additional data for these employees. The key elements to join tables are Area and business codes and employee's number, however with my code SQL is creating new records that do not exist in other tables, this leads to additional data being NULL.

Is there a way to fix this? I need SQL to always select the lowest Role number first, if the role number is the same then the lowest establishment number should be selected and if the same, the lowest Area code should finally be selected.

So for instance, I would expect that the three records creating problems would be retrieved like this:

rows 5-10:  2018, 651, 6319, 3, 17;
rows 11-13: 2018, 650, 6215, 4, 20;
rows 14-15: 2018, 750, 8076, 5, 24.

Thank you

Silvia


Solution

  • you can use window function:

    select * from 
      ( 
        select * , row_number() over (partition by year, employeenumber order by rolenumber,businesscode,areacode) rn 
        from youratble
      ) t
    where rn = 1
    

    you can play with order by inside the window function to choose the row you want.