Search code examples
sqlsql-servert-sqlduplicatesdistinct

Filter on specific columns and return all columns


I am trying to left join two tables and retrieve all columns from table one but remove duplicates based on a set of columns.

SELECT A.*, B.impact
FROM #Site_one AS A WITH (NOLOCK)
LEFT JOIN #Progress AS B With (NOLOCK)
   ON lower(A.site_code) = lower(B.site_code)
GROUP BY A.date, A.operationid, A.worklocation, A.siteid, A.alias

This does not work as there will be column in A which either need to be aggregated or be added to the group by clause. The issue with that is that I do not want to filter on those columns and do not want them aggregated.

Is there a way to select all columns in A and the impact column in B and still be able to filter out duplicates on the columns specified in the group by clause?

Any pointers/help would be greatly appreciated.


Solution

  • and still be able to filter out duplicates on the columns specified in the group by clause

    But, how does the database really know which rows to throw away? Suppose you have:

    Person
    John, 42, Stockbroker
    John, 36, Train driver
    John, 58, Retired
    John, 58, Metalworker
    

    And you think "I wanna dedupe those based on the name":

    SELECT * FROM person GROUP BY name
    

    So which three Johns should the DB throw away?

    It cannot decide this for you; you have to write the query to make it clear what you want to keep or throw

    You could MAX everything:

    SELECT name, MAX(age), MAX(job) FROM person GROUP BY name
    

    That'll work.. but it gives you a John that never existed in the original data:

    John, 58, Train driver
    

    You could say "I'll only keep the person with the max age":

    SELECT p.* 
    FROM
      person p 
      INNER JOIN (SELECT name, max(age) as maxage FROM person GROUP BY name) maxp
      ON p.name = maxp.name AND p.age = maxp.maxage
    

    .. but there are two people with the same max age.

    Your DB might have a row number analytic, which is nice:

    SELECT *, row_number() over(PARTITION BY name ORDER BY age DESC) rn
    FROM person
    

    One of your 58 year old Johns will get row number 1 - can't be sure which one, but you could then discard all the rows with an rn > 1:

    WITH x as (    
      SELECT *, row_number() over(PARTITION BY name ORDER BY age DESC) rn
      FROM person
    )
    SELECT name, age, job 
    INTO newtable
    FROM x
    WHER rn = 1
    

    ..but what if you discarded the wrong John...


    You're going to have to go and think about this some more, and exactly specify what to throw away...