sql sql-server t-sql duplicates distinct

Filter on specific columns and return all columns

I am trying to left join two tables and retrieve all columns from table one but remove duplicates based on a set of columns.

SELECT A.*, B.impact
FROM #Site_one AS A WITH (NOLOCK)
LEFT JOIN #Progress AS B With (NOLOCK)
   ON lower(A.site_code) = lower(B.site_code)
GROUP BY A.date, A.operationid, A.worklocation, A.siteid, A.alias

This does not work as there will be column in A which either need to be aggregated or be added to the group by clause. The issue with that is that I do not want to filter on those columns and do not want them aggregated.

Is there a way to select all columns in A and the impact column in B and still be able to filter out duplicates on the columns specified in the group by clause?

Any pointers/help would be greatly appreciated.

Solution

and still be able to filter out duplicates on the columns specified in the group by clause

But, how does the database really know which rows to throw away? Suppose you have:

Person
John, 42, Stockbroker
John, 36, Train driver
John, 58, Retired
John, 58, Metalworker

And you think "I wanna dedupe those based on the name":

SELECT * FROM person GROUP BY name

So which three Johns should the DB throw away?

It cannot decide this for you; you have to write the query to make it clear what you want to keep or throw

You could MAX everything:

SELECT name, MAX(age), MAX(job) FROM person GROUP BY name

That'll work.. but it gives you a John that never existed in the original data:

John, 58, Train driver

You could say "I'll only keep the person with the max age":

SELECT p.* 
FROM
  person p 
  INNER JOIN (SELECT name, max(age) as maxage FROM person GROUP BY name) maxp
  ON p.name = maxp.name AND p.age = maxp.maxage

.. but there are two people with the same max age.

Your DB might have a row number analytic, which is nice:

SELECT *, row_number() over(PARTITION BY name ORDER BY age DESC) rn
FROM person

One of your 58 year old Johns will get row number 1 - can't be sure which one, but you could then discard all the rows with an rn > 1:

WITH x as (    
  SELECT *, row_number() over(PARTITION BY name ORDER BY age DESC) rn
  FROM person
)
SELECT name, age, job 
INTO newtable
FROM x
WHER rn = 1

..but what if you discarded the wrong John...

You're going to have to go and think about this some more, and exactly specify what to throw away...