sql-server t-sql sql-server-2012 duplicates greatest-n-per-group

Return all columns excluding rows with a duplicateID in one column

I have an interesting issue.

I inherited a sloppy database with a table that has duplicate rows. However, they are not exact duplicates due to one column(a text column).

Here is an example:

TestID            TestDescription                                               Cost
115893hc127aaq    Etiology    • Understand the causes of acute pancreatitis     $10
115893hc127aaq    Etiology • Understand   the causes of acute pancreatitis      $10
115893hc127aaq    Etiology  • Understand the causes of acute   pancreatitis     $10

You can see that all the data except the 'TestDescription' is identical.

There are 1000's of rows like this where there might be 2 or 3 duplicate rows with minor spacing or spelling issues in 'TestDescription'

Because of this, using DISTINCT won't work.

I want to SELECT all rows but only get one row for each TestID...lets say the first one, then ignore the rest.

I tried SELECT DISTINCT *

But I can't do this using DISTINCT because TestDescription contains minor differences between rows.

SELECT DISTINCT TestID works, but that only returns TestID and I need to see all columns.

Is there a way of doing this in Sql Server 2012?

Thanks!

Solution

One approach uses row_number():

select *
from (
    select t.*, row_number() over(partition by testid order by (select null)) rn
    from mytable t
) t
where rn = 1

This assumes that you want one row per testid, as your question suggests.

You did not tell which column you want to use to break the ties, and I am unsure there is actually one, so I odered by (select null). This is not a deterministic order by clause, so consequent executions of the query might not always select the same row from a given duplicate group.