Search code examples
sql-servert-sqlsql-server-2012duplicatesgreatest-n-per-group

Return all columns excluding rows with a duplicateID in one column


I have an interesting issue.

I inherited a sloppy database with a table that has duplicate rows. However, they are not exact duplicates due to one column(a text column).

Here is an example:

TestID            TestDescription                                               Cost
115893hc127aaq    Etiology    • Understand the causes of acute pancreatitis     $10
115893hc127aaq    Etiology • Understand   the causes of acute pancreatitis      $10
115893hc127aaq    Etiology  • Understand the causes of acute   pancreatitis     $10

You can see that all the data except the 'TestDescription' is identical.

There are 1000's of rows like this where there might be 2 or 3 duplicate rows with minor spacing or spelling issues in 'TestDescription'

Because of this, using DISTINCT won't work.

I want to SELECT all rows but only get one row for each TestID...lets say the first one, then ignore the rest.

I tried SELECT DISTINCT *

But I can't do this using DISTINCT because TestDescription contains minor differences between rows.

SELECT DISTINCT TestID works, but that only returns TestID and I need to see all columns.

Is there a way of doing this in Sql Server 2012?

Thanks!


Solution

  • One approach uses row_number():

    select *
    from (
        select t.*, row_number() over(partition by testid order by (select null)) rn
        from mytable t
    ) t
    where rn = 1
    

    This assumes that you want one row per testid, as your question suggests.

    You did not tell which column you want to use to break the ties, and I am unsure there is actually one, so I odered by (select null). This is not a deterministic order by clause, so consequent executions of the query might not always select the same row from a given duplicate group.