Search code examples
sqldb2

SQL: Keeping One Instance of Each Duplicate Set


This is a problem I always struggle with.

Suppose I have the following table (table_a):

CREATE TABLE table_a (
    name INT,
    height INT,
    year INT
);

INSERT INTO table_a (name, height, year)
VALUES
(1, 180, 2010),
(1, 180, 2010),
(1, 180, 2010),
(2, 175, 2020),
(2, 175, 2020),
(3, 150, 2019);


  name height year
    1    180 2010
    1    180 2010
    1    180 2010
    2    175 2020
    2    175 2020
    3    150 2019

I only want each name to appear once in the final result - I don't care which one:

  name height year
    1    180 2010
    2    175 2020
    3    150 2019

Normally, I would have used the rank/partition statements and order by some non-unique column. However in this case, there is no non-unique column. Thus, ranking will give each name the same rank, so I can't do a rank/partition and select where rank=1.

I thought of this indirect way to solve the problem:

select * from (

select *,
 row_number() over (partition by name order by (select null)) as rn
from table_a)g

where rn=1);

Is this the standard way to solve this kind of problem?

Thanks!


Solution

  • A regular distinct select should work here:

    SELECT DISTINCT name, height, year
    FROM table_a
    ORDER BY name;
    

    But this assumes that all 3 columns would always have the same value per name group. If this is not the case, then you could use ROW_NUMBER() with a random ordering:

    WITH cte AS (
        SELECT t.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY RAND()) rn
        FROM table_a t
    )
    
    SELECT name, height, year
    FROM cte
    WHERE rn = 1;