Search code examples
sqlpostgresqlduplicatessubqueryimputation

impute duplicate record id from conditions on other columns sql


Imagine I have this data set:

serial_id | name | address_id | id_duplicates | dob
_______________________________________________________
1         | JOHN | QWERTY     | NULL          | 10/2001
2         | JOHN | QWERTY     | NULL          | 10/2001
3         | JOHN | AZERTY     | NULL          | 10/2001
4         | JOHN | QWERTY     | NULL          | 09/2001
5         | MARY | QWERTY     | NULL          | 10/2001
6         | MARY | AZERTY     | NULL          | 10/2001
7         | MARY | AZERTY     | NULL          | 10/2001

I want to fill id_duplicates with any of the serial_id when records match on some conditions.

If I want records with same match name, address_id and dob to share a single id from the serial_id column, I would then for example have:

serial_id | name | address_id | id_duplicates | dob
_______________________________________________________
1         | JOHN | QWERTY     | 1             | 10/2001  --> match
2         | JOHN | QWERTY     | 1             | 10/2001  --> match
3         | JOHN | AZERTY     | 3             | 10/2001  --> no match on address_id
4         | JOHN | QWERTY     | 4             | 09/2001  --> no match on dob
5         | MARY | QWERTY     | 5             | 10/2001  --> no match on name
6         | MARY | AZERTY     | 6             | 10/2001  --> match 
7         | MARY | AZERTY     | 6             | 10/2001  --> match

I've been miserably trying to do this with nested queries which I am ebarassed to post as they make no sense...

any help would be appreciated!


Solution

  • You can use dense_rank():

    select t.*,
           dense_rank() over (order by name, address, dob) as id_duplicate
    from t;
    

    If you want this in an update, here is one method:

    update t
        set id_duplicate = tt.new_id_duplicate
        from (select t.*, 
                     dense_rank() over (order by name, address, dob) as new_id_duplicate
              from t
             ) tt
        where tt.serial_id = t.serial_id;