Imagine I have this data set:
serial_id | name | address_id | id_duplicates | dob
_______________________________________________________
1 | JOHN | QWERTY | NULL | 10/2001
2 | JOHN | QWERTY | NULL | 10/2001
3 | JOHN | AZERTY | NULL | 10/2001
4 | JOHN | QWERTY | NULL | 09/2001
5 | MARY | QWERTY | NULL | 10/2001
6 | MARY | AZERTY | NULL | 10/2001
7 | MARY | AZERTY | NULL | 10/2001
I want to fill id_duplicates
with any of the serial_id
when records match on some conditions.
If I want records with same match name
, address_id
and dob
to share a single id from the serial_id
column, I would then for example have:
serial_id | name | address_id | id_duplicates | dob
_______________________________________________________
1 | JOHN | QWERTY | 1 | 10/2001 --> match
2 | JOHN | QWERTY | 1 | 10/2001 --> match
3 | JOHN | AZERTY | 3 | 10/2001 --> no match on address_id
4 | JOHN | QWERTY | 4 | 09/2001 --> no match on dob
5 | MARY | QWERTY | 5 | 10/2001 --> no match on name
6 | MARY | AZERTY | 6 | 10/2001 --> match
7 | MARY | AZERTY | 6 | 10/2001 --> match
I've been miserably trying to do this with nested queries which I am ebarassed to post as they make no sense...
any help would be appreciated!
You can use dense_rank()
:
select t.*,
dense_rank() over (order by name, address, dob) as id_duplicate
from t;
If you want this in an update
, here is one method:
update t
set id_duplicate = tt.new_id_duplicate
from (select t.*,
dense_rank() over (order by name, address, dob) as new_id_duplicate
from t
) tt
where tt.serial_id = t.serial_id;