Search code examples
sqlduplicatessql-updatesql-deletessms-17

Finding lowest Ids of duplicates and updating tables according to these Ids


The problem

I have a sql database with a table for Hashtags, of which many are duplicates with regard to their names.

A statement like

SELECT     *  
FROM       HashTag  
ORDER BY   Name

returns something like

Id   |  Name
1947 |  test
1950 |  sample
1962 |  test
1963 |  sample
1986 |  test
2014 |  example

I want to keep only the hashtag with the lowest Id for each Name (1947 for 'test' and 1950 for 'sample') and update other tables with this Id, replacing the higher Ids (example: updating hashtag 'test'; lowest Id = 1947, higher Ids = 1962, 1986). These sql statements are updated manually as of now and would be as follows:

UPDATE             HashTaggedActivity
SET [HashTag_id]   = 1947
WHERE HashTag_id   in (1962, 1986)

Update             HashTaggedGroup
SET [HashTag_id]   = 1947
WHERE HashTag_id   in (1962, 1986)

DELETE             ht
FROM               HashTag ht
WHERE              ht.Id in (1962, 1986)

After this I have to do this for HashTag 'sample', which is an error prone and tedious process. The HashTag 'example' is not a duplicate and should not result in updating other tables.

Is there any way to write an sql statement for doing this for each occurence of duplicate names in the table HashTag?

What I've tried so far

I think I have to combine a statement for getting a duplicate count ordered by Id

select ht.Id, ht.Name, htc.dupeCount
from HashTag ht
inner join (
    SELECT ht.Name, COUNT(*) AS dupeCount
    FROM HashTag ht
    GROUP BY ht.Name
    HAVING COUNT(*) > 1
) htc on ht.Name = htc.Name
ORDER BY Id

which gives

Id   |  Name    | dupeCount
1947 |  test    | 3
1950 |  sample  | 2
1962 |  test    | 3
1963 |  sample  | 2
1986 |  test    | 3
2014 |  example | 1

with my UPDATE and DELETE statements according to the dupeCount, but I'm not sure how to do this ;-)

Thanks in advance and best regards,

Michael


Solution

  • The first two update statements first get the name based on the hashtag_id (innermost select), then get the minimum of all ids in hashtag that share the same name (next select) to then update the hashtag_id accordingly. In this case, it will also update the records with hashtag_id 1947 and 1950 - but the new value will be identical to the old value.

    update HashTaggedGroup
     set hashtag_id = 
        (select min(id) 
         from hashtag h1 
         where (
            select name 
            from hashtag h2 
            where h2.id=HashTaggedGroup.hashtag_id)=h1.name);
    
    
    update HashTaggedActivity
     set hashtag_id = 
        (select min(id) 
         from hashtag h1 
         where (
            select name 
            from hashtag h2 
            where h2.id=HashTaggedActivity.hashtag_id)=h1.name);
    

    The delete as below will work for Mysql and SQLServer, it may need adjustment for other DBs (the idea remains the same though). If you are certain that all ids from hashtag are present in HashTaggedActivity, that would make it possible to have the query simpler.

    delete h1 from hashtag as h1 
        inner join hashtag as h2 on 
                  h1.name = h2.name and 
                  h1.id > h2.id;
    

    SQLFiddle for the above