The problem
I have a sql database with a table for Hashtags, of which many are duplicates with regard to their names.
A statement like
SELECT *
FROM HashTag
ORDER BY Name
returns something like
Id | Name
1947 | test
1950 | sample
1962 | test
1963 | sample
1986 | test
2014 | example
I want to keep only the hashtag with the lowest Id for each Name (1947 for 'test' and 1950 for 'sample') and update other tables with this Id, replacing the higher Ids (example: updating hashtag 'test'; lowest Id = 1947, higher Ids = 1962, 1986). These sql statements are updated manually as of now and would be as follows:
UPDATE HashTaggedActivity
SET [HashTag_id] = 1947
WHERE HashTag_id in (1962, 1986)
Update HashTaggedGroup
SET [HashTag_id] = 1947
WHERE HashTag_id in (1962, 1986)
DELETE ht
FROM HashTag ht
WHERE ht.Id in (1962, 1986)
After this I have to do this for HashTag 'sample', which is an error prone and tedious process. The HashTag 'example' is not a duplicate and should not result in updating other tables.
Is there any way to write an sql statement for doing this for each occurence of duplicate names in the table HashTag?
What I've tried so far
I think I have to combine a statement for getting a duplicate count ordered by Id
select ht.Id, ht.Name, htc.dupeCount
from HashTag ht
inner join (
SELECT ht.Name, COUNT(*) AS dupeCount
FROM HashTag ht
GROUP BY ht.Name
HAVING COUNT(*) > 1
) htc on ht.Name = htc.Name
ORDER BY Id
which gives
Id | Name | dupeCount
1947 | test | 3
1950 | sample | 2
1962 | test | 3
1963 | sample | 2
1986 | test | 3
2014 | example | 1
with my UPDATE and DELETE statements according to the dupeCount, but I'm not sure how to do this ;-)
Thanks in advance and best regards,
Michael
The first two update statements first get the name based on the hashtag_id (innermost select), then get the minimum of all ids in hashtag that share the same name (next select) to then update the hashtag_id accordingly. In this case, it will also update the records with hashtag_id 1947 and 1950 - but the new value will be identical to the old value.
update HashTaggedGroup
set hashtag_id =
(select min(id)
from hashtag h1
where (
select name
from hashtag h2
where h2.id=HashTaggedGroup.hashtag_id)=h1.name);
update HashTaggedActivity
set hashtag_id =
(select min(id)
from hashtag h1
where (
select name
from hashtag h2
where h2.id=HashTaggedActivity.hashtag_id)=h1.name);
The delete as below will work for Mysql and SQLServer, it may need adjustment for other DBs (the idea remains the same though). If you are certain that all ids from hashtag are present in HashTaggedActivity, that would make it possible to have the query simpler.
delete h1 from hashtag as h1
inner join hashtag as h2 on
h1.name = h2.name and
h1.id > h2.id;