Search code examples
sql-serverduplicatesmax

T-SQL Delete duplicates leaving the one with max value from another table


I'd like to ask you for a help in creating a query to remove the duplicates, but leaving the record with max value from another table.

I have a particular design of a SQL Server database.

Table Main:

|ID|  URL     | some other columns
+--+----------+
|1 |  URL1    |
|2 |  URL1    | 
|3 |  URL2    |

Table Populars:

|ID|  MainID  |UseCount | SearchString
+--+----------+---------+--------------
|1 |  3       | 124     | Phrase1
|2 |  2       | 234     | Phrase2
|3 |  1       | 123     | Phrase3
|4 |  1       | 7       | Phrase4

So in Main table I've got list of URLs where some of URLs are duplicated. In Populars table I've got records with UseCount of each URL - there may be more than one record here for one URL, where Main.ID = Populars.MainID, but there may be also no related record in Populars for record in Main.

The problem is to remove duplicates from Main leaving the one which has the highest UseCount in Populars.

I tried multiple approaches, but didn't find the solution...

This query:

WITH ToDelete AS 
(
    SELECT 
        *,
        RN = ROW_NUMBER() OVER (PARTITION BY URL ORDER BY id) 
    FROM 
        Main
)
DELETE FROM ToDelete 
WHERE RN > 1

removes duplicates properly, but leaves 'random' (from perspective of UseCount) record.

I have also an idea to create an additional column in the Main table, match all URLs in Main with MAX(Usecount) from Populars and insert this value into the new column and then try to remove duplicates. This would resolve the problem with multiple records in Populars related to each URL in main...

However I'm not very experienced in SQL queries...

At the very end orphan records (which have no related record in main) from Populars should be also removed.

I haven't found anything suitable for my problem - all similar questions are related to data in just one table - I'm struggling with it for few days trying to find a solution.

Do you have some ideas to reach the result?


Solution

  • OK, I've found a solution.

    I've created a view:

    create view [dbo].[MainWithUseCount] 
    as
    select  *, (select max(usecount) as MaxUseCont from [dbo].[Populars] where MainID = ID) as MaxUseCount from Main 
    GO
    

    And after that, used:

            WITH ToDelete AS 
            (SELECT *,RN=ROW_NUMBER() OVER(PARTITION BY URL ORDER BY MaxUseCount DESC) 
             FROM MainWithUseCount)
     delete FROM ToDelete WHERE RN>1