Search code examples
sqlsql-serverduplicate-datasql-delete

SQL Duplicate Delete Query over Millions of Rows for Performance


This has been an adventure. I started with the looping duplicate query located in my previous question, but each loop would go over all 17 million records, meaning it would take weeks (just running *select count * from MyTable* takes my server 4:30 minutes using MSSQL 2005). I gleamed information from this site and at this post.

And have arrived at the query below. The question is, is this the correct type of query to run on 17 million records for any type of performance? If it isn't, what is?

SQL QUERY:

DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID in 
(SELECT RecordID
    FROM tl_acxiomimport.dbo.tblacxiomlistings
    EXCEPT
    SELECT RecordID
    FROM (
        SELECT RecordID, Rank() over (Partition BY BusinessName, latitude, longitude,           Phone ORDER BY webaddress DESC, caption1 DESC, caption2 DESC ) AS Rank
    FROM tl_acxiomimport.dbo.tblacxiomlistings
    ) al WHERE Rank = 1)

Solution

  • Seeing the QueryPlan would help.

    Is this feasible?

    SELECT m.*
    into #temp
    FROM tl_acxiomimport.dbo.tblacxiomlistings m 
    inner join (SELECT RecordID, 
                       Rank() over (Partition BY BusinessName, 
                                                 latitude,  
                                                 longitude,            
                                                 Phone  
                                    ORDER BY webaddress DESC,  
                                             caption1 DESC,  
                                             caption2 DESC ) AS Rank
                  FROM tl_acxiomimport.dbo.tblacxiomlistings
               ) al on (al.RecordID = m.RecordID and al.Rank = 1)
    
    truncate table tl_acxiomimport.dbo.tblacxiomlistings
    
    insert into tl_acxiomimport.dbo.tblacxiomlistings
         select * from #temp