Search code examples
sqlt-sqlduplicate-datasql-delete

Tips for deduping a list based on priority column


I have a set of rows that contain duplicate entries because the data originates from multiples sources. I also have a separate reference table that indicates the priority of those data sources.

Does anyone have good tips for the most effective t-SQL to deduplicate this list?

Basically I have:

SELECT a.*, b.priority 
FROM tableA as a
JOIN tableB as b
ON b.sourceId = a.sourceId

I have been placing this into a temp table and then deleting in an odd way that I suppose could be more efficient.

DELETE ta
FROM #tmp ta
JOIN #tmp tb
ON ta.duplicateId = tb.duplicateId
WHERE ta.priority < tb.priority

Table A has the same columns for the two sources, but the data can differ - so they may have different prices. The challenge is that I must take the price (and all other info) from the row that comes from the source with the highest priority. To complicate matters, I do not have data for every item from ALL sources.

So, item 1 may have data from source A and B, while item 2 may only have it from source B and C. Thus, the deletion needs to happen on a per-unique item basis.


Solution

  • I think you could do something like this:

    SELECT a.*, b.priority 
    FROM tableA as a
    JOIN tableB as b
    ON b.sourceId = a.sourceId and b.priority = (select max(priority) from tableB where b.sourceId = a.sourceId)
    

    I can't remember if tSql will have a in scope for the subquery or not though.