Search code examples
sqlmysqlduplicatesgreatest-n-per-groupmysql-5.5

SQL deduplication with criteria that has non-unique values


I have a table called cc_index and the table has among others the following columns: length, date, url, [other_colums].

The same url may appear in multiple rows. For each url, I only want to retain a single row (but all columns). I want to retain a row that has the longest length, unfortunately, this row may not be unique. In the cases, where multiple rows have the highest value for length for the same url, I want to retain the row that has the latest (highest) date, date values are unique for each row.

To clarify this is an example of a possible input table:

url length date other
abc.com 42 292 adsk
abc.com 36 312 sdlkf
abc.com 42 281 sdjl
def.com 12 210 tom
def.com 18 112 vkr
ghi.com 29 292 tom
ghi.com 29 103 tyr
ghi.com 29 101 vnv

The output I desire is as follows:

url length date other
abc.com 42 292 adsk
def.com 18 112 vkr
ghi.com 29 292 tom

I know that I can run a query like this for deduplication

SELECT t1.*
FROM ccindex t1
INNER JOIN (SELECT url, MAX(length) AS max_length FROM ccindex GROUP BY url) t2
ON t1.url=t2.url AND t1.length = t2.max_length

However, in the case where the criterion is not unique (length) this will return all rows with the maximum length for each URL, and hence multiple rows for the same URL, i.e. the output will be

Undesired output:

url length date other
abc.com 42 292 adsk
abc.com 42 281 sdjl
def.com 18 112 vkr
ghi.com 29 292 tom
ghi.com 29 103 tyr
ghi.com 29 101 vnv

How can I modify this query to select only a single row for each url that has length=max_length?

Follow-ups: If I don't care about the date, and want to return a single arbitrary row that has the maximum length can performance be improved significantly?


Solution

  • We can use DENSE_RANK here.

    The PARTITION BY part builds the ranking per url, the ORDER BY clause makes sure to begin with highest length and if identic with the latest date:

    WITH rankedData AS
    (SELECT
      url, length, date, other,
      DENSE_RANK() OVER(PARTITION BY url ORDER BY length DESC, date DESC) AS ranking
    FROM ccindex)
    SELECT
      url, length, date, other
    FROM rankedData 
    WHERE ranking = 1
    ORDER BY url;
    

    Add further columns to both sub query and main query if you want them to be shown.

    To your question...

    If I don't care about the date, and want to return a single arbitrary row that has the maximum length can performance be improved significantly?

    ...I don't believe checking for the latest date will have much impact on your performance. But we don't know how big your table is, what indexes it has etc. So I recommend to simply try out both queries and compare the execution time.

    The more important thing about performance is certainly to use an up-to-date version of your RDBMS. So this answer assumes (as your mentioned in your last comment) you already use it or you will upgrade.

    Try out above queries on this sample fiddle with your data.

    Have a look in the documentation about DENSE_RANK