I have a table called cc_index and the table has among others the following columns: length, date, url, [other_colums].
The same url may appear in multiple rows. For each url, I only want to retain a single row (but all columns). I want to retain a row that has the longest length, unfortunately, this row may not be unique. In the cases, where multiple rows have the highest value for length for the same url, I want to retain the row that has the latest (highest) date, date values are unique for each row.
To clarify this is an example of a possible input table:
url | length | date | other |
---|---|---|---|
abc.com | 42 | 292 | adsk |
abc.com | 36 | 312 | sdlkf |
abc.com | 42 | 281 | sdjl |
def.com | 12 | 210 | tom |
def.com | 18 | 112 | vkr |
ghi.com | 29 | 292 | tom |
ghi.com | 29 | 103 | tyr |
ghi.com | 29 | 101 | vnv |
The output I desire is as follows:
url | length | date | other |
---|---|---|---|
abc.com | 42 | 292 | adsk |
def.com | 18 | 112 | vkr |
ghi.com | 29 | 292 | tom |
I know that I can run a query like this for deduplication
SELECT t1.*
FROM ccindex t1
INNER JOIN (SELECT url, MAX(length) AS max_length FROM ccindex GROUP BY url) t2
ON t1.url=t2.url AND t1.length = t2.max_length
However, in the case where the criterion is not unique (length) this will return all rows with the maximum length for each URL, and hence multiple rows for the same URL, i.e. the output will be
Undesired output:
url | length | date | other |
---|---|---|---|
abc.com | 42 | 292 | adsk |
abc.com | 42 | 281 | sdjl |
def.com | 18 | 112 | vkr |
ghi.com | 29 | 292 | tom |
ghi.com | 29 | 103 | tyr |
ghi.com | 29 | 101 | vnv |
How can I modify this query to select only a single row for each url that has length=max_length?
Follow-ups: If I don't care about the date, and want to return a single arbitrary row that has the maximum length can performance be improved significantly?
We can use DENSE_RANK
here.
The PARTITION BY
part builds the ranking per url
, the ORDER BY
clause makes sure to begin with highest length
and if identic with the latest date
:
WITH rankedData AS
(SELECT
url, length, date, other,
DENSE_RANK() OVER(PARTITION BY url ORDER BY length DESC, date DESC) AS ranking
FROM ccindex)
SELECT
url, length, date, other
FROM rankedData
WHERE ranking = 1
ORDER BY url;
Add further columns to both sub query and main query if you want them to be shown.
To your question...
If I don't care about the date, and want to return a single arbitrary row that has the maximum length can performance be improved significantly?
...I don't believe checking for the latest date will have much impact on your performance. But we don't know how big your table is, what indexes it has etc. So I recommend to simply try out both queries and compare the execution time.
The more important thing about performance is certainly to use an up-to-date version of your RDBMS. So this answer assumes (as your mentioned in your last comment) you already use it or you will upgrade.
Try out above queries on this sample fiddle with your data.
Have a look in the documentation about DENSE_RANK