Search code examples
sqlwindow-functionsamazon-athenapresto

row_number is not unique for duplicate records


I am trying to find the latest update of a particular row from a bunch of rows per uuid. For that we use row_number() over a partition as shown below,

"row_number"() OVER (
    PARTITION BY "uuid"
    ORDER BY "uuid" ASC,
        "status"."sequence" DESC,
        "modifiedon" DESC
) "row_id_ranked",

After this, anything other than 'row_id_ranked = 1' are discarded. But when a retry/re-transmission happens, "status"."sequence" and "modifiedon" will be exactly same. This creates mutiple rows with row_id_ranked=1. Making it difficult to remove them. As per my understanding per the documentation, the row_number should be a unique value. But in case of these duplicates, it clearly is not.

row_number() → bigint#

Returns a unique, sequential number for each row, starting with one, according to the ordering of rows within the window partition.

How do I de-duplicate in this case?


Solution

  • Assuming that there does not exist some third column beyond the status sequence and modified on date to break a potential tie, and also assuming you don't care which single record gets retained, you may use RANDOM() here:

    ROW_NUMBER() OVER (PARTITION BY uuid
                       ORDER BY "status"."sequence" DESC, modifiedon DESC, RANDOM())
        AS row_id_ranked