sql  greatest-n-per-group  amazon-athena  presto  trino

How to drop duplicate rows based on multiple column values in Amazon Athena?


I have a big table in Athena (200 GB+) with many columns, plus an ID column built from the combination of values in several of those columns. Example below:

ID            col1         col2         col3         timestamp      ...      coln
v1v2TSvn        v1          v2            v3            TS                     vn
v1v2TSvn        v1          v2            v3            TS                     vn
v1v2TSvn        v1          v2            v3            TS                     vn
.
.

I noticed that some rows are repeated and differ in only one place, the timestamp column (it's the same data, recorded at different timestamps).
Example:

v1v2TSvn        v1          v2            v3            9AM01                    vn
v1v2TSvn        v1          v2            v3            9AM02                    vn

I want to drop the extra duplicated rows and keep only the first appearance of each data row (here I want to keep the 9AM01 row).
I tried SELECT DISTINCT and several other approaches, but none of them worked.
I was at least able to get the number of repeated rows for each unique row with this query:

SELECT "col1", "col2","col8","col13", 
    COUNT(*) AS CNT
FROM "mydatabase"."myTable"
GROUP BY "col1", "col2","col8","col13"
HAVING COUNT(*) > 1;

results:

ID            col1         col2         col8         coln         CNT
v1v2TSvn        v1          v2            v3            vn         3
v1v2TSvn        v1          v2            v3            vn         7
v1v2TSvn        v1          v2            v3            vn         2
.
.

This gave me the expected duplicate count for each row.
My end objective is to get rows that have no duplicates, based on the values of, for example, these 4 columns.
Any help, please?

NOTE: most of my row values are STRINGS even the timestamp!


Solution

  • If there is only a single column that you want to "exclude" from the grouping, you can use the min aggregate function on it. Something along these lines:

    SELECT "col1", "col2","col8","col13", 
       min(timestamp) timestamp
    FROM "mydatabase"."myTable"
    GROUP BY "col1", "col2","col8","col13";
    

    If only a few columns determine uniqueness and all the rest should come from the "first" row, the easiest option is the row_number window function (though for 1-2 extra columns the previous approach can be extended with min_by - select ..., min(timestamp) timestamp, min_by(colN, timestamp) colN, ...). Something like the following:

    select col1, col2, col8, col13, timestamp ....
    from (
        SELECT *,
           row_number() over(partition by col1, col2, col8, col13 order by timestamp) rn
        FROM "mydatabase"."myTable")
    where rn = 1
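
The min_by variant mentioned above is only described inline, so here is a minimal sketch of how it might look written out. The names col5 and col7 are hypothetical stand-ins for the one or two extra columns that should be taken from the earliest row of each group; they do not come from the question.

```sql
-- Sketch only: col5 and col7 are hypothetical column names.
-- One output row per (col1, col2, col8, col13) group.
SELECT "col1", "col2", "col8", "col13",
       min("timestamp")            AS "timestamp",  -- smallest timestamp in the group
       min_by("col5", "timestamp") AS "col5",       -- col5 from the row with that timestamp
       min_by("col7", "timestamp") AS "col7"        -- col7 from the row with that timestamp
FROM "mydatabase"."myTable"
GROUP BY "col1", "col2", "col8", "col13";
```

Since the timestamp is stored as a string, min, min_by and the ORDER BY inside row_number all compare it as text. That picks the chronologically first row only if the string format happens to sort in time order; otherwise the value would need to be converted to a real timestamp first.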