Search code examples
sqldistinctprestotrino

Presto | Athena: Duplicated values in not key column


I have the following table:

CD_CLIENTE DE_CLIENTE
12345 Jhon One
23456 Jhon Two
34567 Jhon Three
12345 Jhon ONEe

Even using DISTINCT for CD_CLIENT I have it two times in my Result because the CD_CLIENT is equal, but DE_CLIENTE is different

My question:

select distinct taxid as CD_CLIENTE, name as DE_CLIENTE
FROM my_table_name
group by distinct taxid, name

I need this query to achive this result, without the duplicated CD_CLIENTE:

CD_CLIENTE DE_CLIENTE
12345 Jhon One
23456 Jhon Two
34567 Jhon Three

I tryed this:

select distinct taxid as CD_CLIENTE, name as DE_CLIENTE
FROM my_table_name
group by distinct taxid, name
CD_CLIENTE DE_CLIENTE
12345 Jhon One
23456 Jhon Two
34567 Jhon Three
12345 Jhon ONEe

And need this:

CD_CLIENTE DE_CLIENTE
12345 Jhon One
23456 Jhon Two
34567 Jhon Three

Solution

  • You are grouping both by taxid, name so you will get all unique combinations of taxid and name in the output. DISTINCT is irrelevant here since it uses all columns in the select to perform comparisons for uniqueness (i.e. you can remove either group by or distinct the result will be the same).

    One option is to remove name from the group by clause and use aggregate like arbitrary or max:

    -- sample data
    WITH dataset(taxid, name) AS (
        VALUES (12345,  'Jhon One'),
            (23456, 'Jhon Two'),
            (34567, 'Jhon Three'),
            (12345, 'Jhon ONEe')
    )
    
    -- query
    select taxid, max(name)
    from dataset
    group by taxid;
    

    Output:

    taxid _col1
    12345 Jhon One
    34567 Jhon Three
    23456 Jhon Two

    if you have some third column you can use to resolve "collisions" for name you can use max_by/min_by (max_by(name, some_3rd_col)).