Search code examples
sqlduplicatessnowflake-cloud-data-platform

How to find duplicates (correct way)?


I am using Snowflake database and ran this query to find total count, number of distinct records and difference:

select 
    (select count(*) from mytable) as total_count, 
    (select count(*) from (select distinct * from mytable)) as distinct_count,
    (select count(*) from mytable) - (select count(*) from (select distinct * from mytable)) as duplicate_count
from mytable limit 1;

Result:

1,759,867
1,738,924
20,943 (duplicate_count)

But when try with the other approach (group ALL columns and find where count is > 1):

select count(*) from (
SELECT 
    a, b, c, d, e,
    COUNT(*)
FROM 
    mytable
GROUP BY 
    a, b, c, d, e
HAVING 
    COUNT(*) > 1
)

I get 5,436.

Why there is a difference in number of duplicates? (20,943 vs 5,436)

Thanks.


Solution

  • Okay. Let's start from one simple example:

    create table #test
    (a int, b int, c int, d int, e int)
    
    insert into #test values (1,2,3,4,5)
    insert into #test values (1,2,3,4,5)
    insert into #test values (1,2,3,4,5)
    insert into #test values (1,2,3,4,5)
    insert into #test values (1,2,3,4,5)
    insert into #test values (5,4,3,2,1)
    insert into #test values (5,4,3,2,1)
    insert into #test values (1,1,1,1,1)
    

    And try your subquery to understand what you will get:

    SELECT 
        a, b, c, d, e,
        COUNT(*)
    FROM 
        #test
    GROUP BY 
        a, b, c, d, e
    HAVING 
        COUNT(*) > 1
    

    Think about a while...

    Dang Dang Dang Dang ~

    a   b   c   d   e   (No column name)
    1   2   3   4   5   5
    5   4   3   2   1   2
    

    It will only return two rows because you used 'group by'. But it still counted the duplicate numbers for each a,b,c,d,e combinations.

    If you want the total number of the duplicates, try this:

    select sum(sub_count) from (
    SELECT 
        a, b, c, d, e,
        COUNT(*) - 1 as sub_count
    FROM 
        #test
    GROUP BY 
        a, b, c, d, e
    HAVING 
        COUNT(*) > 1)a
    

    You need to minus one in this case if I understand your original queries correctly. Correct me if I am wrong.