Search code examples
sqlpostgresqlcounthistogram

How to COUNT duplicate rows?


I want to be able to create a histogram out of a tuple containing two integers values.

Here it is the query:

 SELECT temp.ad_id, temp.distance  as hits FROM ( 
 'UNION ALL .join(cupound_query)' # python
) as temp GROUP BY temp.ad_id,temp.distance 

For this input:

(51, 5)
(51, 0)
(51, 3)
(51, 0)
(88, 2)
(88, 2)
(88, 2)
(84, 1)
(81, 9)

Would be:

(88,2) : 3
(51,0) : 2
(51,3) : 1
(51,5) : 1
(84,1) : 1
(81,9) : 1

How can I create a histogram of those values?
In other words, how can I count how many times a row has a duplicate?


Solution

  • The question leaves room for interpretation. This test case shows 2 nested steps:

    CREATE TABLE tbl (ad_id int, distance int);
    INSERT INTO tbl VALUES
      (510, 0), (956, 3), (823, 3), (880, 2)
    , (523, 3), (467, 0), (843, 1), (816, 9)
    , (533, 4), (721, 7), (288, 3), (900, 3)
    , (526, 9), (750, 7), (302, 8), (463, 6)
    , (742, 8), (804, 2), (62,  7), (880, 2)
    , (523, 3), (467, 0), (843, 1), (816, 9)
    , (533, 4), (721, 7), (288, 3), (900, 3)
    , (526, 9), (750, 7), (302, 8), (816, 9)
    , (533, 4), (721, 7), (288, 3), (900, 3)
    , (533, 4), (721, 7), (288, 3), (396, 5)
    ;
    

    How many duplicates per value?

    SELECT ad_id, count(*) AS ct FROM tbl GROUP BY 1;
    

    Result:

    ad_id  | ct
    -------+----
    62     | 1
    288    | 4
    302    | 2
    396    | 1
    ...
    

    Read: ad_id 62 exists 1x, ad_id 288 exists 4x, ...

    How to count how many times rows have duplicates?

    SELECT ct, count(*) AS ct_ct
    FROM  (SELECT ad_id, count(*) AS ct FROM tbl GROUP BY 1) sub
    GROUP  BY 1
    ORDER  BY 1;
    

    Result:

     ct | ct_ct
    ----+-------
    1   | 8
    2   | 7
    3   | 2
    4   | 3
    

    Read: 8 occurrences of "ad_id is unique", 7 occurrences of "2 rows with same ad_id", ...

    db<>fiddle here