Search code examples
sqlnetezza

SQL: Counting the Number of Times a "Type" of Row Appears


I am working with Netezza SQL. I have the following table:

CREATE TABLE MY_TABLE (
    id VARCHAR(50),
    year VARCHAR(50)
);

INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2010');
INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2012'); 

INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2012');
INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2013');


INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2012');
INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2013');

  INSERT INTO MY_TABLE (id, year)
    VALUES ('127', '2011');
    INSERT INTO MY_TABLE (id, year)
    VALUES ('127', '2012');
    INSERT INTO MY_TABLE (id, year)
    VALUES ('127', '2015');

  INSERT INTO MY_TABLE (id, year)
    VALUES ('126', '2019');

My Question: I am trying to find out the number of times each combination of years appears. The final result would look like this:

       combination freq
1   2011,2012,2015    1
2   2011,2012,2013    2
3 2010, 2011, 2012    1
4             2019    1

Here is my attempt at doing this in Netezza:

WITH CTE AS (
    SELECT id, year, ROW_NUMBER() OVER (PARTITION BY id ORDER BY year) AS rn
    FROM MY_TABLE
),
CTE2 AS (
    SELECT id, MAX(rn) AS max_rn
    FROM CTE
    GROUP BY id
),
CTE3 AS (
    SELECT CTE2.id, CTE.year, CTE.rn, CTE2.max_rn
    FROM CTE2
    JOIN CTE ON CTE2.id = CTE.id
),
CTE4 AS (
    SELECT id,
        MAX(CASE WHEN rn = 1 THEN year END) ||
        MAX(CASE WHEN rn = 2 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 3 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 4 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 5 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 6 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 7 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 8 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 9 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 10 THEN ',' || year END)
        AS combination
    FROM CTE3
    GROUP BY id
)
SELECT combination, COUNT(*) AS freq
FROM CTE4
GROUP BY combination
ORDER BY freq DESC;

But the calculations are completely wrong:

                                       combination freq
1                                              <NA>   94
2 2010,2012,2013,2014,2015,2016,2017,2018,2019,2020    1
3 2010,2011,2012,2013,2014,2015,2016,2017,2019,2020    1
4 2010,2011,2012,2013,2014,2016,2017,2018,2019,2020    1
5 2010,2011,2012,2014,2015,2016,2017,2018,2019,2020    1
6 2010,2011,2013,2014,2015,2016,2017,2018,2019,2020    1
7 2010,2011,2012,2013,2014,2015,2016,2017,2018,2019    1

Can someone please show me how to fix this?

Thanks!

  • Note 1: I am using rn = 10 to account for the min/max range of years (which in my case is 10). I only have years 2010, 2011, ... 2020 in my dataset. Based on this information - is it possible to write a manual solution for individual years using UNION ALL style statements?

  • Note 2: Unfortunately, there is no GROUP_CONCAT function in Netezza - otherwise the following code would have worked:

Here is a sample:

  SELECT GROUP_CONCAT(year, ',' ORDER BY year) AS combination, COUNT(*) AS freq
    FROM my_data
    GROUP BY id
    ORDER BY freq DESC;
  • Note 3: Here is R code to accomplish a similar task:

Source: Counting Number of Unique Column Values Per Group

agg <- aggregate(year ~ id, my_table, paste, collapse = ", ")
final = as.data.frame(table(agg$year))

Solution

  • Since Netezza lacks LISTAGG, STRING_AGG, and GROUP_CONCAT, this is a bit more challenging.

    Reading that Netezza software was based on PostgreSQL 7.2, I created this against 9.6 -- the oldest available on dbfiddle.uk.

    CREATE TABLE MY_TABLE (
        id VARCHAR(50),
        year VARCHAR(50)
    );
    
    INSERT INTO MY_TABLE (id, year) VALUES ('123', '2010');
    INSERT INTO MY_TABLE (id, year) VALUES ('123', '2011');
    INSERT INTO MY_TABLE (id, year) VALUES ('123', '2012'); 
    
    INSERT INTO MY_TABLE (id, year) VALUES ('124', '2011');
    INSERT INTO MY_TABLE (id, year) VALUES ('124', '2012');
    INSERT INTO MY_TABLE (id, year) VALUES ('124', '2013');
    
    INSERT INTO MY_TABLE (id, year) VALUES ('125', '2011');
    INSERT INTO MY_TABLE (id, year) VALUES ('125', '2012');
    INSERT INTO MY_TABLE (id, year) VALUES ('125', '2013');
    
    INSERT INTO MY_TABLE (id, year) VALUES ('127', '2011');
    INSERT INTO MY_TABLE (id, year) VALUES ('127', '2012');
    INSERT INTO MY_TABLE (id, year) VALUES ('127', '2015');
    
    INSERT INTO MY_TABLE (id, year) VALUES ('126', '2019');
    
    
    WITH CTE AS (
      SELECT id, year, ROW_NUMBER() OVER (PARTITION BY id ORDER BY year) AS rn
      FROM MY_TABLE
    ),
    CTE3 AS (
      select id
      , year
      , rn
      , max(rn) over (partition by id) as MaxRN
      from cte
    ),
    CTE4 AS (  
      SELECT id
      , coalesce(max(CASE WHEN rn = 1 THEN year END), '') ||
        coalesce(max(CASE WHEN rn = 2  THEN ',' || year END), '') ||
        coalesce(max(CASE WHEN rn = 3  THEN ',' || year END), '') ||
        coalesce(max(CASE WHEN rn = 4  THEN ',' || year END), '') ||
        coalesce(max(CASE WHEN rn = 5  THEN ',' || year END), '') ||
        coalesce(max(CASE WHEN rn = 6  THEN ',' || year END), '') ||
        coalesce(max(CASE WHEN rn = 7  THEN ',' || year END), '') ||
        coalesce(max(CASE WHEN rn = 8  THEN ',' || year END), '') ||
        coalesce(max(CASE WHEN rn = 9  THEN ',' || year END), '') ||
        coalesce(max(CASE WHEN rn = 10 THEN ',' || year END), '')
        AS combination
      FROM CTE3
      GROUP BY id
    )
    
    SELECT combination
    , count(id) as freq
    FROM CTE4
    GROUP BY combination
    

    COALESCE() is needed because SELECT 'a' || NULL returns NULL. Without this the result would be

    combination freq
    null 5