Search code examples
sqldb2

Calculating Cumulative Percentages in SQL


I have this table (myt):

CREATE TABLE myt (
  name VARCHAR(50),
  food VARCHAR(50),
  d1 INT
);

INSERT INTO myt (name, food, d1) VALUES
('john', 'pizza', 2010),
('john', 'pizza', 2011),
('john', 'cake', 2012),
('tim', 'apples', 2015),
('david', 'apples', 2020),
('david', 'apples', 2021),
('alex', 'cookies', 2005),
('alex', 'cookies', 2006);

  name    food   d1 food_year
  john   pizza 2010      2010
  john   pizza 2011      2011
  john    cake 2012      2012
   tim  apples 2015      2015
 david  apples 2020      2020
 david  apples 2021      2021
  alex cookies 2005      2005
  alex cookies 2006      2006

I wrote the following query to find out the percentage breakdowns of each food per name:

WITH FoodCounts AS (
    SELECT name, 
           food, 
           COUNT(*) as food_count
    FROM myt
    GROUP BY name, food
),
TotalCounts AS (
    SELECT name, 
           COUNT(*) as total_count
    FROM myt
    GROUP BY name
)
SELECT fc.name, 
       fc.food, 
       (fc.food_count * 100.0) / tc.total_count as percentage
FROM FoodCounts fc
JOIN TotalCounts tc
ON fc.name = tc.name;


  name    food percentage
  alex cookies  100.00000
 david  apples  100.00000
  john    cake   33.33333
  john   pizza   66.66667
   tim  apples  100.00000

I am now trying to modify this query to find out the cumulative percentages. For example, as of 2011 - what was John's food breakdown? As of 2012 what was John's food breakdown?

I tried to write a series of CTEs using window functions to answer this:

WITH YearlyFoodCounts AS (
    SELECT name, 
           food, 
           food_year,
           COUNT(*) as food_count
    FROM myt
    GROUP BY name, food, food_year
),
CumulativeCounts AS (
    SELECT name, 
           food_year,
           SUM(food_count) OVER (PARTITION BY name ORDER BY food_year) as cumulative_count
    FROM YearlyFoodCounts
)
SELECT yfc.name, 
       yfc.food, 
       yfc.food_year,
       yfc.food_count,
       cc.cumulative_count,
       (yfc.food_count * 100.0) / cc.cumulative_count as percentage
FROM YearlyFoodCounts yfc
JOIN CumulativeCounts cc
ON yfc.name = cc.name AND yfc.food_year = cc.food_year
ORDER BY yfc.name, yfc.food_year;

The result seems to be in the correct format:

 name    food food_year food_count cumulative_count percentage
  alex cookies      2005          1                1  100.00000
  alex cookies      2006          1                2   50.00000
 david  apples      2020          1                1  100.00000
 david  apples      2021          1                2   50.00000
  john   pizza      2010          1                1  100.00000
  john   pizza      2011          1                2   50.00000
  john    cake      2012          1                3   33.33333
   tim  apples      2015          1                1  100.00000

Is this the correct way to approach this problem?


Solution

  • You are over-complicating this. It doesn't need joins or subqueries, you can do it in a single level using window functions. You can put a normal aggregate inside a window function, as window functions run after normal aggregation.

    Note:

    • Use ROWS UNBOUNDED PRECEDING because the default is RANGE UNBOUNDED PRECEDING which is subtly different.
    • You say there can only be one food per name, food_year pair. So you should aggregate by only those two columns.
    SELECT
        name, 
        MIN(food) AS food, 
        food_year,
        COUNT(*) as food_count,
        SUM(COUNT(*)) OVER (PARTITION BY name ORDER BY food_year ROWS UNBOUNDED PRECEDING) as cumulative_count,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY name ORDER BY food_year ROWS UNBOUNDED PRECEDING) as percentage
    FROM myt
    GROUP BY
        name,
        food_year;
    

    db<>fiddle