sql database postgresql group-by aggregate-functions

Aggregated rollup in PostgreSQL with avg and sum in different levels?

I have a PostgreSQL query that aggregates data from multiple tables and generates a report. Currently it works perfectly fine and it is quite simple. Here is the query I'm using:

SELECT
  a.name AS account_name,
  c.name AS campaign_name,
  ad.name AS ad_name,
  g.cpm AS cpm,
  SUM(g.impressions) AS impressions,
  SUM(g.clicks) AS clicks,
  SUM(g.conversions) AS conversions,
  SUM(g.impressions) * g.cpm AS cost,
  SUM(g.revenue) AS order_revenue
FROM google_analytics g
JOIN ad ad ON ad.id = g.ad_id
JOIN campaign c ON c.id = ad.campaign_id
JOIN account a ON a.id = c.account_id
WHERE g.created_at >= ? AND g.created_at <= ?
GROUP BY 1, 2, 3, 4
ORDER BY 2

And my table schema:

CREATE TABLE public.google_analytics (
  created_at TIMESTAMP WITHOUT TIME ZONE NULL,
  ad_id INTEGER NULL,
  impressions INTEGER NULL,
  clicks INTEGER NULL,
  conversions INTEGER NULL,
  ecpm INTEGER NULL,
  commission SMALLINT NULL,
  revenue INTEGER NULL
);

The current report aggregates data properly with the SUM and GROUP BY clauses, and works as expected. However, the application code extends this report by handling additional calculations and aggregations, such as rollups, averaging certain fields, and building a hierarchical structure.

For example, the application generates a report like this:

account_name	campaign_name	ad_name	cpm	impressions	clicks	conversions	cost	order_revenue
Account A			1.75 CPM	2200	20	2	1.45	$150
	Campaign A	Ad 1	1.50 CPM	1000	10	1	0.50	$100
	Campaign A	Ad 2	2.00 CPM	1200	10	1	2.40	$50

In this structure:

Impressions, clicks, and revenue are summed
CPM and cost are averaged
The data is grouped hierarchically by account, campaign, and ad

The application code handling these calculations is very large and complex, making it difficult to maintain. I am planning to replace this application logic with a SQL-based solution to simplify maintenance and improve performance.

I have experimented with the ROLLUP and CUBE features in PostgreSQL, and I got excited about the capabilities they offer. These features seem to perform much of the grunt work that the application currently does. My initial results were very promising and I got the very same tree-like structure for summing up all the metrics.

However, I am unsure if PostgreSQL's built-in rollup supports averaging and summing at different levels of the hierarchy. My goal is to use these features to replace the application code and handle the calculations directly within the database.

For instance:

I was unable to calculate the average instead of the sum: how can the rollup use average for a column, and sum for another column? It seems to always sum things
I was unable to calculate the sum of cost, since cost is already a calculated field

In order to contextualize even more, and sake of understanding the tree, this is the final JSON I am planning to generate:

[
  {
    "name": "Account A",
    "impressions": 236797,
    "cpm": 2.48,
    "children": [
      {
        "name": "Campaign A",
        "impressions": 236797,
        "cpm": 2.48,
        "children": [
          {
            "name": "Ad 1",
            "impressions": 236797,
            "cpm": 2.48,
            "children": []
          }
        ]
      }
    ]
  }
]

Solution

I am unsure if PostgreSQL's built-in rollup supports averaging and summing at different levels of the hierarchy.

It does: there's a grouping function that tells you which "level" you're on, and based on that you can alternate between the AVG and SUM the way you wanted: _{demo at db<>fiddle}

SELECT
  grouping(a.name, c.name, ad.name) as grouping,
  grouping(a.name, c.name, ad.name)::int::bit(3) as grouping_bitmask,
  a.name AS account_name,
  c.name AS campaign_name,
  ad.name AS ad_name,
  CASE grouping(a.name, c.name, ad.name)<>0
    WHEN true THEN AVG(g.cpm)
    ELSE SUM(g.cpm) 
  END AS cpm,
  AVG(g.cpm) AS avg_cpm,
  SUM(g.cpm) AS sum_cpm,
  SUM(g.impressions) AS impressions,
  SUM(g.clicks) AS clicks,
  SUM(g.conversions) AS conversions,
  CASE grouping(a.name, c.name, ad.name)<>0
    WHEN true THEN AVG(g.impressions*g.cpm)
    ELSE SUM(g.impressions*g.cpm)
  END AS cost,
  AVG(g.impressions*g.cpm) AS avg_cost,
  SUM(g.impressions*g.cpm) AS sum_cost,
  SUM(g.revenue) AS order_revenue
FROM google_analytics g
JOIN ad ad ON ad.id = g.ad_id
JOIN campaign c ON c.id = ad.campaign_id
JOIN account a ON a.id = c.account_id
WHERE g.created_at >= (now()-'1 month'::interval) 
  AND g.created_at <= (now()-'2 weeks'::interval) 
GROUP BY ROLLUP(account_name, campaign_name, ad_name)
ORDER BY account_name, campaign_name, ad_name;

The avg_cpm and sum_cpm next to cpm (same for cost) shows you how that case alternates between them on different levels based on the grouping:

grouping	grouping_bitmask	account_name	campaign_name	ad_name	cpm	avg_cpm	sum_cpm	impressions	clicks	conversions	cost	avg_cost	sum_cost	order_revenue
0	000	account_1	campaign_1_account_1	ad_1_campaign_1_account_1	1600	1600.0	1600	7	19	1	11200	11200.0	11200	1000
1	001	account_1	campaign_1_account_1	null	1600.0	1600.0	1600	7	19	1	11200.0	11200.0	11200	1000
0	000	account_1	campaign_2_account_1	ad_1_campaign_2_account_1	800	800.0	800	15	11	19	12000	12000.0	12000	300
0	000	account_1	campaign_2_account_1	ad_2_campaign_2_account_1	400	400.0	400	0	5	13	0	0.00000	0	200
1	001	account_1	campaign_2_account_1	null	600.0	600.0	1200	15	16	32	6000.0	6000.0	12000	500
3	011	account_1	null	null	933.33	933.33	2800	22	35	33	7733.33	7733.33	23200	1500
0	000	account_2	campaign_2_account_2	ad_1_campaign_2_account_2	2500	2500.0	2500	18	19	7	45000	45000.0	45000	400
1	001	account_2	campaign_2_account_2	null	2500.0	2500.0	2500	18	19	7	45000.0	45000.0	45000	400
3	011	account_2	null	null	2500.0	2500.0	2500	18	19	7	45000.0	45000.0	45000	400
7	111	null	null	null	1325.0	1325.0	5300	40	54	40	17050.0	17050.0	68200	1900

That being said, it seems that you could just plug in a /count(*) to your SUMs: in individual rows, that'll be 1, so you'll keep getting the sum. On higher levels, it'll jump up to how many rows got aggregated, and SUM divided by COUNT will result in an AVG:

SELECT
  a.name AS account_name,
  c.name AS campaign_name,
  ad.name AS ad_name,
  SUM(g.cpm)/count(*) AS cpm,
  SUM(g.impressions) AS impressions,
  SUM(g.clicks) AS clicks,
  SUM(g.conversions) AS conversions,
  SUM(g.impressions*g.cpm)/count(*) AS cost,
  SUM(g.revenue) AS order_revenue
FROM google_analytics g
JOIN ad ad ON ad.id = g.ad_id
JOIN campaign c ON c.id = ad.campaign_id
JOIN account a ON a.id = c.account_id
WHERE g.created_at >= (now()-'1 month'::interval) 
  AND g.created_at <= (now()-'2 weeks'::interval) 
GROUP BY ROLLUP(account_name, campaign_name, ad_name)
ORDER BY account_name, campaign_name, ad_name;