I have a PostgreSQL query that aggregates data from multiple tables and generates a report. Currently it works perfectly fine and it is quite simple. Here is the query I'm using:
SELECT
a.name AS account_name,
c.name AS campaign_name,
ad.name AS ad_name,
g.cpm AS cpm,
SUM(g.impressions) AS impressions,
SUM(g.clicks) AS clicks,
SUM(g.conversions) AS conversions,
SUM(g.impressions) * g.cpm AS cost,
SUM(g.revenue) AS order_revenue
FROM google_analytics g
JOIN ad ad ON ad.id = g.ad_id
JOIN campaign c ON c.id = ad.campaign_id
JOIN account a ON a.id = c.account_id
WHERE g.created_at >= ? AND g.created_at <= ?
GROUP BY 1, 2, 3, 4
ORDER BY 2
And my table schema:
CREATE TABLE public.google_analytics (
created_at TIMESTAMP WITHOUT TIME ZONE NULL,
ad_id INTEGER NULL,
impressions INTEGER NULL,
clicks INTEGER NULL,
conversions INTEGER NULL,
ecpm INTEGER NULL,
commission SMALLINT NULL,
revenue INTEGER NULL
);
The current report aggregates data properly with the SUM and GROUP BY clauses, and works as expected. However, the application code extends this report by handling additional calculations and aggregations, such as rollups, averaging certain fields, and building a hierarchical structure.
For example, the application generates a report like this:
account_name | campaign_name | ad_name | cpm | impressions | clicks | conversions | cost | order_revenue |
---|---|---|---|---|---|---|---|---|
Account A | 1.75 CPM | 2200 | 20 | 2 | 1.45 | $150 | ||
Campaign A | Ad 1 | 1.50 CPM | 1000 | 10 | 1 | 0.50 | $100 | |
Campaign A | Ad 2 | 2.00 CPM | 1200 | 10 | 1 | 2.40 | $50 |
In this structure:
The application code handling these calculations is very large and complex, making it difficult to maintain. I am planning to replace this application logic with a SQL-based solution to simplify maintenance and improve performance.
I have experimented with the ROLLUP and CUBE features in PostgreSQL, and I got excited about the capabilities they offer. These features seem to perform much of the grunt work that the application currently does. My initial results were very promising and I got the very same tree-like structure for summing up all the metrics.
However, I am unsure if PostgreSQL's built-in rollup supports averaging and summing at different levels of the hierarchy. My goal is to use these features to replace the application code and handle the calculations directly within the database.
For instance:
In order to contextualize even more, and sake of understanding the tree, this is the final JSON I am planning to generate:
[
{
"name": "Account A",
"impressions": 236797,
"cpm": 2.48,
"children": [
{
"name": "Campaign A",
"impressions": 236797,
"cpm": 2.48,
"children": [
{
"name": "Ad 1",
"impressions": 236797,
"cpm": 2.48,
"children": []
}
]
}
]
}
]
I am unsure if PostgreSQL's built-in rollup supports averaging and summing at different levels of the hierarchy.
It does: there's a grouping
function that tells you which "level" you're on, and based on that you can alternate between the AVG
and SUM
the way you wanted: demo at db<>fiddle
SELECT
grouping(a.name, c.name, ad.name) as grouping,
grouping(a.name, c.name, ad.name)::int::bit(3) as grouping_bitmask,
a.name AS account_name,
c.name AS campaign_name,
ad.name AS ad_name,
CASE grouping(a.name, c.name, ad.name)<>0
WHEN true THEN AVG(g.cpm)
ELSE SUM(g.cpm)
END AS cpm,
AVG(g.cpm) AS avg_cpm,
SUM(g.cpm) AS sum_cpm,
SUM(g.impressions) AS impressions,
SUM(g.clicks) AS clicks,
SUM(g.conversions) AS conversions,
CASE grouping(a.name, c.name, ad.name)<>0
WHEN true THEN AVG(g.impressions*g.cpm)
ELSE SUM(g.impressions*g.cpm)
END AS cost,
AVG(g.impressions*g.cpm) AS avg_cost,
SUM(g.impressions*g.cpm) AS sum_cost,
SUM(g.revenue) AS order_revenue
FROM google_analytics g
JOIN ad ad ON ad.id = g.ad_id
JOIN campaign c ON c.id = ad.campaign_id
JOIN account a ON a.id = c.account_id
WHERE g.created_at >= (now()-'1 month'::interval)
AND g.created_at <= (now()-'2 weeks'::interval)
GROUP BY ROLLUP(account_name, campaign_name, ad_name)
ORDER BY account_name, campaign_name, ad_name;
The avg_cpm
and sum_cpm
next to cpm
(same for cost
) shows you how that case
alternates between them on different levels based on the grouping
:
grouping | grouping_bitmask | account_name | campaign_name | ad_name | cpm | avg_cpm | sum_cpm | impressions | clicks | conversions | cost | avg_cost | sum_cost | order_revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000 | account_1 | campaign_1_account_1 | ad_1_campaign_1_account_1 | 1600 | 1600.0 | 1600 | 7 | 19 | 1 | 11200 | 11200.0 | 11200 | 1000 |
1 | 001 | account_1 | campaign_1_account_1 | null | 1600.0 | 1600.0 | 1600 | 7 | 19 | 1 | 11200.0 | 11200.0 | 11200 | 1000 |
0 | 000 | account_1 | campaign_2_account_1 | ad_1_campaign_2_account_1 | 800 | 800.0 | 800 | 15 | 11 | 19 | 12000 | 12000.0 | 12000 | 300 |
0 | 000 | account_1 | campaign_2_account_1 | ad_2_campaign_2_account_1 | 400 | 400.0 | 400 | 0 | 5 | 13 | 0 | 0.00000 | 0 | 200 |
1 | 001 | account_1 | campaign_2_account_1 | null | 600.0 | 600.0 | 1200 | 15 | 16 | 32 | 6000.0 | 6000.0 | 12000 | 500 |
3 | 011 | account_1 | null | null | 933.33 | 933.33 | 2800 | 22 | 35 | 33 | 7733.33 | 7733.33 | 23200 | 1500 |
0 | 000 | account_2 | campaign_2_account_2 | ad_1_campaign_2_account_2 | 2500 | 2500.0 | 2500 | 18 | 19 | 7 | 45000 | 45000.0 | 45000 | 400 |
1 | 001 | account_2 | campaign_2_account_2 | null | 2500.0 | 2500.0 | 2500 | 18 | 19 | 7 | 45000.0 | 45000.0 | 45000 | 400 |
3 | 011 | account_2 | null | null | 2500.0 | 2500.0 | 2500 | 18 | 19 | 7 | 45000.0 | 45000.0 | 45000 | 400 |
7 | 111 | null | null | null | 1325.0 | 1325.0 | 5300 | 40 | 54 | 40 | 17050.0 | 17050.0 | 68200 | 1900 |
That being said, it seems that you could just plug in a /count(*)
to your SUM
s: in individual rows, that'll be 1, so you'll keep getting the sum. On higher levels, it'll jump up to how many rows got aggregated, and SUM
divided by COUNT
will result in an AVG
:
SELECT
a.name AS account_name,
c.name AS campaign_name,
ad.name AS ad_name,
SUM(g.cpm)/count(*) AS cpm,
SUM(g.impressions) AS impressions,
SUM(g.clicks) AS clicks,
SUM(g.conversions) AS conversions,
SUM(g.impressions*g.cpm)/count(*) AS cost,
SUM(g.revenue) AS order_revenue
FROM google_analytics g
JOIN ad ad ON ad.id = g.ad_id
JOIN campaign c ON c.id = ad.campaign_id
JOIN account a ON a.id = c.account_id
WHERE g.created_at >= (now()-'1 month'::interval)
AND g.created_at <= (now()-'2 weeks'::interval)
GROUP BY ROLLUP(account_name, campaign_name, ad_name)
ORDER BY account_name, campaign_name, ad_name;