TL;DR I can't figure out how to write a recursive Postgres query that doesn't use aggregate functions in its recursive part. Is there an alternative way to write the recursive query shown below?
Let's say we have some sports:
CREATE TABLE sports (id INTEGER, name TEXT);
INSERT INTO sports VALUES (1, '100 meter sprint');
INSERT INTO sports VALUES (2, '400 meter sprint');
INSERT INTO sports VALUES (3, '50 meter swim');
INSERT INTO sports VALUES (4, '100 meter swim');
And some lap times for athletes competing in those sports:
CREATE TABLE lap_times (sport_id INTEGER, athlete TEXT, seconds NUMERIC);
INSERT INTO lap_times VALUES (1, 'Alice', 10);
INSERT INTO lap_times VALUES (1, 'Bob', 11);
INSERT INTO lap_times VALUES (1, 'Claire', 12);
INSERT INTO lap_times VALUES (2, 'Alice', 40);
INSERT INTO lap_times VALUES (2, 'Bob', 38);
INSERT INTO lap_times VALUES (2, 'Claire', 39);
INSERT INTO lap_times VALUES (3, 'Alice', 25);
INSERT INTO lap_times VALUES (3, 'Bob', 23);
INSERT INTO lap_times VALUES (3, 'Claire', 24);
INSERT INTO lap_times VALUES (4, 'Alice', 65);
INSERT INTO lap_times VALUES (4, 'Bob', 67);
INSERT INTO lap_times VALUES (4, 'Claire', 66);
We want to create some arbitrary categories:
CREATE TABLE categories (id INTEGER, name TEXT);
INSERT INTO categories VALUES (1, 'Running');
INSERT INTO categories VALUES (2, 'Swimming');
INSERT INTO categories VALUES (3, '100 meter');
And make our sports members of those categories:
CREATE TABLE memberships (category_id INTEGER, member_type TEXT, member_id INTEGER);
INSERT INTO memberships VALUES (1, 'Sport', 1);
INSERT INTO memberships VALUES (1, 'Sport', 2);
INSERT INTO memberships VALUES (2, 'Sport', 3);
INSERT INTO memberships VALUES (2, 'Sport', 4);
INSERT INTO memberships VALUES (3, 'Sport', 1);
INSERT INTO memberships VALUES (3, 'Sport', 4);
And we want a 'super' category that contains other categories:
INSERT INTO categories VALUES (4, 'Running + Swimming');
INSERT INTO memberships VALUES (4, 'Category', 1);
INSERT INTO memberships VALUES (4, 'Category', 2);
Now comes the tricky bit.
We want to rank our athletes on their lap times for each sport:
SELECT sport_id, athlete,
RANK() over(PARTITION BY sport_id ORDER BY seconds)
FROM lap_times lt;
But we also want to do this at a category level. When we do, the rank for the athlete should be based on their average rank across all sports in that category. For example:
Alice is 1st in 100 meter sprint and 3rd in 400 meter sprint
-> average rank: 2
Bob is 2nd in 100 meter sprint and 1st in 400 meter sprint
-> average rank: 1.5
Claire is 3rd in 100 meter sprint and 2nd in 400 meter sprint
-> average rank: 2.5
Ranking for running: 1st Bob, 2nd Alice, 3rd Claire
And for 'super' categories, the rank for the athlete should be based on their average rank across categories, NOT the underlying sports within those categories. i.e. it should only consider its direct children, rather than expanding out all the sports.
I tried my best to write a query to calculate these rankings. It's a recursive query that starts at the bottom with the sports and works its way up through memberships to calculate rankings for categories and 'super' categories. Here's my query:
WITH RECURSIVE rankings(rankable_type, rankable_id, athlete, value, rank) AS (
SELECT 'Sport', sport_id, athlete, seconds, RANK() over(PARTITION BY sport_id ORDER BY seconds)
FROM lap_times lt
UNION ALL
SELECT 'Category', category_id, athlete, avg(r.rank), RANK() OVER (PARTITION by category_id ORDER BY avg(r.rank))
FROM categories c
JOIN memberships m ON m.category_id = c.id
JOIN rankings r ON r.rankable_type = m.member_type AND r.rankable_id = m.member_id
GROUP BY category_id, athlete
)
SELECT * FROM rankings;
However, when I run it, I receive the following error:
ERROR: aggregate functions are not allowed in a recursive query's recursive term
This is cause by avg(r.rank)
in the recursive part of the query. Postgresql doesn't allow aggregate functions to be called in the recursive part of the query. Is there an alternative way to write this?
If I swap avg(r.rank), RANK() ...
out for NULL, NULL
the query executes and the results look correct for sports and it includes the expected number of rows for categories.
I thought about maybe trying to unwind the recursion to two or three levels using nested queries as that would be fine for my use case, but I thought I'd ask here first before trying that.
Another alternative might be to change the schema so it's less flexible so that sports cannot belong to multiple categories. I'm not sure how the query would look in that case, but it might be simpler?
Thanks in advance, I really appreciate it.
It's not pretty, but I found a solution:
WITH RECURSIVE rankings(rankable_type, rankable_id, athlete, value, rank) AS (
SELECT 'Sport', sport_id, athlete, seconds, RANK() over(PARTITION BY sport_id ORDER BY seconds)
FROM lap_times lt
UNION ALL
SELECT 'Category', *, rank() OVER(PARTITION by category_id ORDER BY avg_rank) FROM (
SELECT DISTINCT category_id, athlete, avg(r.rank) OVER (PARTITION by category_id, athlete) AS avg_rank
FROM categories c
JOIN memberships m ON m.category_id = c.id
JOIN rankings r ON r.rankable_type = m.member_type AND r.rankable_id = m.member_id
) _
)
SELECT * FROM rankings;
In the recursive part of the query, instead of calling GROUP BY
and calculating avg(r.rank)
, I use a window function partitioned on the same columns. This has the same effect of calculating the average rank.
One downside is that this calculation happens more times than is necessary. If we could GROUP BY
then avg(r.rank)
, that would be more efficient than avg(r.rank)
then GROUP BY
.
Since there are now duplicates in the result of the nested query, I'm using DISTINCT
to filter these out and then the outer query calculates a RANK()
of all athletes in each category_id
based on these averages.
I'd still be keen to hear if anyone knows of a better way to do this. Thanks