I have a data table in Postgres containing a column age
that contains people ages. Unfortunately this data is dirty, so it includes string entries, which might represent ranges, featuring all of these formats:
21
25-30
30-35 years
Using Apache Superset I am making a chart from this data. I can apply functions on this column, so for example I am using this expression to get the lower bound of ranges split_part(split_part("age",' ',1),'-', 1)::int
.
However, the lower bound is not really representative of the data. I'd like when I have a range to get the average. I've tried:
avg(unnest(string_to_array(split_part("age",' ',1),'-')::int[]))
But I get:
aggregate function calls cannot contain set-returning function calls
LINE 1: SELECT avg(unnest(string_to_array(split_part("ageValue",' ',...
^
HINT: You might be able to move the set-returning function into a LATERAL FROM item.
This is available as a fiddle.
I've seen suggestions of how this can be done with a select or function, but I cannot use any of these in the Superset X-axis query. Can it be done with expressions only and if so - how?
Obviously, the silver bullet would be to clean up your data.
For a quick fix with a single expression:
SELECT *, CASE
WHEN age ~ ' ' THEN (split_part(age, '-', 1)::float8 + split_part(split_part(age, ' ', 1), '-', 2)::float8) / 2
WHEN age ~ '-' THEN (split_part(age, '-', 1)::float8 + split_part( age , '-', 2)::float8) / 2
ELSE age::float8
END AS avg_age
FROM data;
With aggregation:
SELECT id, age, avg(age_bound) AS avg_age
FROM (
SELECT *, string_to_table(split_part(age, ' ', 1), '-')::float8 AS age_bound
FROM data
) sub
GROUP BY id, age
ORDER BY id; -- optional
string_to_table()
was added with Postgres 14.