Search code examples
sqlpostgresqlapache-superset

Take average from a delimited string in a single expression


I have a data table in Postgres containing a column age that contains people ages. Unfortunately this data is dirty, so it includes string entries, which might represent ranges, featuring all of these formats:

21
25-30
30-35 years

Using Apache Superset I am making a chart from this data. I can apply functions on this column, so for example I am using this expression to get the lower bound of ranges split_part(split_part("age",' ',1),'-', 1)::int.

However, the lower bound is not really representative of the data. I'd like when I have a range to get the average. I've tried:

avg(unnest(string_to_array(split_part("age",' ',1),'-')::int[]))

But I get:

aggregate function calls cannot contain set-returning function calls
LINE 1: SELECT avg(unnest(string_to_array(split_part("ageValue",' ',...
               ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.

This is available as a fiddle.

I've seen suggestions of how this can be done with a select or function, but I cannot use any of these in the Superset X-axis query. Can it be done with expressions only and if so - how?


Solution

  • Obviously, the silver bullet would be to clean up your data.

    For a quick fix with a single expression:

    SELECT *, CASE
               WHEN age ~ ' ' THEN (split_part(age, '-', 1)::float8 + split_part(split_part(age, ' ', 1), '-', 2)::float8) / 2
               WHEN age ~ '-' THEN (split_part(age, '-', 1)::float8 + split_part(           age         , '-', 2)::float8) / 2
               ELSE age::float8
              END AS avg_age
    FROM   data;
    

    With aggregation:

    SELECT id, age, avg(age_bound) AS avg_age
    FROM  (
       SELECT *, string_to_table(split_part(age, ' ', 1), '-')::float8 AS age_bound
       FROM   data
       ) sub
    GROUP  BY id, age
    ORDER  BY id;  -- optional
    

    fiddle

    string_to_table() was added with Postgres 14.