sql postgresql left-join aggregate-functions

SQL: two aggregate functions without duplicates

How to use two aggregate functions (for example double string_agg or just sum), but ensure that the results do not contain duplicates caused by the other aggregate function (caused by second join)? I use PostgreSQL.

Example

I have three tables:

create table boxes
(
    id   bigserial primary key,
    name varchar(255)
);

create table animals
(
    id     bigserial primary key,
    name   varchar(255),
    age    numeric,
    box_id bigint constraint animals_boxes_id references boxes
);

create table vegetables
(
    id     bigserial primary key,
    name   varchar(255),
    weight numeric,
    box_id bigint constraint vegatables_box_id references boxes
);

Some input data:

insert into boxes (name) values ('First box');
insert into animals (box_id, name, age) values (1, 'Cat', 2);
insert into animals (box_id, name, age) values (1, 'Cat', 3);
insert into animals (box_id, name, age) values (1, 'Dog', 5);
insert into vegetables (box_id, name, weight) values (1, 'Tomato', 20);
insert into vegetables (box_id, name, weight) values (1, 'Cucumber', 30);
insert into vegetables (box_id, name, weight) values (1, 'Potato', 50);

And I want to get animal names in boxes:

select b.name                                 as box_name,
       string_agg(a.name, ', ' order by a.id) as animal_names
from boxes as b
         left join animals a on b.id = a.box_id
group by b.name;

It works:

box_name	animal_names
First box	Cat, Cat, Dog

But I also want to get vegetable names. But it doesn't work:

select b.name                                 as box_name,
       string_agg(a.name, ', ' order by a.id) as animal_names,
       string_agg(v.name, ', ' order by v.id) as vegatable_names
from boxes as b
         left join animals a on b.id = a.box_id
         left join vegetables v on b.id = v.box_id
group by b.name;

It produces duplicates in animal names and vegetable names:

box_name	animal_names	vegatable_names
First box	Cat, Cat, Cat, Cat, Cat, Cat, Dog, Dog, Dog	Tomato, Tomato, Tomato, Cucumber, Cucumber, Cucumber, Potato, Potato, Potato

Result should be:

box_name	animal_names	vegatable_names
First box	Cat, Cat, Dog	Tomato, Cucumber, Potato

I can't just add distinct to remove duplicates because:

Names in tables can duplicates (two animals with name Cat). If I use distinct it produces Cat, Dog instead of Cat, Cat, Dog.
I use order by in string_agg (it cause ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list when added distinct). Even if I remove order by (string_agg(distinct a.name, ', ')) I can't use it because first point.

Additional info

It applies to all aggregates function: string_agg, array_agg, json_object_agg and even sum.

Sum age of animals:

select sum(a.age)
from boxes as b
         left join animals a on b.id = a.box_id
         -- left join vegetables v on b.id = v.box_id
group by b.name;

Without second join it calculates correctly (10) but with it wrong (30) - due to duplicates.

Solution

The basic problem is explained here:

Two SQL LEFT JOINS produce incorrect result

For a small selection, aggregation per row is typically faster.
With LATERAL subqueries (more versatile):

SELECT b.name AS box_name, a.*, v.*
FROM   boxes b
LEFT  JOIN LATERAL (
   SELECT string_agg(a.name, ', ' ORDER BY a.id) AS animal_names
   FROM   animals a
   WHERE  a.box_id = b.id
   ) a ON true
LEFT  JOIN LATERAL (
   SELECT string_agg(v.name, ', ' ORDER BY v.id) AS vegetable_names
   FROM   vegetables v
   WHERE  v.box_id = b.id
   ) v ON true;

Or with correlated subqueries (simpler, often a bit faster):

SELECT b.name AS box_name
    , (SELECT string_agg(a.name, ', ' ORDER BY a.id)
       FROM   animals a
       WHERE  a.box_id = b.id)  AS animal_names
    , (SELECT string_agg(v.name, ', ' ORDER BY v.id)
       FROM   vegetables v
       WHERE  v.box_id = b.id) AS vegetable_names
FROM   boxes b;

See:

While aggregating the whole table, this is faster:

SELECT b.name AS box_name, a.animal_names, v.vegetable_names
FROM   boxes b
LEFT   JOIN (
   SELECT box_id, string_agg(a.name, ', ') AS animal_names   
   FROM  (
      SELECT box_id, id, name
      FROM   animals a
      ORDER  BY box_id, id
      ) a
   GROUP  BY 1
   ) a ON a.box_id = b.id
LEFT   JOIN (
   SELECT box_id, string_agg(v.name, ', ') AS vegetable_names 
   FROM  (
      SELECT box_id, id, name
      FROM   vegetables v
      ORDER  BY box_id, id
      ) v
   GROUP  BY 1   
   ) v ON v.box_id = b.id;

fiddle

Note how I sort in a subquery, which is typically faster than doing a per-aggregate sort. An optional optimization.

Aside: about varchar(255) in your test setup:

Should I add an arbitrary length limit to VARCHAR columns?