Search code examples
sqlpostgresqlleft-joinaggregate-functions

SQL: two aggregate functions without duplicates


How to use two aggregate functions (for example double string_agg or just sum), but ensure that the results do not contain duplicates caused by the other aggregate function (caused by second join)? I use PostgreSQL.

Example

I have three tables:

create table boxes
(
    id   bigserial primary key,
    name varchar(255)
);

create table animals
(
    id     bigserial primary key,
    name   varchar(255),
    age    numeric,
    box_id bigint constraint animals_boxes_id references boxes
);

create table vegetables
(
    id     bigserial primary key,
    name   varchar(255),
    weight numeric,
    box_id bigint constraint vegatables_box_id references boxes
);

Some input data:

insert into boxes (name) values ('First box');
insert into animals (box_id, name, age) values (1, 'Cat', 2);
insert into animals (box_id, name, age) values (1, 'Cat', 3);
insert into animals (box_id, name, age) values (1, 'Dog', 5);
insert into vegetables (box_id, name, weight) values (1, 'Tomato', 20);
insert into vegetables (box_id, name, weight) values (1, 'Cucumber', 30);
insert into vegetables (box_id, name, weight) values (1, 'Potato', 50);

And I want to get animal names in boxes:

select b.name                                 as box_name,
       string_agg(a.name, ', ' order by a.id) as animal_names
from boxes as b
         left join animals a on b.id = a.box_id
group by b.name;

It works:

box_name animal_names
First box Cat, Cat, Dog

But I also want to get vegetable names. But it doesn't work:

select b.name                                 as box_name,
       string_agg(a.name, ', ' order by a.id) as animal_names,
       string_agg(v.name, ', ' order by v.id) as vegatable_names
from boxes as b
         left join animals a on b.id = a.box_id
         left join vegetables v on b.id = v.box_id
group by b.name;

It produces duplicates in animal names and vegetable names:

box_name animal_names vegatable_names
First box Cat, Cat, Cat, Cat, Cat, Cat, Dog, Dog, Dog Tomato, Tomato, Tomato, Cucumber, Cucumber, Cucumber, Potato, Potato, Potato

Result should be:

box_name animal_names vegatable_names
First box Cat, Cat, Dog Tomato, Cucumber, Potato

I can't just add distinct to remove duplicates because:

  • Names in tables can duplicates (two animals with name Cat). If I use distinct it produces Cat, Dog instead of Cat, Cat, Dog.
  • I use order by in string_agg (it cause ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list when added distinct). Even if I remove order by (string_agg(distinct a.name, ', ')) I can't use it because first point.

Additional info

It applies to all aggregates function: string_agg, array_agg, json_object_agg and even sum.

Sum age of animals:

select sum(a.age)
from boxes as b
         left join animals a on b.id = a.box_id
         -- left join vegetables v on b.id = v.box_id
group by b.name;

Without second join it calculates correctly (10) but with it wrong (30) - due to duplicates.


Solution

  • The basic problem is explained here:

    For a small selection, aggregation per row is typically faster.
    With LATERAL subqueries (more versatile):

    SELECT b.name AS box_name, a.*, v.*
    FROM   boxes b
    LEFT  JOIN LATERAL (
       SELECT string_agg(a.name, ', ' ORDER BY a.id) AS animal_names
       FROM   animals a
       WHERE  a.box_id = b.id
       ) a ON true
    LEFT  JOIN LATERAL (
       SELECT string_agg(v.name, ', ' ORDER BY v.id) AS vegetable_names
       FROM   vegetables v
       WHERE  v.box_id = b.id
       ) v ON true;
    

    Or with correlated subqueries (simpler, often a bit faster):

    SELECT b.name AS box_name
        , (SELECT string_agg(a.name, ', ' ORDER BY a.id)
           FROM   animals a
           WHERE  a.box_id = b.id)  AS animal_names
        , (SELECT string_agg(v.name, ', ' ORDER BY v.id)
           FROM   vegetables v
           WHERE  v.box_id = b.id) AS vegetable_names
    FROM   boxes b;
    

    See:

    While aggregating the whole table, this is faster:

    SELECT b.name AS box_name, a.animal_names, v.vegetable_names
    FROM   boxes b
    LEFT   JOIN (
       SELECT box_id, string_agg(a.name, ', ') AS animal_names   
       FROM  (
          SELECT box_id, id, name
          FROM   animals a
          ORDER  BY box_id, id
          ) a
       GROUP  BY 1
       ) a ON a.box_id = b.id
    LEFT   JOIN (
       SELECT box_id, string_agg(v.name, ', ') AS vegetable_names 
       FROM  (
          SELECT box_id, id, name
          FROM   vegetables v
          ORDER  BY box_id, id
          ) v
       GROUP  BY 1   
       ) v ON v.box_id = b.id;
    

    fiddle

    Note how I sort in a subquery, which is typically faster than doing a per-aggregate sort. An optional optimization.

    Aside: about varchar(255) in your test setup: