How to use two aggregate functions (for example double string_agg
or just sum
), but ensure that the results do not contain duplicates caused by the other aggregate function (caused by second join)? I use PostgreSQL.
I have three tables:
create table boxes
(
id bigserial primary key,
name varchar(255)
);
create table animals
(
id bigserial primary key,
name varchar(255),
age numeric,
box_id bigint constraint animals_boxes_id references boxes
);
create table vegetables
(
id bigserial primary key,
name varchar(255),
weight numeric,
box_id bigint constraint vegatables_box_id references boxes
);
Some input data:
insert into boxes (name) values ('First box');
insert into animals (box_id, name, age) values (1, 'Cat', 2);
insert into animals (box_id, name, age) values (1, 'Cat', 3);
insert into animals (box_id, name, age) values (1, 'Dog', 5);
insert into vegetables (box_id, name, weight) values (1, 'Tomato', 20);
insert into vegetables (box_id, name, weight) values (1, 'Cucumber', 30);
insert into vegetables (box_id, name, weight) values (1, 'Potato', 50);
And I want to get animal names in boxes:
select b.name as box_name,
string_agg(a.name, ', ' order by a.id) as animal_names
from boxes as b
left join animals a on b.id = a.box_id
group by b.name;
It works:
box_name | animal_names |
---|---|
First box | Cat, Cat, Dog |
But I also want to get vegetable names. But it doesn't work:
select b.name as box_name,
string_agg(a.name, ', ' order by a.id) as animal_names,
string_agg(v.name, ', ' order by v.id) as vegatable_names
from boxes as b
left join animals a on b.id = a.box_id
left join vegetables v on b.id = v.box_id
group by b.name;
It produces duplicates in animal names and vegetable names:
box_name | animal_names | vegatable_names |
---|---|---|
First box | Cat, Cat, Cat, Cat, Cat, Cat, Dog, Dog, Dog | Tomato, Tomato, Tomato, Cucumber, Cucumber, Cucumber, Potato, Potato, Potato |
Result should be:
box_name | animal_names | vegatable_names |
---|---|---|
First box | Cat, Cat, Dog | Tomato, Cucumber, Potato |
I can't just add distinct
to remove duplicates because:
Cat
). If I use distinct
it produces Cat, Dog
instead of Cat, Cat, Dog
.order by
in string_agg
(it cause ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list
when added distinct
). Even if I remove order by
(string_agg(distinct a.name, ', ')
) I can't use it because first point.It applies to all aggregates function: string_agg
, array_agg
, json_object_agg
and even sum
.
Sum age of animals:
select sum(a.age)
from boxes as b
left join animals a on b.id = a.box_id
-- left join vegetables v on b.id = v.box_id
group by b.name;
Without second join it calculates correctly (10
) but with it wrong (30
) - due to duplicates.
The basic problem is explained here:
For a small selection, aggregation per row is typically faster.
With LATERAL
subqueries (more versatile):
SELECT b.name AS box_name, a.*, v.*
FROM boxes b
LEFT JOIN LATERAL (
SELECT string_agg(a.name, ', ' ORDER BY a.id) AS animal_names
FROM animals a
WHERE a.box_id = b.id
) a ON true
LEFT JOIN LATERAL (
SELECT string_agg(v.name, ', ' ORDER BY v.id) AS vegetable_names
FROM vegetables v
WHERE v.box_id = b.id
) v ON true;
Or with correlated subqueries (simpler, often a bit faster):
SELECT b.name AS box_name
, (SELECT string_agg(a.name, ', ' ORDER BY a.id)
FROM animals a
WHERE a.box_id = b.id) AS animal_names
, (SELECT string_agg(v.name, ', ' ORDER BY v.id)
FROM vegetables v
WHERE v.box_id = b.id) AS vegetable_names
FROM boxes b;
See:
While aggregating the whole table, this is faster:
SELECT b.name AS box_name, a.animal_names, v.vegetable_names
FROM boxes b
LEFT JOIN (
SELECT box_id, string_agg(a.name, ', ') AS animal_names
FROM (
SELECT box_id, id, name
FROM animals a
ORDER BY box_id, id
) a
GROUP BY 1
) a ON a.box_id = b.id
LEFT JOIN (
SELECT box_id, string_agg(v.name, ', ') AS vegetable_names
FROM (
SELECT box_id, id, name
FROM vegetables v
ORDER BY box_id, id
) v
GROUP BY 1
) v ON v.box_id = b.id;
Note how I sort in a subquery, which is typically faster than doing a per-aggregate sort. An optional optimization.
Aside: about varchar(255)
in your test setup: