I want to write a query that groups rows by having overlapping arrays. Consider the following example:
id | name | family_ids
--------------------------
1 | Alice | [f1, f2]
2 | Bob | [f1]
3 | Freddy | [f2, f3]
4 | James | [f3]
5 | Joe | [f4, f5]
6 | Tim | [f5]
Alice and Bob are part of the same family, f1
. In Alice's family in law (f2
), she's also related to Freddy. And considering Freddy's family in law (f3
), James is also related to them.
So, basically, I want to group by the arrays in family_ids
having any overlap at all. But, notice that f2 -> f3
should also be discovered, which is not possible with 1 simple group by
query.
ids | family_ids
----------------------------
[1, 2, 3, 4] | [f1, f2, f3]
[5, 6] | [f4, f5]
I've been playing around a lot with inner join
s, group by t1.family_ids && t2.family_ids
but can't seem to find a good performing solution. The table right now has a size of ~100k rows. In the future, this table will grow towards ~500k-1M rows.
This is a graph-walking problem.
A common approach is to unnest the arrays to generate the nodes, then self-join on matching families to compute all edges. We can then use a recursive query to traverse the graph, while taking care of not visiting the same node twice, and then aggregate to generate groups. The final steps is to recover the corresponding family ids.
with recursive
nodes as (
select t.id, x.family_id
from mytable t
cross join lateral unnest(t.family_ids) as x(family_id)
),
edges as (
select n1.id as id1, n2.id as id2
from nodes n1
inner join nodes n2 using (family_id)
),
cte as (
select id1, id2, array[id1] as visited
from edges
where id1 = id2
union all
select c.id1, e.id2, c.visited || e.id2
from cte c
inner join edges e on e.id1 = c.id2
where e.id2 <> all(c.visited)
),
res as (
select id1, array_agg(distinct id2 order by id2) as id2s
from cte
group by id1
)
select
array_agg(distinct n.id order by n.id) as ids,
array_agg(distinct n.family_id order by n.family_id) as family_ids
from res r
inner join nodes n on n.id = r.id1
group by r.id2s
ids | family_ids :-------- | :--------- {1,2,3,4} | {f1,f2,f3} {5,6} | {f4,f5}