sql arrays postgresql graph-theory recursive-query

Group by array overlap in PostgreSQL

I want to write a query that groups rows by having overlapping arrays. Consider the following example:

 id | name    | family_ids
--------------------------
 1  | Alice   | [f1, f2]
 2  | Bob     | [f1]
 3  | Freddy  | [f2, f3]
 4  | James   | [f3]
 5  | Joe     | [f4, f5]
 6  | Tim     | [f5]

Alice and Bob are part of the same family, f1. In Alice's family in law (f2), she's also related to Freddy. And considering Freddy's family in law (f3), James is also related to them.

So, basically, I want to group by the arrays in family_ids having any overlap at all. But, notice that f2 -> f3 should also be discovered, which is not possible with 1 simple group by query.

ids           | family_ids
----------------------------
[1, 2, 3, 4]  | [f1, f2, f3]
[5, 6]        | [f4, f5]

I've been playing around a lot with inner joins, group by t1.family_ids && t2.family_ids but can't seem to find a good performing solution. The table right now has a size of ~100k rows. In the future, this table will grow towards ~500k-1M rows.

Solution

This is a graph-walking problem.

A common approach is to unnest the arrays to generate the nodes, then self-join on matching families to compute all edges. We can then use a recursive query to traverse the graph, while taking care of not visiting the same node twice, and then aggregate to generate groups. The final steps is to recover the corresponding family ids.

with recursive
    nodes as (
        select t.id, x.family_id
        from mytable t
        cross join lateral unnest(t.family_ids) as x(family_id)
    ),
    edges as (
        select n1.id as id1, n2.id as id2
        from nodes n1
        inner join nodes n2 using (family_id)
    ),
    cte as (
        select id1, id2, array[id1] as visited 
        from edges
        where id1 = id2
        union all 
        select c.id1, e.id2, c.visited || e.id2
        from cte c
        inner join edges e on e.id1 = c.id2
        where e.id2 <> all(c.visited)
    ),
    res as (
        select id1, array_agg(distinct id2 order by id2) as id2s
        from cte
        group by id1
    )
select 
    array_agg(distinct n.id order by n.id) as ids, 
    array_agg(distinct n.family_id order by n.family_id) as family_ids
from res r
inner join nodes n on n.id = r.id1
group by r.id2s

Demo on DB Fiddlde:

ids       | family_ids
:-------- | :---------
{1,2,3,4} | {f1,f2,f3}
{5,6}     | {f4,f5}