Search code examples
sqlarrayspostgresqlgraph-theoryrecursive-query

Group by array overlap in PostgreSQL


I want to write a query that groups rows by having overlapping arrays. Consider the following example:

 id | name    | family_ids
--------------------------
 1  | Alice   | [f1, f2]
 2  | Bob     | [f1]
 3  | Freddy  | [f2, f3]
 4  | James   | [f3]
 5  | Joe     | [f4, f5]
 6  | Tim     | [f5]

Alice and Bob are part of the same family, f1. In Alice's family in law (f2), she's also related to Freddy. And considering Freddy's family in law (f3), James is also related to them.

So, basically, I want to group by the arrays in family_ids having any overlap at all. But, notice that f2 -> f3 should also be discovered, which is not possible with 1 simple group by query.

ids           | family_ids
----------------------------
[1, 2, 3, 4]  | [f1, f2, f3]
[5, 6]        | [f4, f5]

I've been playing around a lot with inner joins, group by t1.family_ids && t2.family_ids but can't seem to find a good performing solution. The table right now has a size of ~100k rows. In the future, this table will grow towards ~500k-1M rows.


Solution

  • This is a graph-walking problem.

    A common approach is to unnest the arrays to generate the nodes, then self-join on matching families to compute all edges. We can then use a recursive query to traverse the graph, while taking care of not visiting the same node twice, and then aggregate to generate groups. The final steps is to recover the corresponding family ids.

    with recursive
        nodes as (
            select t.id, x.family_id
            from mytable t
            cross join lateral unnest(t.family_ids) as x(family_id)
        ),
        edges as (
            select n1.id as id1, n2.id as id2
            from nodes n1
            inner join nodes n2 using (family_id)
        ),
        cte as (
            select id1, id2, array[id1] as visited 
            from edges
            where id1 = id2
            union all 
            select c.id1, e.id2, c.visited || e.id2
            from cte c
            inner join edges e on e.id1 = c.id2
            where e.id2 <> all(c.visited)
        ),
        res as (
            select id1, array_agg(distinct id2 order by id2) as id2s
            from cte
            group by id1
        )
    select 
        array_agg(distinct n.id order by n.id) as ids, 
        array_agg(distinct n.family_id order by n.family_id) as family_ids
    from res r
    inner join nodes n on n.id = r.id1
    group by r.id2s
    

    Demo on DB Fiddlde:

    ids       | family_ids
    :-------- | :---------
    {1,2,3,4} | {f1,f2,f3}
    {5,6}     | {f4,f5}