sql group-by google-bigquery transitive-closure

Labelling groups of two columns in SQL (BigQuery SQL if possible)

Given a table

     name  ip 
A = |A     1  |
    |B     1  |
    |C     1  |
    |B     2  |
    |C     2  |
    |D     3  |
    |E     2  |

If any two names share same ip they belong in same group. Also ip with same name belong in same group. If you find all names for ip 1, {A, B, C}, then you should include all ips associated with {A,B,C} in that same group {1,2} and all then again all names with those ips that aren't already include {E} and so forth. In this particular example, anything in {A,B,C,E} x {1, 2} would be in the same group. The results for the above table would be

     name  ip  group
A = |A     1     1    |
    |B     1     1    |
    |C     1     1    |
    |B     2     1    |
    |C     2     1    |
    |D     3     2    |
    |E     2     1    |

Just to be clear:

If names A, B, and C are all ip 1 then they are grouped together and you should have

A, 1 = group1
B, 1 = group1
C, 1 = group1

If names A, B also share ip 2, then they should NOT make a new group but instead should should be in the same group like this:

A, 1 = group1
B, 1 = group1
C, 1 = group1
A, 2 = group1
B, 2 = group1

The goal is to solve this in Google BigQuery SQL.

So far I have

select ip, row_number() over () as group,
GROUP_CONCAT(name,',') as names,
from A
group by ip

which yields all of the names for an ip and gives a group, but doesn't find all the ips for a name or find the group for all pairs that encompasses all names and ips.

Note, you can use split to access names that are concatenated (in this case with a ',').

UPDATE - This is called transitive closure. If this is too difficult, it would be sufficient to show how to do just the first iteration of a transitive closure (how to find all the ips associated with all the names associated with each ip) and label these as groups.

Solution

Here is my solution for the first iteration. It is a bit long and might be improved, but this is what I have.

Step 1.

select name, nest(ip) ips, group_concat(string(ip)) sip from 
(select 'a' name, 1 ip),
(select 'b' name, 1 ip),
(select 'c' name, 1 ip),
(select 'b' name, 2 ip),
(select 'c' name, 2 ip),
(select 'd' name, 3 ip),
(select 'e' name, 2 ip)
group by name

Store the results in temporary table x

Step 2.

select a.name name, group_concat(b.name) as cluster from (
select a.name, b.name from (
select a.*, b.* from dataset.x a cross join dataset.x b
) omit record if every(not b.sip contains string(a.ips))
group by 1, 2 order by 1, 2) group by 1

Store the results in temporary table y

Step 3.

select cluster from (
select group_concat(part) cluster from (
select name, part from (
select a.name name, split(b.cluster) part 
from dataset.y a cross join dataset.y b
where b.cluster contains a.name) group by 1, 2 order by 1, 2) 
group by name) group by cluster

This should produce all unique clusters, i.e.

a,b,c,e
d