Given a table
name ip
A = |A 1 |
|B 1 |
|C 1 |
|B 2 |
|C 2 |
|D 3 |
|E 2 |
If any two names share same ip they belong in same group. Also ip with same name belong in same group. If you find all names for ip 1, {A, B, C}, then you should include all ips associated with {A,B,C} in that same group {1,2} and all then again all names with those ips that aren't already include {E} and so forth. In this particular example, anything in {A,B,C,E} x {1, 2} would be in the same group. The results for the above table would be
name ip group
A = |A 1 1 |
|B 1 1 |
|C 1 1 |
|B 2 1 |
|C 2 1 |
|D 3 2 |
|E 2 1 |
Just to be clear:
If names A, B, and C are all ip 1 then they are grouped together and you should have
A, 1 = group1
B, 1 = group1
C, 1 = group1
If names A, B also share ip 2, then they should NOT make a new group but instead should should be in the same group like this:
A, 1 = group1
B, 1 = group1
C, 1 = group1
A, 2 = group1
B, 2 = group1
The goal is to solve this in Google BigQuery SQL.
So far I have
select ip, row_number() over () as group,
GROUP_CONCAT(name,',') as names,
from A
group by ip
which yields all of the names for an ip and gives a group, but doesn't find all the ips for a name or find the group for all pairs that encompasses all names and ips.
Note, you can use split to access names that are concatenated (in this case with a ',').
UPDATE - This is called transitive closure. If this is too difficult, it would be sufficient to show how to do just the first iteration of a transitive closure (how to find all the ips associated with all the names associated with each ip) and label these as groups.
Here is my solution for the first iteration. It is a bit long and might be improved, but this is what I have.
Step 1.
select name, nest(ip) ips, group_concat(string(ip)) sip from
(select 'a' name, 1 ip),
(select 'b' name, 1 ip),
(select 'c' name, 1 ip),
(select 'b' name, 2 ip),
(select 'c' name, 2 ip),
(select 'd' name, 3 ip),
(select 'e' name, 2 ip)
group by name
Store the results in temporary table x
Step 2.
select a.name name, group_concat(b.name) as cluster from (
select a.name, b.name from (
select a.*, b.* from dataset.x a cross join dataset.x b
) omit record if every(not b.sip contains string(a.ips))
group by 1, 2 order by 1, 2) group by 1
Store the results in temporary table y
Step 3.
select cluster from (
select group_concat(part) cluster from (
select name, part from (
select a.name name, split(b.cluster) part
from dataset.y a cross join dataset.y b
where b.cluster contains a.name) group by 1, 2 order by 1, 2)
group by name) group by cluster
This should produce all unique clusters, i.e.
a,b,c,e
d