Search code examples
sqlgroup-bygoogle-bigquerytransitive-closure

Labelling groups of two columns in SQL (BigQuery SQL if possible)


Given a table

     name  ip 
A = |A     1  |
    |B     1  |
    |C     1  |
    |B     2  |
    |C     2  |
    |D     3  |
    |E     2  |

If any two names share same ip they belong in same group. Also ip with same name belong in same group. If you find all names for ip 1, {A, B, C}, then you should include all ips associated with {A,B,C} in that same group {1,2} and all then again all names with those ips that aren't already include {E} and so forth. In this particular example, anything in {A,B,C,E} x {1, 2} would be in the same group. The results for the above table would be

     name  ip  group
A = |A     1     1    |
    |B     1     1    |
    |C     1     1    |
    |B     2     1    |
    |C     2     1    |
    |D     3     2    |
    |E     2     1    |

Just to be clear:

If names A, B, and C are all ip 1 then they are grouped together and you should have

A, 1 = group1
B, 1 = group1
C, 1 = group1

If names A, B also share ip 2, then they should NOT make a new group but instead should should be in the same group like this:

A, 1 = group1
B, 1 = group1
C, 1 = group1
A, 2 = group1
B, 2 = group1

The goal is to solve this in Google BigQuery SQL.

So far I have

select ip, row_number() over () as group,
GROUP_CONCAT(name,',') as names,
from A
group by ip

which yields all of the names for an ip and gives a group, but doesn't find all the ips for a name or find the group for all pairs that encompasses all names and ips.

Note, you can use split to access names that are concatenated (in this case with a ',').

UPDATE - This is called transitive closure. If this is too difficult, it would be sufficient to show how to do just the first iteration of a transitive closure (how to find all the ips associated with all the names associated with each ip) and label these as groups.


Solution

  • Here is my solution for the first iteration. It is a bit long and might be improved, but this is what I have.

    Step 1.

    select name, nest(ip) ips, group_concat(string(ip)) sip from 
    (select 'a' name, 1 ip),
    (select 'b' name, 1 ip),
    (select 'c' name, 1 ip),
    (select 'b' name, 2 ip),
    (select 'c' name, 2 ip),
    (select 'd' name, 3 ip),
    (select 'e' name, 2 ip)
    group by name
    

    Store the results in temporary table x

    Step 2.

    select a.name name, group_concat(b.name) as cluster from (
    select a.name, b.name from (
    select a.*, b.* from dataset.x a cross join dataset.x b
    ) omit record if every(not b.sip contains string(a.ips))
    group by 1, 2 order by 1, 2) group by 1
    

    Store the results in temporary table y

    Step 3.

    select cluster from (
    select group_concat(part) cluster from (
    select name, part from (
    select a.name name, split(b.cluster) part 
    from dataset.y a cross join dataset.y b
    where b.cluster contains a.name) group by 1, 2 order by 1, 2) 
    group by name) group by cluster
    

    This should produce all unique clusters, i.e.

    a,b,c,e
    d