r igraph

R: find groups of tuples ignoring NAs


Based on an almost identical question, I am trying to create unique IDs based on several columns, where rows should be grouped into the same ID if "there exists a path through any combination of the columns". The difference is that I have NAs that should not be used to link rows.

The goal is for R to create id3 based on id1 and id2, minimal example:

For example, id1=1 is related to a in id2 (its other id2 entry is NA). But id1=2 is also related to a, so both belong to one group (id3=group1). And since id1=2 and id1=3 share id2=c, id1=3 belongs to that group as well (id3=group1). The values of the tuple ((1,2,3),('a','c','d')) appear nowhere else, so no other row belongs to that group (which is labeled group1 generically).

library(igraph)
df = data.frame(id1 = c(1,1,2,2,3,3,4,4,5,5,6,6,NA,NA),
                id2 = c('a',NA,'a','c','c','d','x',NA,'y','z','x','z',NA,NA),
                id3 = c(rep('group1',6), rep('group2',6),NA,NA))

My solution fails with NA values.

g <- graph_from_data_frame(df, FALSE)
cg <- clusters(g)$membership
df$id4 <- cg[df$id1]
df

Observations (rows) 2 and 8 are linked because both have NA for id2, but this should be ignored. Is there a way t
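
A minimal sketch of why rows 2 and 8 end up linked, assuming an igraph version that replaces NA in the first two columns with the string "NA" (it warns when doing so). The four-row data frame `demo` is a hypothetical cut-down of df, not part of the original question:

```r
library(igraph)

# Hypothetical cut-down of df: two unrelated id1 values, each with one NA in id2
demo <- data.frame(id1 = c(1, 1, 4, 4),
                   id2 = c("a", NA, "x", NA))

# With NA kept, igraph is assumed to coerce NA to a vertex named "NA" (hence the warning)
g_with_na <- suppressWarnings(graph_from_data_frame(demo, directed = FALSE))
n_with_na <- components(g_with_na)$no

# With incomplete rows dropped first, no such shared vertex exists
g_without_na <- graph_from_data_frame(na.omit(demo), directed = FALSE)
n_without_na <- components(g_without_na)$no

c(with_na = n_with_na, without_na = n_without_na)
```

Under that assumption the first graph has a single component, because both id1 values connect to the same "NA" vertex, while the second graph has two.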


Solution

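  • You can try the code below (assuming df holds only the id1 and id2 columns, i.e. without the expected id3 column), using

    • components + membership + merge
    # na.omit() drops every row with an NA, so NA never becomes a vertex
    g <- graph_from_data_frame(na.omit(df))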
    merge(
      df,
      transform(
        rev(stack(membership(components(g))[V(g)[names(V(g)) %in% df$id1]])),
        values = paste0("group", values)
      ),
      by.x = "id1",
      by.y = "ind",
      all = TRUE
    )
    

    or

    • decompose + merge
    # decompose() returns one subgraph per connected component
    subg <- decompose(graph_from_data_frame(na.omit(df)))
    merge(df,
      do.call(
        rbind,
        Map(
          function(x, y) cbind(setNames(unique(as_data_frame(x)[1]), "id1"), id3 = y),
          subg,
          paste0("group", seq_along(subg))
        )
      ),
      by = "id1",
      all = TRUE
    )
    

    which gives you (the first variant names the group column values instead of id3)

       id1  id2    id3
    1    1    a group1
    2    1 <NA> group1
    3    2    a group1
    4    2    c group1
    5    3    c group1
    6    3    d group1
    7    4    x group2
    8    4 <NA> group2
    9    5    y group2
    10   5    z group2
    11   6    x group2
    12   6    z group2
    13  NA <NA>   <NA>
    14  NA <NA>   <NA>
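
For reference, a self-contained sketch of the same idea that writes id3 straight into df by looking up each id1 in the component membership instead of merging. This is a variation, not the code above: the id1_/id2_ prefixes, the renumbering by first appearance, and the helper names `edges`, `memb`, and `grp` are assumptions added here for illustration.

```r
library(igraph)

df <- data.frame(
  id1 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, NA, NA),
  id2 = c("a", NA, "a", "c", "c", "d", "x", NA, "y", "z", "x", "z", NA, NA)
)

# Keep only complete (id1, id2) pairs so that NA never becomes a vertex
edges <- na.omit(df[, c("id1", "id2")])

# Prefix each column so an id1 value can never be mistaken for an id2 value
edges$id1 <- paste0("id1_", edges$id1)
edges$id2 <- paste0("id2_", edges$id2)

g <- graph_from_data_frame(edges, directed = FALSE)

# Component number per vertex, named explicitly by vertex name
memb <- setNames(as.integer(components(g)$membership), V(g)$name)

# Look up each row's id1; rows whose id1 is not in the graph get NA
grp <- unname(memb[paste0("id1_", df$id1)])

# Renumber components in order of first appearance in df
grp <- match(grp, unique(grp[!is.na(grp)]))

df$id3 <- ifelse(is.na(grp), NA_character_, paste0("group", grp))
df
```

Note that an id1 that only ever occurs with a missing id2 would not appear in the graph and would therefore get NA here; the example data has no such case.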