Search code examples
rrecord-linkage

How to create a common ID for 2 two groups that share members in common


I am trying to create a new ID for groups that should be linked together as they share common members. Example below shows the problem.

The first table tells me that dedupe id 1 has members 12,23 and 34

Dedupe ID # Member
1 12
1 23
1 34
2 56
... ...

While the second table tells me dedupe id 5 has members 12, 23 and 47. Noting that 12 and 23 are common with dedupe id 1 in the first table.

Dedupe ID # Member
5 12
5 23
5 47
2 7
... ...

As groups 1 and 5 overlap - I would like to ensure that members 12, 23, 34 and 47 are joined into 1 dedupe id, like below

dedupe ID # Member
1 12
1 23
1 34
1 47
2 56
... ...

@akrun, good question. Maybe the better way of doing this is to update table 1 so that dedupeid 1 maps to Member 47 (see example above). All of the other mappings in table 1 should stay.


Solution

  • I fear this might succumb to multiple common members, and its iteratively appending to dat1 is a bit inefficient at scale, but ...

    members1 <- split(dat1$Member, dat1$Group)
    members2 <- split(dat2$Member, dat2$Group)
    for (G in unique(dat1$Group)) {
      m1 <- dat1$Member[dat1$Group == G]
      incommon <- lengths(lapply(members2, function(m2) intersect(dat1$Member[dat1$Group == G], m2)))
      move2 <- subset(dat2, Group %in% names(members2)[incommon > 0] & !Member %in% m1)
      if (nrow(move2)) {
        dat1 <- rbind(dat1, transform(move2, Group = G))
        dat2 <- subset(dat2, !Group %in% names(members2)[incommon > 0])
      }
    }
    
    rbind(dat1, dat2)
    #    Group Member
    # 1      A      1
    # 2      A      2
    # 3      A      3
    # 4      C      5
    # 31     A      4
    # 41     D      7
    

    Data

    dat1 <- structure(list(Group = c("A", "A", "A", "C"), Member = c(1L, 2L, 3L, 5L)), class = "data.frame", row.names = c(NA, -4L))
    dat2 <- structure(list(Group = c("B", "B", "B", "D"), Member = c(2L, 3L, 4L, 7L)), class = "data.frame", row.names = c(NA, -4L))