Find common third on large data set


I have a large dataframe like

df <- data.frame(group= c("a","a","b","b","b","c"),
             person = c("Tom","Jerry","Tom","Anna","Sam","Nic"), stringsAsFactors = FALSE)

df
    group person
1     a    Tom
2     a  Jerry
3     b    Tom
4     b   Anna
5     b    Sam
6     c    Nic

and would like to get this as a result:

df.output
  pers1 pers2 person_in_common
1  Anna Jerry              Tom
2 Jerry   Sam              Tom
3   Sam   Tom             Anna
4  Anna   Tom              Sam
6  Anna   Sam              Tom

The result data frame is basically a table of all pairs of persons who have another person in common, that is, who each share a group with that third person. I found a way to do it in SQL, but it takes an awfully long time, so I wonder if there is an efficient way to do it in R.


Solution

  • Here's one using the igraph package. The basic idea is to build a graph in which two persons are connected when they share a group, and then, for each node, list every pair of its adjacent nodes.

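    library(igraph)
    # Keep only groups with at least two members; smaller groups yield no pairs
    X1 = split(df$person, df$group)
    X2 = X1[lengths(X1) >= 2]
    # Edge list: one row per pair of persons sharing a group
    dat = data.frame(do.call(rbind, unlist(lapply(X2, function(x)
                combn(x, 2, sort, FALSE)), recursive = FALSE)))
    # simplify() drops duplicate edges when two persons share several groups
    g = simplify(graph_from_data_frame(dat, directed = FALSE))
    mydf = data.frame(as.matrix(as_adjacency_matrix(g)))
    # Keep persons (columns) with at least two neighbours
    mydf = mydf[colSums(mydf) > 1]
    # lapply (not sapply) so the result is always a list, one element per person
    ANS = lapply(mydf, function(x) t(combn(row.names(mydf)[which(x == 1)], 2)))
    do.call(rbind, lapply(names(ANS), function(nm) data.frame(ANS[[nm]], nm)))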
    #     X1   X2   nm
    #1   Sam  Tom Anna
    #2  Anna  Tom  Sam
    #3 Jerry Anna  Tom
    #4 Jerry  Sam  Tom
    #5  Anna  Sam  Tom
    

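    Or look up each person's neighbours directly with adjacent_vertices (persons with fewer than two neighbours get an NA row):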

    mynames = unique(do.call(c, X2))
    do.call(rbind,
            lapply(mynames, function(x){
                L = V(g)$name[unlist(adjacent_vertices(graph = g, v = x))]
                if(length(L) >= 2){
                    setNames(data.frame(t(combn(L, 2)), x), c("P1", "P2", "P3"))
                }else{
                    setNames(data.frame(NA, NA, x), c("P1", "P2", "P3"))
                }
            }))
    #     P1   P2    P3
    #1 Jerry Anna   Tom
    #2 Jerry  Sam   Tom
    #3  Anna  Sam   Tom
    #4  <NA> <NA> Jerry
    #5   Sam  Tom  Anna
    #6  Anna  Tom   Sam
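
For comparison, the same table can be sketched in base R with two self-joins via merge(). This is not the igraph approach above, only an illustration of the underlying idea: first find who shares a group with whom, then pair up the neighbours of each person. The names nb, trip and res are made up for this sketch. It has not been benchmarked, and the intermediate joins can become large on big data.

```r
# Example data from the question
df <- data.frame(group  = c("a", "a", "b", "b", "b", "c"),
                 person = c("Tom", "Jerry", "Tom", "Anna", "Sam", "Nic"),
                 stringsAsFactors = FALSE)

# Step 1: self-join on group to get every ordered pair of distinct persons
# that share a group (hypothetical name: nb, for "neighbours")
nb <- merge(df, df, by = "group")   # columns: group, person.x, person.y
nb <- unique(nb[nb$person.x != nb$person.y, c("person.x", "person.y")])
names(nb) <- c("common", "neighbour")

# Step 2: self-join on the common person to pair up that person's neighbours
trip <- merge(nb, nb, by = "common")   # columns: common, neighbour.x, neighbour.y

# Keep each unordered pair once; this also drops a neighbour paired with itself
trip <- trip[trip$neighbour.x < trip$neighbour.y, ]

res <- data.frame(pers1            = trip$neighbour.x,
                  pers2            = trip$neighbour.y,
                  person_in_common = trip$common,
                  stringsAsFactors = FALSE)
res <- res[order(res$person_in_common, res$pers1, res$pers2), ]
rownames(res) <- NULL
res
```

On the example data this gives the five pairs asked for in the question. Persons with fewer than two neighbours, such as Jerry and Nic, simply do not appear as person_in_common.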