Tags: r, sparse-matrix, igraph, adjacency-matrix

How to calculate the number of common neighbours for all edges in a graph?


I have a network defined by a list of edges. The network is large and sparse. For each pair of connected vertices, I would like to calculate the number of common neighbours. This post discusses how to do this for a single pair of vertices, but it strikes me as inefficient to loop over all edges to calculate this statistic for each edge in the graph. Instead, the statistic I'm after can be calculated from the product of the adjacency matrix with itself, as follows:

library(igraph)
library(data.table)
set.seed(1111)

E <- data.table(i = sample(as.character(1:5e4), 1e5, replace = TRUE),
                j = sample(as.character(1:5e4), 1e5, replace = TRUE))
G <- simplify(graph_from_data_frame(E, directed = F))  # remove loops and multiples
N <- as_adjacency_matrix(G) %*% as_adjacency_matrix(G)
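
To see why this works: for a simple undirected graph, entry (i, j) of the squared adjacency matrix counts the walks of length two between i and j, which is exactly the number of common neighbours of i and j. A quick check on a toy graph (using the packages already loaded above):

g <- graph_from_literal(a - b, a - c, b - c, b - d, c - d)
A <- as_adjacency_matrix(g)
(A %*% A)["b", "c"]  # b and c share two neighbours: a and d
diag(A %*% A)        # the diagonal holds the vertex degrees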

However, I don't know how to efficiently get the information out of the resulting matrix N, without looping over all the cells, which would look like this:

extract_entries <- function(x, M) {

  # pull the non-zero entries of column x straight from the dgCMatrix slots
  nl <- M@p[x] + 1  # index from 1, not 0
  nu <- M@p[x + 1]
  if (nl > nu) return(NULL)  # skip columns with no non-zero entries
  j.col <- M@Dimnames[[1]][M@i[nl:nu] + 1]
  i.col <- M@Dimnames[[2]][x]
  nb.col <- M@x[nl:nu]
  
  data.table(i = i.col, j = j.col, nb = nb.col)
  
}
  
system.time(E.nb <- rbindlist(lapply(1:N@Dim[1], extract_entries, N), fill = TRUE))

   # user  system elapsed 
  #  8.29    0.02    8.31 

E <- E.nb[E, on = c('i', 'j')][is.na(nb), nb := 0]

Even in the reproducible example above, looping is slow, and the true graph might have millions of vertices and tens of millions of edges. My end goal is to add a column to the data frame E with the number of common neighbours for each edge, as illustrated in the MWE.

My question is: is there a (much) more efficient way of extracting the number of common neighbours for each pair of vertices and merging this information back into the list of edges?

I have seen that the DiagrammeR package includes a function that calculates the common neighbours of a pair of vertices, but it again appears intended for a limited number of vertex pairs at a time, and it wouldn't solve the problem of adding the counts back to the original data frame.
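
For reference, that per-pair lookup would look roughly like this (a sketch only; I'm assuming DiagrammeR's from_igraph() and get_common_nbrs() here, which work on DiagrammeR's integer node IDs rather than the original vertex names):

library(DiagrammeR)

# one call per pair of vertices -- exactly the per-edge loop I'd like to avoid
g_dgr <- from_igraph(G)
get_common_nbrs(g_dgr, nodes = c(1, 2))  # common neighbours of nodes 1 and 2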


Solution

  • You're pretty much there. Just a couple of things: converting N to a triangular matrix lets us build E.nb without lapply. Also, the i-j ordering messes up the E.nb[E] join; temporarily sorting each row fixes this.

    I've also included a function that uses igraph's triangles function instead of squaring the adjacency matrix, which is a bit faster on this example dataset: counting common neighbours of connected vertices amounts to counting the triangles through each edge, and triangles only enumerates triangles that actually exist, while the matrix product also computes counts for the many unconnected pairs that share a neighbour.

    library(igraph)
    library(data.table)
    library(Matrix)
    set.seed(1111)
    
    E <- data.table(i = sample(as.character(1:5e4), 1e5, replace = TRUE),
                    j = sample(as.character(1:5e4), 1e5, replace = TRUE))
    
    f1 <- function(E) {
      blnSort <- E[[1]] > E[[2]]
      E[blnSort, 2:1 := .SD, .SDcols = 1:2]
      G <- simplify(graph_from_data_frame(E, directed = FALSE))  # remove loops and multiples
      N <- as(tril(as_adjacency_matrix(G) %*% as_adjacency_matrix(G), -1), "dtTMatrix")
      data.table(
        i = N@Dimnames[[1]][N@i + 1],
        j = N@Dimnames[[2]][N@j + 1],
        nb = as.integer(N@x)
      )[
        i > j, 2:1 := .(i, j)
      ][
        E, on = .(i, j)
      ][
        is.na(nb), nb := 0L
      ][
        blnSort, 2:1 := .(i, j)
      ]
    }
    
    f2 <- function(E) {
      blnSort <- E[[1]] > E[[2]]
      E[blnSort, 2:1 := .SD, .SDcols = 1:2]
      G <- simplify(graph_from_data_frame(E, directed = FALSE))  # remove loops and multiples
      tri <- matrix(triangles(G), ncol = 3, byrow = TRUE)
      data.table(
        i = names(V(G)[tri[,c(1, 1, 2)]]),
        j = names(V(G)[tri[,c(2, 3, 3)]])
      )[
        i > j, 2:1 := .(i, j)
      ][
        ,.(nb = .N), .(i, j)
      ][
        E, on = .(i, j)
      ][
        is.na(nb), nb := 0L
      ][
        blnSort, 2:1 := .(i, j)
      ]
    }
    
    microbenchmark::microbenchmark(f1 = f1(EE),
                                   f2 = f2(EE),
                                   setup = {EE <- copy(E)})
    #> Unit: milliseconds
    #>  expr      min       lq     mean   median       uq      max neval
    #>    f1 257.4803 281.8928 325.0267 303.4441 370.4977 478.7524   100
    #>    f2 123.5213 139.5152 169.3914 151.3065 190.7800 284.2644   100
    identical(f1(copy(E)), f2(copy(E)))
    #> [1] TRUE
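
    Note that both f1 and f2 swap the i and j values by reference in the rows of E where i > j (which is why the benchmark setup uses copy(E)); if you need the original edge table untouched, call them on a copy:

    E.nb <- f1(copy(E))  # E is left unchanged; E.nb has columns i, j, nb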