
Fast parallel bipartite distance calculation in R


What is the fastest way to compute bipartite distances in R with a parallelized Rcpp backend?

parallelDist is a great package with a C++ backend and support for multithreading, but to my knowledge it does not support bipartite distance calculations.

A workaround is to stack the two matrices and call parallelDist() on the result. This computes the m1:m1 and m2:m2 blocks in addition to the m1:m2 block we actually want -- highly inefficient.

library(parallelDist)

bipartiteDist <- function(matrix1, matrix2){
  # stack both matrices, compute the full distance matrix in parallel,
  # then keep only the m1:m2 block
  matrix12 <- rbind(matrix1, matrix2)
  d <- parallelDist(matrix12)
  d <- as.matrix(d)[1:nrow(matrix1), (nrow(matrix1) + 1):(nrow(matrix1) + nrow(matrix2))]
  d
}

matrix1 <- abs(matrix(rnorm(10 * 100000), 10, 100000))
matrix2 <- abs(matrix(rnorm(10 * 100000), 10, 100000))

dist <- bipartiteDist(matrix1, matrix2)

This approach is faster than pdist or a pure R implementation when more than 3 cores are available.

pdist is designed exactly for bipartite distance computation, but does not support multithreading.
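
For reference, a minimal single-threaded sketch with pdist (its pdist() function computes Euclidean distances between the rows of its two arguments) might look like:

library(pdist)

# Euclidean distances between each row of matrix1 and each row of matrix2;
# as.matrix() converts the pdist object to an nrow(matrix1) x nrow(matrix2) matrix
dist12 <- as.matrix(pdist(matrix1, matrix2))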

Any fast implementations for parallelized bipartite distance computation?


Solution

  • The wordspace dist.matrix() function supports parallelized calculation of bipartite distances.

    Benchmarking wordspace against parallelDist

    matrix1 <- abs(matrix(rnorm(100 * 100000), 100, 100000))
    matrix2 <- abs(matrix(rnorm(100 * 100000), 100, 100000))
    
    library(rbenchmark)
    library(parallelDist)
    library(wordspace)
    
    bipartiteDist_parallelDist <- function(matrix1, matrix2){
      # stack both matrices, compute the full distance matrix,
      # then extract only the m1:m2 block
      matrix12 <- rbind(matrix1, matrix2)
      d <- parallelDist(matrix12, method = "euclidean")
      d <- as.matrix(d)[1:nrow(matrix1), (nrow(matrix1) + 1):(nrow(matrix1) + nrow(matrix2))]
      d
    }
    
    bipartiteDist_wordspace <- function(matrix1, matrix2){
      wordspace.openmp(threads = wordspace.openmp()$max)  # use all available OpenMP threads
      dist.matrix(matrix1, matrix2, byrow = TRUE, method = "euclidean", convert = FALSE)
    }
    
    benchmark("parallelDist" = {
                bd1 <- bipartiteDist_parallelDist(matrix1,matrix2)
              },
              "wordspace" = {
                bd2 <- bipartiteDist_wordspace(matrix1,matrix2)
              },
              replications = 1,
              columns = c("test", "replications", "elapsed",
                          "relative", "user.self", "sys.self"))
    
    plot(bd1,bd2) # yes, both methods give near-identical results
    

    Benchmarking results:

              test replications elapsed relative user.self sys.self
    1 parallelDist            1   2.120   12.184   126.145    0.523
    2    wordspace            1   0.174    1.000     3.749    0.252
    

    I used 80 threads.

    A framework for further speed gains

    The author of wordspace notes that the package emphasizes low memory load over raw speed, so additional speed gains should be possible (source).

    For instance, here is a general framework for Euclidean distance, based on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b):

    bipartiteDist3 <- function(matrix1, matrix2){
      m1tm2 <- tcrossprod(matrix1, matrix2)  # cross-products (a . b)
      sq1 <- rowSums(matrix1^2)              # squared row norms of matrix1
      sq2 <- rowSums(matrix2^2)              # squared row norms of matrix2
      out0 <- outer(sq1, sq2, "+") - 2 * m1tm2
      sqrt(pmax(out0, 0))                    # clamp tiny floating-point negatives
    }
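
    A quick sanity check of this framework against wordspace, using small hypothetical matrices:

    m1 <- abs(matrix(rnorm(5 * 20), 5, 20))
    m2 <- abs(matrix(rnorm(4 * 20), 4, 20))
    max(abs(bipartiteDist3(m1, m2) -
            dist.matrix(m1, m2, byrow = TRUE, method = "euclidean", convert = FALSE)))
    # should be ~0, up to floating-point error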
    

    I am very interested in a parallelized solution optimized for sparse matrices. To my knowledge, wordspace does not take advantage of sparsity. The building blocks above all have sparse equivalents: there are parallelizable sparse-matrix implementations of tcrossprod, rowSums, and the outer step; a minimal sparse sketch follows.
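
    As a starting point, here is a sketch of the same Euclidean framework on sparse inputs, assuming matrix1 and matrix2 are dgCMatrix objects from the Matrix package (its tcrossprod and rowSums methods are sparse-aware, though single-threaded):

    library(Matrix)

    bipartiteDist_sparse <- function(matrix1, matrix2){
      m1tm2 <- tcrossprod(matrix1, matrix2)  # sparse cross-products (a . b)
      sq1 <- Matrix::rowSums(matrix1^2)      # squared row norms of matrix1
      sq2 <- Matrix::rowSums(matrix2^2)      # squared row norms of matrix2
      out0 <- outer(sq1, sq2, "+") - 2 * as.matrix(m1tm2)  # result is necessarily dense
      sqrt(pmax(out0, 0))                    # clamp tiny floating-point negatives
    }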