Search code examples
rparallel-processingassignment-problem

Linear sum assignment/Hungarian method performance in R


I need to speed up a process of finding most optimal distance for each entry. I am using gower.dist from StatMatch and solve_LSAP from the clue package. The gower distance takes no time at all, however the LSAP solver takes too long with the number of times I need to run it.

Is there a way to make this run faster using parallel computing or just making part of it run in parallel [link to clue github] [link to scientific journal discussing this] or another solver that I may be unaware of that is faster? The other two libraries I am aware of are adagio and RcppHungarian (both are slower).

Example data: Gower Distance Data (google drive link to folder with data)

> dim(gowerdist)
[1]  4309 10366
solve_LSAP(gowerdist, maximum = FALSE)

Solution

  • clue::solve_LSAP uses the Hungarian algorithm. The Jonker and Volgenant (1987) algorithm, implemented in TreeDist::LAPJV(), is more efficient.

    I tested a reduced version of the matrix to get results in seconds rather than minutes; differences in run time are likely to increase with larger matrices.

    gower <- read.csv("GowerDistance.csv")
    dim(gower)
    gowerMat <- as.matrix(gower)
    gow1 <- gowerMat[1:2400, 1:4800]
    
    tictoc::tic()
    clue <- clue::solve_LSAP(gow1, maximum = FALSE)
    tictoc::toc()
    # 67.95 sec elapsed
    
    tictoc::tic()
    td <- TreeDist::LAPJV(gow1)
    tictoc::toc()
    # 20.24 sec elapsed
    

    GraphAlignment::LinearAssignment() on Bioconductor also uses the LAPJV algorithm, but can only be applied to square matrices. (A workaround is to add extra rows to the matrix with extremely high values.)

    Another alternative is lpSolve; this does not specify which algorithm it uses, but is much slower

    tictoc::tic()
    lps <- lpSolve::lp.assign(gowerMat[1:800, 1:1600]) # much smaller matrix
    tictoc::toc()
    # 364.67 sec elapsed