I need to speed up a process of finding most optimal distance for each entry. I am using gower.dist
from StatMatch
and solve_LSAP
from the clue
package. The gower distance takes no time at all, however the LSAP solver takes too long with the number of times I need to run it.
Is there a way to make this run faster using parallel computing or just making part of it run in parallel [link to clue github] [link to scientific journal discussing this] or another solver that I may be unaware of that is faster? The other two libraries I am aware of are adagio
and RcppHungarian
(both are slower).
Example data: Gower Distance Data (google drive link to folder with data)
> dim(gowerdist)
[1] 4309 10366
solve_LSAP(gowerdist, maximum = FALSE)
clue::solve_LSAP
uses the Hungarian algorithm. The Jonker and Volgenant (1987) algorithm, implemented in TreeDist::LAPJV()
, is more efficient.
I tested a reduced version of the matrix to get results in seconds rather than minutes; differences in run time are likely to increase with larger matrices.
gower <- read.csv("GowerDistance.csv")
dim(gower)
gowerMat <- as.matrix(gower)
gow1 <- gowerMat[1:2400, 1:4800]
tictoc::tic()
clue <- clue::solve_LSAP(gow1, maximum = FALSE)
tictoc::toc()
# 67.95 sec elapsed
tictoc::tic()
td <- TreeDist::LAPJV(gow1)
tictoc::toc()
# 20.24 sec elapsed
GraphAlignment::LinearAssignment()
on Bioconductor also uses the LAPJV algorithm, but can only be applied to square matrices. (A workaround is to add extra rows to the matrix with extremely high values.)
Another alternative is lpSolve; this does not specify which algorithm it uses, but is much slower
tictoc::tic()
lps <- lpSolve::lp.assign(gowerMat[1:800, 1:1600]) # much smaller matrix
tictoc::toc()
# 364.67 sec elapsed