Tags: r, performance, levenshtein-distance

R - adist taking too long to run


I am currently working with a dataset of approximately 250k rows. The adist function from the utils package runs for several hours (8+).

Code Used:

master <- read.csv("Master.csv", header = TRUE)
companies <- read.csv("Clean Companies.csv", header = TRUE)
dirty <- subset(master, select = c("Company"))          # names to be matched
comp <- subset(companies, select = c("COMPANY.CLEAN"))  # clean reference names

> dim(dirty)
[1] 246774      1

# To test, one can use:
# dirty <- data.frame(name = c("ABC", "*/BC", "HO**E...", "OFFi....ce"))
# comp <- data.frame(info_names = c("ABC", "HOME", "OFFICE"))


# Compute the full edit-distance matrix and map each dirty name to its closest clean one
mat <- adist(dirty[, 1], comp[, 1])
data <- cbind.data.frame(orig = dirty[, 1], new = comp[apply(mat, 1, which.min), 1])

Is there a better way to do this?


Solution

  • I don't know the adist function too well, but you could parallelize it with e.g. foreach, if the input vector in dirty can be processed element-wise (iterators should help to reduce the memory usage):

    library(foreach)
    library(iterators)
    mat_for <- foreach(dirti = iter(dirty$name), .export = "comp", .combine = rbind) %do% {
      adist(dirti, comp[, 1])
    }
    

    You just have to register a suitable parallel backend and change %do% to %dopar%, as sketched below.
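
    A minimal sketch of that step, assuming the doParallel package as the backend (any foreach-compatible backend such as doSNOW or doMC would do); the worker count is only an illustration:

    library(doParallel)  # assumption: doParallel as the backend; attaches foreach, iterators and parallel
    cl <- makeCluster(max(1, detectCores() - 1))  # leave one core free
    registerDoParallel(cl)
    mat_dopar <- foreach(dirti = iter(dirty$name), .export = "comp", .combine = rbind) %dopar% {
      adist(dirti, comp[, 1])
    }
    stopCluster(cl)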

    Other parallel approaches using a parallel version of lapply (e.g. parLapply from the snow package, or the same function in base R's parallel package) should also work; the sequential lapply version below shows the structure you would parallelize:

    mat_lap <- Reduce(rbind, lapply(dirty$name, function(x, comp) adist(x, comp[, 1]), comp = comp))
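
    A minimal sketch of the parLapply variant, assuming the parallel package that ships with base R; the number of workers is only an illustration:

    library(parallel)
    cl <- makeCluster(4)  # assumption: 4 workers
    mat_par_lap <- Reduce(rbind,
                          parLapply(cl, dirty$name,
                                    function(x, comp) adist(x, comp[, 1]),
                                    comp = comp))  # comp is shipped to the workers as an argument
    stopCluster(cl)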
    

    Test the parallel approach on a small subset of your data first; then you can also see whether the computation time decreases (see the timing sketch after the output below). With your example I got the same results:

    > all.equal(mat_lap, mat)
    [1] TRUE
    > all.equal(mat_for, mat)
    [1] TRUE
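
    A minimal sketch of such a timing comparison, assuming a parallel backend has already been registered as above; the subset size of 1,000 rows is arbitrary:

    small <- head(dirty, 1000)                              # hypothetical subset for a quick benchmark
    system.time(mat_base <- adist(small[, 1], comp[, 1]))   # sequential baseline
    system.time(
      mat_par <- foreach(dirti = iter(small$name), .export = "comp", .combine = rbind) %dopar% {
        adist(dirti, comp[, 1])
      }
    )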