
Optimize R code to create distance matrix based on customized distance function


I am trying to create a distance matrix (to use for clustering) for strings, based on a customized distance function. I ran the code on a list of 6,000 words and it has been running for the last 90 minutes. I have 8 GB RAM and an Intel i5, so the problem must be with the code itself. Here is my code:

library(stringdist)
#Calculate distance between two monograms/bigrams
stringdist2 <- function(word1, word2)
{
    #for bigrams - phrases with two words
    if (grepl(" ",word1)==TRUE) {
        #"Hello World" and "World Hello" are not so different for me
        d=min(stringdist(word1, word2),
        stringdist(word1, gsub(word2, 
                          pattern = "(.*) (.*)", 
                          repl="\\2,\\1")))
    }
    #for monograms(words)
    else{
        #add penalty of 5 points if first character is not same
        #brave and crave are more different than brave and bravery
        d=ifelse(substr(word1,1,1)==substr(word2,1,1),
                            stringdist(word1,word2),
                            stringdist(word1,word2)+5)
    }   
    d
}
#create distance matrix
stringdistmat2 = function(arr)
{
    mat = matrix(nrow = length(arr), ncol= length(arr))
    for (k in 1:(length(arr)-1))
    {
        for (j in k:(length(arr)-1))
        {           
            mat[j+1,k]  = stringdist2(arr[k],arr[j+1])      
        }
    }
    as.dist(mat)    
}

test = c("Hello World","World Hello", "Hello Word", "Cello Word")
mydmat = stringdistmat2(test)
> mydmat
  1 2 3
2 1    
3 1 2  
4 2 3 1

I think the issue could be that I used loops instead of apply - but then I have read in many places that loops are not that inefficient in R. More importantly, I am not skilled enough to use apply for my loops, because they are nested and triangular, like k in 1:n and j in k:n. I wonder if there are other things that can be optimized as well.


Solution

  • Interesting question. So going step by step:

    1 - The stringdist function is already vectorized:

    #> stringdist("byye", c('bzyte','byte'))
    #[1] 2 1
    
    #> stringdist(c('doggy','gadgy'), 'dodgy')
    #[1] 1 2
    

    But given two vectors of the same length, stringdist loops over the two vectors in parallel, pairing the elements (it does not build a matrix of cross results), just as Map would do:

    #> stringdist(c("byye","alllla"), c('bzyte','byte'))
    #[1] 2 6
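
    A quick way to see this pairing behaviour with base R only - here adist() stands in for stringdist() (an assumption for this sketch; both compute Levenshtein distances by default):

    ```r
    v1 <- c("byye", "alllla")
    v2 <- c("bzyte", "byte")

    # element-wise distances, the way stringdist(v1, v2) pairs them;
    # adist(v1, v2) itself returns the full cross matrix, so take its diagonal
    paired <- diag(adist(v1, v2))

    # Map makes the element-by-element pairing explicit
    mapped <- unlist(Map(function(a, b) adist(a, b), v1, v2))

    paired
    mapped
    ```

    Both give the same pair of distances, confirming the element-wise (not cross-product) behaviour.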
    

    2 - Rewrite your function so that the new function keeps this vectorized behaviour:

    stringdistFast <- function(word1, word2)
    {
        d1 = stringdist(word1, word2)
        #distance to word2 with its two words swapped around
        d2 = stringdist(word1, gsub("(.+) (.+)", "\\2 \\1", word2))
    
        #if the swap changes nothing (monograms), add the 5-point penalty
        #when the first characters differ; otherwise keep the smaller distance
        ifelse(d1==d2,d1+5*(substr(word1,1,1)!=substr(word2,1,1)),pmin(d1,d2))
    }
    

    It is indeed working the same way:

    #> stringdistFast("byye", c('bzyte','byte'))
    #[1] 2 1
    
    #> stringdistFast("by ye", c('bzyte','byte','ye by'))
    #[1] 3 2 0
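
    To check the monogram penalty from the question (brave/crave should be further apart than brave/bravery), here is the same logic sketched with base R's adist() in place of stringdist() - an assumption, but both default to Levenshtein distance:

    ```r
    stringdistFastBase <- function(word1, word2) {
        d1 <- as.vector(adist(word1, word2))
        d2 <- as.vector(adist(word1, gsub("(.+) (.+)", "\\2 \\1", word2)))
        # penalty of 5 when swapping changes nothing and first characters differ
        ifelse(d1 == d2, d1 + 5 * (substr(word1, 1, 1) != substr(word2, 1, 1)),
               pmin(d1, d2))
    }

    stringdistFastBase("brave", c("crave", "bravery"))
    # crave: distance 1 + penalty 5 = 6; bravery: distance 2, no penalty
    ```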
    

    3 - Rewrite the distance matrix function with only one loop, filling only the lower triangular part (no outer there, it's slow!):

    stringdistmatFast <- function(test)
    {
        #start from a zero matrix and fill the lower triangle column by column
        m = diag(0, length(test))
        sapply(1:(length(test)-1), function(i)
        {
            #one vectorized call per column, zero-padded above the diagonal
            m[,i] <<- c(rep(0,i), stringdistFast(test[i],test[(i+1):length(test)]))
        })
    
        #mirror the lower triangle and label rows/columns with the strings
        `dimnames<-`(m + t(m), list(test,test))
    }
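
    The same fill pattern can be sketched with base R only - a plain for loop instead of sapply with <<-, and adist() standing in for the custom distance (both assumptions, just to make the one-sided triangular fill concrete):

    ```r
    words <- c("abc", "abd", "xyz")
    m <- diag(0, length(words))
    for (i in 1:(length(words) - 1)) {
        # fill column i below the diagonal in one vectorized call
        m[(i + 1):length(words), i] <- adist(words[i], words[(i + 1):length(words)])
    }
    m <- m + t(m)          # mirror the lower triangle to make it symmetric
    dimnames(m) <- list(words, words)
    m
    ```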
    

    4 - Use the function:

    #> stringdistmatFast(test)
    #            Hello World World Hello Hello Word Cello Word
    #Hello World           0           0          1          2
    #World Hello           0           0          1          2
    #Hello Word            1           1          0          1
    #Cello Word            2           2          1          0
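
    Since the question's goal is clustering, the resulting symmetric matrix plugs straight into hclust() via as.dist(). In this sketch the values are copied from the output above rather than recomputed, so it runs with base R alone:

    ```r
    test <- c("Hello World", "World Hello", "Hello Word", "Cello Word")
    m <- matrix(c(0, 0, 1, 2,
                  0, 0, 1, 2,
                  1, 1, 0, 1,
                  2, 2, 1, 0),
                nrow = 4, dimnames = list(test, test))

    hc <- hclust(as.dist(m), method = "average")
    cutree(hc, k = 2)
    # the two word-order variants of "Hello World" land in the same cluster
    ```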