Tags: r, matrix, levenshtein-distance, mapply

Calculate levenshteinDist between rownames and colnames using mapply


I want to calculate the Levenshtein distance (levenshteinDist) between the rownames and colnames of a matrix using the mapply function, because my matrix is very large and a nested for loop takes a very long time to produce the result.

Here's the old code with the nested loop:

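# levenshteinDist() is assumed to come from the RecordLinkage package
mymatrix <- matrix(NA, nrow = ncol(dataframe), ncol = ncol(dataframe),
                   dimnames = list(colnames(dataframe), colnames(dataframe)))
# Similarity in [0, 1]: 1 minus the distance divided by the longer string's length
distfunction <- function(text1, text2) {
  1 - levenshteinDist(text1, text2) / max(nchar(text1), nchar(text2))
}
# Fill every cell with the similarity (as a percentage) of its row and column names
for (i in seq_len(nrow(mymatrix))) {
  for (j in seq_len(ncol(mymatrix))) {
    mymatrix[i, j] <- distfunction(rownames(mymatrix)[i], colnames(mymatrix)[j]) * 100
  }
}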

I tried to replace the nested loop with mapply:

   mapply(distfunction,mymatrix)

It gave me this error:

   Error in typeof(str2) : argument "text2" is missing, with no default

My plan was to first get levenshteinDist working on the matrix and then work out how to apply my own function.

Is it possible?

Thank you.


Solution

  • mapply cannot be used like this. It takes two (or more) input vectors and applies the function to their first elements, then their second elements, and so on. You want the function applied to all combinations of row and column names.
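
    In mapply(distfunction, mymatrix) only text1 receives values, which is why text2 is reported as missing. As a minimal sketch of the pairing behaviour (the vectors a and b are hypothetical, not from the question), this shows what mapply does by default and how much extra setup it needs to cover all combinations:

```r
# Hypothetical example vectors, not taken from the question
a <- c("x", "y", "z")
b <- c("p", "q", "r")

# mapply pairs elements by position: (x, p), (y, q), (z, r)
paired <- mapply(function(s, t) paste0(s, t), a, b, USE.NAMES = FALSE)
print(paired)

# All combinations need an explicit grid of index pairs first
grid <- expand.grid(i = seq_along(a), j = seq_along(b))
all_pairs <- mapply(function(i, j) paste0(a[i], b[j]), grid$i, grid$j)

# expand.grid varies i fastest, which matches column-major matrix filling
result <- matrix(all_pairs, nrow = length(a), dimnames = list(a, b))
print(result)
```

    The grid version works, but it is more roundabout than the alternatives in the rest of this answer.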

    You could try a nested sapply:

    sapply(colnames(mymatrix), function(col) 
      sapply(rownames(mymatrix), function(row) 
        distfunction(row, col)))*100
    

    Simple usage example

    sapply(1:3, function(x) sapply(1:4, function(y) x*y))
    

    Output:

         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    2    4    6
    [3,]    3    6    9
    [4,]    4    8   12
    

    Update

    Even better is to use outer, but I think your distfunction is not vectorized (because of max, which collapses its arguments to a single value). So wrap it with Vectorize:

    distfunction_vec <- Vectorize(distfunction)
    outer(rownames(mymatrix), colnames(mymatrix), distfunction_vec) * 100
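
    To see why max gets in the way, here is a small sketch with plain numbers (x and y are hypothetical inputs). outer calls the function once with two expanded vectors and expects one result per pair, so a function that returns a single value fails:

```r
x <- 1:3
y <- 1:4

# outer() passes two length-12 vectors to FUN and expects 12 values back.
# max() collapses them to one number, so outer() raises an error.
bad <- tryCatch(outer(x, y, function(a, b) max(a, b)),
                error = function(e) "error")
print(bad)

# pmax() works elementwise and returns one value per pair
good <- outer(x, y, function(a, b) pmax(a, b))
print(good)
```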
    

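    I'm not sure about the performance penalty, though: Vectorize is a wrapper around mapply, so distfunction is still called once per pair. It would be better to vectorize the function directly (probably by replacing max with pmax).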
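
    A possible fully vectorized version, as a sketch only: it swaps levenshteinDist for adist from base R (the utils package), which returns the whole matrix of Levenshtein distances in one call. The names in nms are hypothetical stand-ins for colnames(dataframe), and non-empty strings are assumed so that the division is safe.

```r
# Hypothetical names standing in for colnames(dataframe)
nms <- c("height", "weight", "heigth")

# adist() computes the Levenshtein distance for every pair at once
d <- adist(nms, nms)

# Length of the longer string in each pair, computed elementwise with pmax
len <- outer(nchar(nms), nchar(nms), pmax)

# Same formula as distfunction, applied to whole matrices
sim <- (1 - d / len) * 100
dimnames(sim) <- list(nms, nms)
print(round(sim, 1))
```

    No explicit loop or per-pair function call is left here, so this should scale better than the Vectorize wrapper, although I have not benchmarked it.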