Tags: performance, matlab, levenshtein-distance

Optimize speed of Levenshtein distance of many words


I have a cell array dictionary that contains many words (ca. 15000).

I want to compute the function strdist (which calculates the Levenshtein distance) for every pair of words. I tried two approaches, but both are really slow. What would be a more efficient solution?

Here is my code (dict_keys is my cell array of length m):

1)

matrix = sparse(m,m);
for i = 1:m-1
    matrix(i,:) = cellfun(@(u) strdist(dict_keys{i}, u), dict_keys);
end

2)

matrix = sparse(m,m);
for i = 1:m-1
    for j = i+1:m
        matrix(i,j) = strdist(dict_keys{i}, dict_keys{j});
    end
end

Solution

  • Function 'strdist' is not a built-in MATLAB function, so I guess you took it from the File Exchange. That's also why both your approaches take roughly the same time: cellfun internally just expands into a loop.

    If strdist is symmetric, i.e. strdist(a,b) == strdist(b,a), you can however save half the computations. This seems to be the case, so only calculate the cases with j > i in the second loop (which you are already doing); a sketch of recovering the full matrix from that triangle follows below.
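
    A minimal sketch (assuming the File Exchange strdist and that the distance of a word to itself is 0): fill only the upper triangle, then mirror it.

        matrix = sparse(m,m);
        for i = 1:m-1
            for j = i+1:m
                matrix(i,j) = strdist(dict_keys{i}, dict_keys{j});
            end
        end
        matrix = matrix + matrix.';   % mirror the upper triangle; the diagonal stays 0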

    Otherwise you could implement strdist in C as a MEX function and probably see a significant speed improvement. A C implementation of the Levenshtein distance can be found, for example, at rosettacode.org.
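
    The MATLAB side of the MEX route is just a compile-and-call step. A hedged sketch (the file and function name levenshtein_mex are placeholders, not something provided by MATLAB or the File Exchange):

        mex levenshtein_mex.c                             % compile the C gateway once
        d = levenshtein_mex(dict_keys{1}, dict_keys{2});  % then call it like any other function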

    Or dig into the details of how the algorithm computes the distance of two strings and see whether you can vectorize it and reduce the runtime from quadratic to something less. This, however, is probably not very easy.
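
    For reference, the core dynamic programme (Wagner-Fischer) looks roughly like the sketch below; this is an assumption about what strdist does internally, not its actual code. Note that D(i+1,j+1) depends on D(i+1,j) from the same row, which is exactly what makes vectorizing the inner loop awkward.

        function d = lev_sketch(a, b)
            % plain MATLAB sketch of the Levenshtein recurrence
            na = numel(a); nb = numel(b);
            D = zeros(na+1, nb+1);
            D(:,1) = (0:na).';        % delete all of a
            D(1,:) = 0:nb;            % insert all of b
            for i = 1:na
                for j = 1:nb
                    cost = double(a(i) ~= b(j));
                    D(i+1,j+1) = min([D(i,j+1) + 1, ...   % deletion
                                      D(i+1,j) + 1, ...   % insertion
                                      D(i,j) + cost]);    % substitution / match
                end
            end
            d = D(end,end);
        end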

    Finally, if you have the Parallel Computing Toolbox licensed and a multicore CPU, you can easily parallelize your code, since the strdist calls are completely independent of each other; see the sketch below.
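
    A hedged sketch of the parallel version (assuming the File Exchange strdist and an open parallel pool): each row of the upper triangle is computed independently and the matrix is assembled afterwards.

        rows = cell(m,1);                     % one upper-triangle slice per word
        parfor i = 1:m-1
            w = dict_keys{i};
            rows{i} = cellfun(@(u) strdist(w, u), dict_keys(i+1:m));
        end
        matrix = sparse(m,m);
        for i = 1:m-1
            matrix(i,i+1:m) = rows{i}(:).';   % orientation-safe row assignment
        end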