I have a cell array dictionary containing a lot of words (ca. 15000).
I want to compute the function strdist
(which calculates the Levenshtein distance) for all pairs of words. I tried two approaches, but both are really slow. What would be a more efficient solution?
Here is my code (dict_keys is my cell array of length m):
1)

    matrix = sparse(m,m);
    for i = 1:m-1
        matrix(i,:) = cellfun( @(u) strdist(dict_keys{i},u), dict_keys );
    end
2)

    matrix = sparse(m,m);
    for i = 1:m-1
        for j = i+1:m
            matrix(i,j) = strdist(dict_keys{i},dict_keys{j});
        end
    end
The function strdist is not a built-in MATLAB function, so I guess you took it from the File Exchange. That is also why both of your approaches take roughly the same time: cellfun internally just expands into a loop.
If strdist is symmetric, i.e. strdist(a,b) == strdist(b,a), you can save half the computations. This is the case for the Levenshtein distance, so only calculate the cases j > i in the inner loop (which your second version already does).
Otherwise, you could implement strdist in C as a MEX function and probably see a significant speed improvement. A C implementation of the Levenshtein distance can be found, for example, at rosettacode.org.
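As a rough sketch of what the core of such a MEX function could look like (the function name and the rolling single-row memory layout are my choices here, not taken from any particular File Exchange or Rosetta Code version), the standard dynamic-programming recurrence in C:

```c
#include <stdlib.h>
#include <string.h>

/* Classic dynamic-programming Levenshtein distance:
   O(len1*len2) time, O(len2) memory via one rolling row. */
static int levenshtein(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1), len2 = strlen(s2);
    int *row = malloc((len2 + 1) * sizeof *row);

    /* Distance of the empty prefix of s1 to each prefix of s2. */
    for (size_t j = 0; j <= len2; j++)
        row[j] = (int)j;

    for (size_t i = 1; i <= len1; i++) {
        int prev = row[0];        /* value of row[j-1] from the previous row */
        row[0] = (int)i;          /* distance of s1(1:i) to the empty string */
        for (size_t j = 1; j <= len2; j++) {
            int cur = row[j];
            int cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            int best = prev + cost;                         /* substitution */
            if (row[j] + 1 < best)     best = row[j] + 1;   /* deletion */
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1; /* insertion */
            row[j] = best;
            prev = cur;
        }
    }

    int d = row[len2];
    free(row);
    return d;
}
```

A mexFunction gateway around this would only need to extract the two input strings (e.g. with mxArrayToString) and return the result as a scalar double.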
Or dig into the details of how the algorithm computes the distance of two strings and see whether you can vectorize it and reduce the runtime below quadratic. This, however, is probably not easy.
Finally, if you have the Parallel Computing Toolbox licensed and a multicore CPU, you can easily parallelize your code (for example by turning the outer for into a parfor), since the strdist calls are completely independent of each other.