Search code examples
stringalgorithmlevenshtein-distancesimilarityedit-distance

Algorithm to find edit distance to all substrings


Given 2 strings s and t. I need to find for each substring in s edit distance(Levenshtein distance) to t. Actually I need to know for each i position in s what is the minimum edit distance for all substrings started at position i.

For example:

t = "ab"    
s = "sdabcb"

And I need to get something like:

{2,1,0,2,2}

Explanation:

1st position:
distance("ab", "sd") = 4 ( 2*subst )
distance("ab", "sda") = 3( 2*delete + insert )
distance("ab", "sdab") = 2 ( 2 * delete)
distance("ab", "sdabc") = 3 ( 3 * delete)
distance("ab", "sdabcb") = 4 ( 4 * delete)
So, minimum is 2

2nd position:
distance("ab", "da") = 2 (delete + insert)
distance("ab", "dab") = 1 (delete)
distance("ab", "dabc") = 2 (2*delete)
....
So, minimum is 1

3th position:
distance("ab", "ab") = 0
...
minimum is 0

and so on.

I can use brute force algorithm to solve this task, of course. But is there faster algorithm?

Thanks for help.


Solution

  • The Wagner-Fischer algorithm gives you the answer for all prefixes "for free".

    http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

    The last row of the Wagner-Fischer matrix contains the edit distance from each prefix of s to t.

    So as a first crack at your problem, for each i, run Wagner-Fischer and select the smallest element in the last row.

    I will be curious to see if anyone else knows (or can find) a better approach.