Search code examples
algorithmlevenshtein-distance

Question on Levenshtein distance


1) Why do we add 1 on these line?

    d[i-1, j] + 1, // deletion 
    d[i, j-1] + 1, // insertion 

The line

if s[i] = t[j] then cost := 0

        else cost := 1 

should take into account deleted/lower word lengths, or am I missing something?

2) Also, the comments state deletion and insertion. Am I right in thinking that it's checking for deleted characters in both words (the integers j/i representing the length of words), because a lower value will represent deleted characters.

The code used is here (because it is pseudo code and I have no language specific issues, this thread is not in any language category):

http://www.iterasi.net/openviewer.aspx?sqrlitid=z0cloj7xhk-ce0f72v4cjq


Solution

  • 1) These lines compute the distance in the case of deletion, in the case of insertion, and the one using "cost" in case of a substitution...

    deletion and insertion effectively count as "1" in the distance calculation, hence the +1.

    We can believe there was a substitution only if the characters are different hence the "cost=0" if both chars are equal...

    The new distance is then the minimum distance between these 3 hypothesis so you don't always add 1 ...

    2) if I compute the distance between "FooBar" and "FoBaWhatever" I have some character deletions even if the second string is longer than the first one ...

    Of course if the second string is shorter than the second ( FooBar -> FoBa ) I will find some deletions but cannot know in advance where they are...