Search code examples
algorithmstanford-nlplevenshtein-distance

Why the definition of edit Distance algorithm in Stanford NLP course plus 2 not 1


I am studying the Stanford NLP course by the following slides:https://web.stanford.edu/class/cs124/lec/med.pdf. The definition of edit Distance algorithm in this slide as following:

Initialization

D(i,0) = i
D(0,j) = j

Recurrence Relation:

 For each  i = 1…M
    For each  j = 1…N


       D(i,j)= min  {D(i-1,j) + 1, D(i,j-1) + 1, 
                     D(i-1,j-1) +   2(if X(i) ≠ Y(j) )  
                                    0(if X(i) = Y(j))}

why D(i-1,j-1) + 2 not (+1), if X(i) ≠ Y(j). I found the definition of edit Distance algorithm in the wikipedia is '+1':https://en.wikipedia.org/wiki/Levenshtein_distance. Could you guys explain the difference of these two definition, please. I am a new guy to NLP. Thanks!


Solution

  • When editing one string, what is the minimal number of changes you need to do in order to get another string?

    This is a general, not specific, definition for editing distance. To get a precise definition, you need to define what a "change" is.

    • If a "change" includes "replacing one letter by another", you have +1 in your definition
    • If a "change" doesn't include "replacing one letter by another", you have +2 in your definition

    Example: how many changes do you need to turn feh into fah?

    • One change - just replace e by a
    • Two changes - remove e; then add a in the same place

    Both answers are useful, and lead to two slightly different definitions of editing distance.