Search code examples
vbalevenshtein-distance

Debugging Levenshtein distance implementation - how is the min distance being calculated?


I have been trying to understand this code for while but I can't get how the distances are calculated. I know how the algorithm should work (it provides the edit distance in characters between two strings and I can do it on paper), but I don't understand how this line (as an example)

d(i, j) = d(i - 1, j - 1)

comes back as an integer. Same for min1, min2. We already define

d(i,0)

and

d(0,j)

, but how does it get a value when

d(i,j)

is not 0 in one of the two arguments?

Code:

Option Explicit
Public Function Levenshtein(s1 As String, s2 As String)

Dim i As Integer
Dim j As Integer
Dim l1 As Integer
Dim l2 As Integer
Dim d() As Integer
Dim min1 As Integer
Dim min2 As Integer

l1 = Len(s1)
l2 = Len(s2)
ReDim d(l1, l2)
For i = 0 To l1
    d(i, 0) = i
Next
For j = 0 To l2
    d(0, j) = j
Next
For i = 1 To l1
    For j = 1 To l2
        If Mid(s1, i, 1) = Mid(s2, j, 1) Then
            d(i, j) = d(i - 1, j - 1)   
        Else
            min1 = d(i - 1, j) + 1
            min2 = d(i, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            min2 = d(i - 1, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            d(i, j) = min1
        End If
    Next
Next
Levenshtein = d(l1, l2)
End Function

?Levenshtein("saturday","sunday")

3


Solution

  • I don't have much clue about Levenshtein, however I can explain what's going on in that array, please see comments in the code:

    l1 = Len(s1) 'this sets a variable with the length of the first string
    l2 = Len(s2) 'this sets a variable with the length of the second string
    
    ReDim d(l1, l2) 'redimension the array to this size, though this can be seen as ReDim d(0 to l1, 0 to l2)
    
    For i = 0 To l1 'for i = 0 to length of first string
        d(i, 0) = i 'allocate the value of i to each row in the array, at col 0
    Next i
    
    For j = 0 To l2 'for j = 0 to length of second string
        d(0, j) = j 'allocate the value of j to each column in the array, at row 0
    Next j
    

    EDIT: I've added some debugging (will print to ActiveSheet). Blue for where the letters match, red for the others... though the colour doesn't matter.

    As far as I can tell, with each loop iteration, it builds a matrix of the differences, starting from comparing 1:1 first letters, till the last. Based on the loop position, and with each iteration, it will use previous position values to calculate the difference at current position.

    This is probably easier to follow with shorter strings.

    • Loop through each letter in second word, then loop through each letter in first word

    • at first outer loop (for each row), first inner loop (for each column), we have S meet S = 0. (d(i, j) = d(i - 1, j - 1) evaluates to d(i, j) = d(1 - 1, 1 - 1), i.e.: 0)

    • at first outer loop, second inner loop, we have S meet U = 1.

    • etc, etc.

    enter image description here

    Other than that, step through the code and see how the variables change based on the if conditions... not sure i can explain this any better.

    Public Function Levenshtein(s1 As String, s2 As String)
    
    Dim i As Integer, j As Integer
    Dim l1 As Integer, l2 As Integer
    Dim min1 As Integer, min2 As Integer
    Dim d() As Integer
    
    'For debugging purposes only
    Cells.Clear
    Dim rngOutput As Range: Set rngOutput = ActiveSheet.Range("A1").Resize(Len(s1) + 2, Len(s2) + 2)
    
    With rngOutput
        .ColumnWidth = 3
        .HorizontalAlignment = xlCenter
        .VerticalAlignment = xlCenter
    End With
    
    l1 = Len(s1): l2 = Len(s2): ReDim d(l1, l2)
    
    For i = 0 To l1
        d(i, 0) = i
        With rngOutput
            .Cells(i + 3, 1) = Mid(s1, i + 1, 1)
            If Not i = 0 Then .Cells(i + 2, 2) = i
        End With
    Next i
    
    For j = 0 To l2
        d(0, j) = j
        With rngOutput
            .Cells(1, j + 3) = Mid(s2, j + 1, 1)
            If Not j = 0 Then .Cells(2, j + 2) = j
        End With
    Next j
    
    For i = 1 To l1
        For j = 1 To l2
            If Mid(s1, i, 1) = Mid(s2, j, 1) Then
                d(i, j) = d(i - 1, j - 1)
    
                With rngOutput.Cells(i + 2, j + 2)
                    .Value = d(i, j)
                    .Font.Color = vbBlue
                End With
            Else
                min1 = d(i - 1, j) + 1
                min2 = d(i, j - 1) + 1
                If min2 < min1 Then
                    min1 = min2
                End If
                min2 = d(i - 1, j - 1) + 1
                If min2 < min1 Then
                    min1 = min2
                End If
                d(i, j) = min1
    
                With Cells(i + 2, j + 2)
                    .Value = d(i, j)
                    .Font.Color = vbRed
                End With
            End If
        Next
    Next
    
    Levenshtein = d(l1, l2)
    End Function