Search code examples
pythonalgorithmdynamic-programminglevenshtein-distanceedit-distance

Python Edit distance algorithm with dynamic programming and 2D Array - Output is not optimal


I have encountered the edit distance (Levenshtein distance) problem. I have looked at other similar stackoverflow questions, and is certain that my question is distinct from them - either from the language used or the approach.

I have used a 2D array that compares the two strings, and dynamic programming to store previous values. If i and j in the string indices match, it would output 0, as we don't need to do anything; else, the output is 1. It is as shown in the picture below, the orange arrow represents a match.

(Code below is edited after suggestions from answers)

def edit_distance(source, target):
    n = len(target)+1
    m = len(source)+1

    # let D denote the 2-dimensional array, m is the column, n is row
    D = [ [0]*m for _ in range(n)] 

    # the loop inside is the target string, we operate this
    # while the loop outside is the source string

    for j in range(0, m):
        for i in range(0, n):
            if target[i-1] == source[j-1]:
                # match, insertion and deletion, find the path with least move
                D[i][j] = min(D[i-1][j-1], D[i-1][j]+1, D[i][j-1]+1)

            else:
                # mismatch, insertion and deletion, find the path with least move
                D[i][j] = min(D[i-1][j-1]+1, D[i-1][j]+1, D[i][j-1]+1)
            
    return D[n-1][m-1]

print(edit_distance("distance", "editing"))

enter image description here

However, the final output was 8 in my code, while the optimal editing distance between the strings "editing" and "distance" should be 5, and I am very confused. Could you please help with it from the approach of dynamic programming?


Solution

  • Ah, looks like I have found a solution, that I'll have to answer my own question now. (I'm still confused with some parts, and am only answering this question to briefly introduce the new implementation, as to save the time of other kind helpers)

    So firstly, I have missed a condition in the original code, that is, what if one of the two string inputs are empty? Then we'll have to insert everything from the other string. Henceforth, the optimal editing distance is just the length of this other string.

        if i == 0:
            D[i][j] = j    
        elif j == 0:
            D[i][j] = i 
    

    Also, regarding the original for-loop of the code, I learnt my mistakes from GeeksforGeeks. If my understanding is correct, they are saying that if two indices (i and j) are consistent, all we need to do is to move diagonally upward on the graph (i-1 and j-1) without adding any counts.

    Else if the indices do not match, we move either to the direction of i-1, j-1 or diagonally up dependently. I was right on this, apart from the fact the count is added after the move, whereas I have added them during the move.

    I am still a bit unsure with how it worked, however I'll compare the two algorithms below, would be appreciated if someone could explain it further in the comments.

    My original for-loop (present in the question)

    for j in range(0, m):
        for i in range(0, n):
            if target[i-1] == source[j-1]:
                D[i][j] = min(D[i-1][j-1], D[i-1][j]+1, D[i][j-1]+1)
    
            else:
                D[i][j] = min(D[i-1][j-1]+1, D[i-1][j]+1, D[i][j-1]+1)
    

    And below is the new for-loop, whose output is correct after testing:

            if target[i-1] == source[j-1]:
                D[i][j] = D[i-1][j-1]
    
            else:
                D[i][j] = 1 + min(D[i][j-1], D[i-1][j], D[i-1][j-1])
    

    Would be appreciated if someone could further explain how did this work, as I still only have a superfacial understanding of the new code

    Final code:

    def edit_distance(target, source):
    
        m = len(target)+1
        n = len(source)+1
        D = [[0 for x in range(n)] for x in range(m)]
    
    
        for i in range(m):
            for j in range(n):
    
                if i == 0:
                    D[i][j] = j   
                elif j == 0:
                    D[i][j] = i 
    
                elif target[i-1] == source[j-1]:
                    D[i][j] = D[i-1][j-1]
                else:
                    D[i][j] = 1 + min(D[i][j-1], D[i-1][j], D[i-1][j-1])    
    
        return D[m-1][n-1]
    
    print(edit_distance("distance", "editing"))
    # output = 5, which is correct