Search code examples
algorithmdynamic-programminglevenshtein-distance

Can every solution of the edit distance problem be shown on the matrix obtained from the dynamic programming algorithm?


I watched a video on Youtube explaining how to calculate which operations are used and how many times by going back to 1 from the last cell of the matrix obtained from the dynamic programming algorithm. Although I understood the example in the video, when I tried to do it with other examples, I couldn't. Is it possible to show every solution on the matrix? I also added the code that calculates the matrix at the end.

Let's assume our words are "apple" and "paper".

Matrix obtained from dynamic programming algorithm (I used Levensthein Distance operations.):

Minimum edit distance of "apple" and "paper" is 4.

One of the solutions is 4 replace operations.

1. apple -> ppple (replace "a" with "p")

2. ppple -> paple (replace "p" with "a")

3. paple -> papee (replace "l" with "e")

4. papee -> paper (replace "e" with "r")

Solution on matrix:

enter image description here

Is it possible to show other solutions on this matrix?

For example:

1. apple -> papple (insert "p")

2. papple -> paple (remove "p")

3. paple -> pape (remove "l")

4. pape -> paper (insert "r")

code:

def min_edit_count(word1, word2):

    word1 = ' ' + word1     #
    word2 = ' ' + word2     # add a space before original words

    len_w1 = len(word1)     #
    len_w2 = len(word2)     # calculate the lengths of new words

    edit_matrix = np.zeros((len_w2, len_w1), dtype = int)
    # create a matrix with all zeros

    edit_matrix[0, :] = range(len_w1)  
    # assign numbers from 0 to len_w1 in the first row of the edit_matrix 
    edit_matrix[:, 0] = range(len_w2)
    # assign numbers from 0 to len_w2 in the first column of the edit_matrix 

    for i in range(1, len_w2):
        for j in range(1, len_w1):
            # edit_matrix[i-1][j] --> remove
            # edit_matrix[i][j-1] --> insert
            # edit_matrix[i-1][j-1] --> replace

            temp1 = edit_matrix[i-1][j] + 1
            temp2 = edit_matrix[i][j-1] + 1
            # add 1 to edit_matrix[i-1][j] and edit_matrix[i][j-1]

            temp3 = edit_matrix[i-1][j-1] + 1 if word1[j] != word2[i] else edit_matrix[i-1][j-1]
            # if last characters are same don't add 1 to edit_matrix[i-1][j-1].
            # no need to replace

            edit_count = min(temp1, temp2, temp3)
            # find min between three numbers
            edit_matrix[i][j] = edit_count

    min_edit = int(edit_matrix[len_w2 - 1, len_w1 - 1])
    # minimum edit count is the last number calculated
    
    return min_edit, edit_matrix

Solution

  • Based on Sorin's good answer, here is a slightly enhanced version that is fixed to fit your indexing and also prints the edit operations that need to be done in each case:

    def get_solutions(edit_matrix, word1, word2):
      pos = []
      sol = []
      def backtrack(i,j):
        pos.insert(0, (i,j))
        if i==0 and j==0:
          # This is a solution
          print(str(pos) + ": " + str(sol))
        if i>0 and edit_matrix[i-1,j] + 1 == edit_matrix[i,j]:
          sol.insert(0, "insert " + word2[i])
          backtrack(i-1,j)
          sol.pop(0)
        if j>0 and edit_matrix[i,j-1] + 1 == edit_matrix[i,j]:
          sol.insert(0, "remove " + word1[j])
          backtrack(i, j-1)
          sol.pop(0);
        if i>0 and j>0 and edit_matrix[i-1][j-1] + 1 if word1[j] != word2[i] else edit_matrix[i-1][j-1] == edit_matrix[i,j]:
          if (word1[j] != word2[i]):
            sol.insert(0, "replace " + word1[j] + " with " + word2[i])
          else:
            sol.insert(0, "skip")
          backtrack(i-1,j-1)
          sol.pop(0)
        pos.pop(0)
      word1 = ' ' + word1
      word2 = ' ' + word2
      backtrack(len(word2) - 1,len(word1) - 1)
    

    An example output for

    count, matrix = min_edit_count("apple", "paper")
    get_solutions(matrix, "apple", "paper")
    

    then looks as follows:

    [(0, 0), (1, 0), (2, 1), (3, 2), (3, 3), (3, 4), (4, 5), (5, 5)]: ['insert p', 'skip', 'skip', 'remove p', 'remove l', 'skip', 'insert r']
    [(0, 0), (0, 1), (1, 2), (2, 2), (3, 3), (3, 4), (4, 5), (5, 5)]: ['remove a', 'skip', 'insert a', 'skip', 'remove l', 'skip', 'insert r']
    [(0, 0), (1, 0), (2, 1), (2, 2), (3, 3), (3, 4), (4, 5), (5, 5)]: ['insert p', 'skip', 'remove p', 'skip', 'remove l', 'skip', 'insert r']
    [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4), (4, 5), (5, 5)]: ['replace a with p', 'replace p with a', 'skip', 'remove l', 'skip', 'insert r']
    [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 5)]: ['remove a', 'skip', 'replace p with a', 'replace l with p', 'skip', 'insert r']
    [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (5, 5)]: ['insert p', 'skip', 'skip', 'replace p with e', 'replace l with r', 'remove e']
    [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (4, 4), (5, 5)]: ['insert p', 'skip', 'skip', 'replace p with e', 'remove l', 'replace e with r']
    [(0, 0), (1, 0), (2, 1), (3, 2), (3, 3), (4, 4), (5, 5)]: ['insert p', 'skip', 'skip', 'remove p', 'replace l with e', 'replace e with r']
    [(0, 0), (0, 1), (1, 2), (2, 2), (3, 3), (4, 4), (5, 5)]: ['remove a', 'skip', 'insert a', 'skip', 'replace l with e', 'replace e with r']
    [(0, 0), (1, 0), (2, 1), (2, 2), (3, 3), (4, 4), (5, 5)]: ['insert p', 'skip', 'remove p', 'skip', 'replace l with e', 'replace e with r']
    [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]: ['replace a with p', 'replace p with a', 'skip', 'replace l with e', 'replace e with r']
    

    //fun: you can try to see why every line has the same length :D