Search code examples
levenshtein-distanceedit-distance

Levenshtein distance with substitution, deletion and insertion count


There's a great blog post here https://davedelong.com/blog/2015/12/01/edit-distance-and-edit-steps/ on Levenshtein distance. I'm trying to implement this to also include counts of subs, dels and ins when returning the Levenshtein distance. Just running a smell check on my algorithm.

def get_levenshtein_w_counts(s1: str, s2: str):
    row_dim = len(s1) + 1  # +1 for empty string
    height_dim = len(s2) + 1

    # tuple = [ins, del, subs]
    # Moving across row is insertion
    # Moving down column is deletion
    # Moving diagonal is sub
    matrix = [[[n, 0, 0] for n in range(row_dim)] for m in range(height_dim)]

    for i in range(1, height_dim):
        matrix[i][0][1] = i

    for y in range(1, height_dim):
        for x in range(1, row_dim):
            left_scores = matrix[y][x - 1].copy()
            above_scores = matrix[y - 1][x].copy()
            diagonal_scores = matrix[y - 1][x - 1].copy()

            scores = [sum_list(left_scores), sum_list(diagonal_scores), sum_list(above_scores)]
            min_idx = scores.index(min(scores))

            if min_idx == 0:
                matrix[y][x] = left_scores
                matrix[y][x][0] += 1
            elif min_idx == 1:
                matrix[y][x] = diagonal_scores
                matrix[y][x][2] += (s1[x-1] != s2[y-1])
            else:
                matrix[y][x] = above_scores
                matrix[y][x][1] += 1

    return matrix[-1][-1]

So according to the blog post if you make a matrix where the row is the first word + and empty str and the column is the 2nd word plus an empty string. You store the edit distance at each index. Then you get the smallest from the left, above and diagonal. If the min is diagonal then you know you're just adding 1 sub, if the min is from the left then you're just adding 1 insertion. If the min is from above then you're just deleting 1 character.

I think I did something wrong cause get_levenshtein_w_counts("Frank", "Fran") returned [3, 2, 2]


Solution

  • The problem was that Python does address passing for objects so I should be cloning the lists to the variables rather than doing a direct reference.