Search code examples
algorithmedit-distance

Edit distance explanation


I have seen a lot of code to solve that but I am not able to see why they are using a matrix to represent the distance between two words. Can any one please explain to me?

Here is a sample code I found:

public static int minDistance(String word1, String word2)
{
    int l1 = word1.length(), l2 = word2.length();

    int[][] d = new int[l1 + 1][l2 + 1];

    // the edit distance between an empty string and the prefixes of
    // word2
    for (int i = 0; i < l2 + 1; i++) {
        d[0][i] = i;
    }

    // the edit distance between an empty string and the prefixes of
    // word1
    for (int j = 0; j < l1 + 1; j++) {
        d[j][0] = j;
    }

    for (int i = 1; i < l1 + 1; i++) {
        for (int j = 1; j < l2 + 1; j++) {
            if (word1.charAt(i - 1) == word2.charAt(j - 1)) {
                d[i][j] = d[i - 1][j - 1];
            } else {
                d[i][j] = min(1 + d[i][j - 1], 1 + d[i - 1][j],
                1 + d[i - 1][j - 1]); // min of insertion,
                // deletion, replacement
            }
        }
    }

    return d[l1][l2];
}

Solution

  • The matrix contains [at the end] the edit distances between all the prefixes of word1 and all the prefixes of word2.

    d[i][j] = edit distance between word1[0..(i-1)] and word2[0..(j-1)]
    

    You are interested in d[l1][l2]. In general, to compute d[i][j], you need to look at the three smaller neighbours d[i-1][j], d[i-1][j-1] and d[i][j-1]. So transitively, d[i][j] requires all entries where at least one coordinate is smaller than i respj (and the other not larger). The exception is when two characters word1[i-1] and word2[j-1] are equal, in which case you only need the diagonal smaller neighbour.

    If you first fill the matrix with -1 to indicate the corresponding edit distance between the prefixes has not been evaluated, and recursively calculate d[l1][l2], using the cached value of a needed d[i][j] if it has already been computed, computing it recursively and storing its value if not, some regions of the matrix may remain untouched. Possibly large regions if there are many pairs of equal characters [only the diagonal will be evaluated if the two words are equal], only small regions if any if there are few pairs of equal characters.

    In the generic case, most of the matrix is needed to compute d[l1][l2], so then it is faster to compute the matrix completely using the simple algorithm than to recur and compute only the actually required values.

    If you don't store the values for the shorter prefixes, since they are transitively required to compute d[i][j], they would have to be recalculated for each way d[i-a][j-b] is reached from d[i][j]. Since d[i-a][j-b] can be reached in many ways from d[i][j], that would cause a lot of recalculation leading to a grossly inefficient algorithm in general.

    Since the computation for each row only uses the previous row, you could do with just two arrays of length min{l1, l2} + 1 to save some memory, but unless the words are really long, it doesn't make a big difference, and the code is simpler with the full array.