I am trying to improve an existing javascript levenstein distance calculation source code to generate a martix with not only value of the current setps, but also with actions taken (insert, replace, delete or match)
I get the wrong results in an "actions" matrix:
in the algorithm we see that (not js, from wikipedia):
d[i, j] := minimum
(
d[i-1, j] + 1, // a deletion
d[i, j-1] + 1, // an insertion
d[i-1, j-1] + 1 // a substitution
)
So in my JS code I do the following:
// Step 6
d[i][j] = Minimum(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
// a deletion
if(d[i][j] == d[i - 1][j] + 1) {
actions[i][j] = 'D';
}
// a insertion
if(d[i][j] == d[i][j - 1] + 1) {
actions[i][j] = 'I';
}
// a substitution
if(d[i][j] == d[i - 1][j - 1] + cost) {
actions[i][j] = 'R';
}
A d
matrix (two-dimensional array) is for values, and it get's populated with correct values.
But why the corresponding actions
matrix displays not what logically algorithm would do?
What am I doing wrong with respect to assigning them 'I', 'R', 'D'? Or is it correct and just doesn't seem logical to me, since I thought in the aforeshown scenario, an insertion would take place in the second step.
BTW, is it actually sensible to generate such "actions" matrix in case of a Levenstein algorithm?
There usually are many ways to generate a set of "actions" for any given Levensthein matrix. In your example you can trace back the resulting cost matrix always to the minimun and you'll find quite a few paths.
Here are some examples:
(0,0)(0,1)(1,2)(1,3)(2,4)(3,5)
(0,0)(1,1)(1,2)(1,3)(2,4)(3,5)
(0,0)(0,1)(0,2)(1,3)(2,4)(3,5)
So I could find at least three different interpretations of the same distance matrix. This means, unless you have some way to prefer directions (e.g. prefer substitutions over deletions over insertions), your matrix will be highly ambiguous.
Now to the algorithm you proposed for filling up the action matrix: In your case you are implicitly prefering substituions (because they are checked last and will override previous choices) over insertions and insertions over deletions. That's where all the R
s in your matrix come from. Let's see what happens here:
The proposed solution when we prefer substitutions is to insert A
and N
before anything else then replace M
by N
, A
by A
and X
by S
. If you check you can see that this will have a cost of four (two insertions and two "real" substitutions), which is exactly what the matrix determined (this is the last path in the paths I traced down).
Now checking your action matrix again, what we find is, if we trace back from the final corner: R
, R
and R
in places (3,5)
, (2,4)
and (1,3)
. This corresponds to the final substituion of MAX
to NAS
. What is missing here however is the insertion of the leading AN
which I traced out above. Looking at the matrix, one can see that there are numbers in the first row and colum, not actions. These however should be deletions and substitutions respectively in which case you can produce the final sequence SSRRR
which has a cost of four for turning MAX
into ANNAS
.
However you should be aware, that it is not really necessary to compute the actions in a matrix like you did, because all information will be available in the final cost matrix. You can always trace back the final cost matrix from the last corner to the first and you will be able to reconstruct all paths that can turn one word into another. However once you have fixed the actions in the action matrix, there will only be one path left of all the possibilities.
This has to do a lot with the cost being well and uniquely defined whereas the paths may be highly ambiguous.
EDIT
Here is a full action matrix for the paths, which includes ambiguities:
* D D D
I R R/D D
I R/I R/I R
I R/I R/I R/I
I R/I R R/I/D
I R/I I R