Search code examples
combinationsjulialevenshtein-distancesequence-alignment

Record all optimal sequence alignments when calculating Levenshtein distance in Julia


I'm working on the Levenshtein distance with Wagner–Fischer algorithm in Julia.

It would be easy to get the optimal value, but a little hard to get the optimal operation sequence, like insert or deletion, while backtrace from the right down corner of the matrix.

I can record the pointer information of each d[i][j], but it might give me 3 directions to go back to d[i-1][j-1] for substitution, d[i-1][j] for deletion and d[i][j-1] for insertion. So I'm trying to get all combination of the operation sets that gave me the optimal Levenshtein distance.

It seems that I can store one operation set in one array, but I don't know the total number of all combinations as well as there length, so it would be hard for me to define an array to store the operation set during the enumeration process. How can I generate arrays while store the former ones? Or I should use Dataframe?


Solution

  • If you implement the Wagner-Fischer algorithm, at some point, you choose the minimum over three alternatives (see Wikipedia pseudo-code). At this point, you save the chosen alternative in another matrix. Using a statement like:

    c[i,j] = indmin([d[i-1,j]+1,d[i,j-1]+1,d[i-1,j-1]+1])
    # indmin returns the index of the minimum element in a collection.
    

    Now c[i,j] contains 1,2 or 3 according to deletion, insertion or substitution. At the end of the calculation, you have the final d matrix element achieving the minimum distance, you then follow the c matrix backwards and read the action at each step. Keeping track of i and j allows reading the exact substitution by looking which element was in string1 at i and string2 at j in the current step. Keeping a matrix like c cannot be avoided because at the end of the algorithm, the information about the intermediate choices (done by min) would be lost.