Search code examples
edit-distance

Detailed distance between words


How would I go about displaying detailed distance between words. For example, the output of the program could be:

Words are "car" and "cure":
Replace "a" with "u".
Add "e".

The Levenshtein distance does not fulfill my needs (I think).


Solution

  • Try the following. The algorithm is roughly following Wikipedia (Levenshtein distance). The language used below is ruby

    Use as an example, the case of changing s into t as follows:

    s = 'Sunday'
    t = 'Saturday'
    

    First, s and t are turned into arrays, and an empty string is inserted at the beginning. m will eventually be the matrix used in the argorithm.

    s = ['', *s.split('')]
    t = ['', *t.split('')]
    m = Array.new(s.length){[]}
    

    m here, however, is different from the matrix given if the algorithm in wikipedia for the fact that each cell includes not only the Levenshtein distance, but also the (non-)operation (starting, doing nothing, deletion, insertion, or substitution) that was used to get to that cell from an adjacent (left, up, or upper-left) cell. It may also include a string describing the parameters of the operation. That is, the format of each cell is:

    [Levenshtein distance, operation(, string)]

    Here is the main routine. It fills in the cells of m following the algorithm:

    s.each_with_index{|a, i| t.each_with_index{|b, j|
        m[i][j] =
        if i.zero?
            [j, "started"]
        elsif j.zero?
            [i, "started"]
        elsif a == b
            [m[i-1][j-1][0], "did nothing"]
        else
            del, ins, subs = m[i-1][j][0], m[i][j-1][0], m[i-1][j-1][0]
            case [del, ins, subs].min
            when del
                [del+1, "deleted", "'#{a}' at position #{i-1}"]
            when ins
                [ins+1, "inserted", "'#{b}' at position #{j-1}"]
            when subs
                [subs+1, "substituted", "'#{a}' at position #{i-1} with '#{b}'"]
            end
        end
    }}
    

    Now, we set i, j to the bottom-right corner of m and follow the steps backwards as we unshift the contents of the cell into an array called steps, until we reach the start.

    i, j = s.length-1, t.length-1
    steps = []
    loop do
        case m[i][j][1]
        when "started"
            break
        when "did nothing", "substituted"
            steps.unshift(m[i-=1][j-=1])
        when "deleted"
            steps.unshift(m[i-=1][j])
        when "inserted"
            steps.unshift(m[i][j-=1])
        end
    end
    

    Then we print the operation and the string of each step unless that is a non-operation.

    steps.each do |d, op, str=''|
        puts "#{op} #{str}" unless op == "did nothing" or op == "started"
    end
    

    With this particular example, it will output:

    inserted 'a' at position 1
    inserted 't' at position 2
    substituted 'n' at position 2 with 'r'