Tags: ruby-on-rails, string, compare, similarity, levenshtein-distance

What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)


So, I started with this: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Ruby

Which works great for really small strings. But my strings can be upwards of 10,000 characters long, and since that implementation of the Levenshtein distance is recursive, this causes a "stack too deep" error in my Ruby on Rails app.

So, is there another, maybe less stack intensive method of finding the similarity between two large strings?

Alternatively, I'd need a way to make the stack much larger. (I don't think this is the right way to solve the problem, though.)


Solution

  • Consider a non-recursive version to avoid the excessive call stack overhead. Seth Schroeder has an iterative implementation in Ruby that fills in a two-dimensional table instead (stored here as a Hash keyed by [row, column] pairs); this is the dynamic programming approach to Levenshtein distance outlined in the pseudocode of the Wikipedia article. Seth's Ruby code is reproduced below:

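    def levenshtein(s1, s2)
      # d[[i, j]] is the edit distance between the first i characters
      # of s1 and the first j characters of s2.
      d = {}

      # Turning a prefix of s1 into the empty string takes `row` deletions.
      (0..s1.size).each do |row|
        d[[row, 0]] = row
      end

      # Turning the empty string into a prefix of s2 takes `col` insertions.
      (0..s2.size).each do |col|
        d[[0, col]] = col
      end

      # Fill in the rest of the table; no recursion, so no deep call stack.
      # Note: the table holds (s1.size + 1) * (s2.size + 1) entries.
      (1..s1.size).each do |i|
        (1..s2.size).each do |j|
          cost = s1[i - 1] == s2[j - 1] ? 0 : 1
          d[[i, j]] = [d[[i - 1, j]] + 1,        # deletion
                       d[[i, j - 1]] + 1,        # insertion
                       d[[i - 1, j - 1]] + cost  # substitution (or match)
                      ].min
        end
      end

      d[[s1.size, s2.size]]
    end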
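
For strings around 10,000 characters, the full table above would hold on the order of 100 million entries, so memory may become the next problem after the stack. Each cell only depends on the current and previous rows, so one possible refinement is to keep just two rows. The following is a minimal sketch of that idea, not Seth's code; the method names `levenshtein_two_rows` and `similarity` are made up for illustration, and the 0.0 to 1.0 score is just one assumed way to turn a distance into a "similarity".

```ruby
# Sketch only: two-row variant of the table-based Levenshtein distance.
# Method names are hypothetical, chosen for this example.
def levenshtein_two_rows(s1, s2)
  # Put the shorter string along the row so each row stays as small as possible.
  s1, s2 = s2, s1 if s1.size < s2.size
  a = s1.chars
  b = s2.chars

  # previous[j] = distance between the prefix of a handled so far and b[0, j].
  previous = (0..b.size).to_a

  a.each_with_index do |char_a, i|
    # Column 0: deleting the first i + 1 characters of a.
    current = [i + 1]
    b.each_with_index do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j + 1] + 1,  # deletion
                  current[j] + 1,       # insertion
                  previous[j] + cost    # substitution (or match)
                 ].min
    end
    previous = current
  end

  previous.last
end

# Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
def similarity(s1, s2)
  longest = [s1.size, s2.size].max
  return 1.0 if longest.zero?
  1.0 - levenshtein_two_rows(s1, s2).to_f / longest
end
```

For example, `levenshtein_two_rows("kitten", "sitting")` returns 3. This only reduces memory use: the running time is still proportional to the product of the two lengths, so two 10,000-character strings still need roughly 100 million cell updates, which is likely to be slow in pure Ruby.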