Search code examples
javastringgroovylevenshtein-distancefuzzy-comparison

Why is the Levenshtein distance score so low for these two strings?


I am using a Levenshtein distance algorithm to find similar strings and I currently have my score for acceptance as 12 (because some of my strings have up to 5 words). But I was suprised to see the below two strings get a score of 11, they seem very different to me..

def string1 = "Facial Fuel"
def string2 = "Calendula Toner"

println("Result is ${distance(string1,string2)}");

def distance(String str1, String str2) {
   def dist = new int[str1.size() + 1][str2.size() + 1]
   (0..str1.size()).each { dist[it][0] = it }
   (0..str2.size()).each { dist[0][it] = it }

    (1..str1.size()).each { i ->
    (1..str2.size()).each { j ->
        dist[i][j] = [dist[i - 1][j] + 1, dist[i][j - 1] + 1, dist[i - 1][j - 1] + ((str1[i - 1] == str2[j - 1]) ? 0 : 1)].min()
       }
   }
   return dist[str1.size()][str2.size()]
}

Is something wrong with the algorithm here or is that the right score?


Solution

  • Your result is right.

        string1:   F a c i a     l   . F u   e l
        operation: S C S S S I I C I C S S I C S
        string2:   C a l e n d u l a . T o n e r
    

    ('.' means ' ')

    Operations are

        C : Copy
        S : Substitue
        I : Insert
        D : Delete
    

    Levenshtein distance is

        S * 7 + I * 4 = 11