Search code examples
phplevenshtein-distance

What does cost mean in php's levenshtein function to compare strings?


I'm studying php's levenshtein function to create a search in a small redis instance to get matches even if there are typos in the submitted search term. While the most of it is quite self explaining, I'm struggling to get how to best use the three different cost parameters.

int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )

There's a short explanation in the documentation

A second variant will take three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient.

But that doesn't solve my not understanding. Can someone explain how I can use cost parameters to improve the results/performance?


Solution

  • In Machine Learning, a cost function is the function that you're trying to minimize to achieve the best result. When the machine execute, for instance, Steps A, B and C, the Cost Function will calculate how much it 'costed' to execute those steps. The term cost is connected to a mathematical function that will evaluate the performance of your result.

    When computer, for instance, execute the Steps B, C and A, the cost function will tell you if you got a better or worst result than the previous step.

    Reading the documentation, you can see that

    The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2.

    That is your cost function: Minimize the distance in terms of replace, delete and insert.

    Each time the algorithm has to perform one of those tasks, it adds up one cost of that operation. Those last 3 parameters lets you decide the value of each operation. At the end of the comparison, you will get a final value, namely cost of the function. If that value is smaller than a defined threshold, it's the same as assuming the function returned true. In case the result is bigger than your threshold, it means it costs more than what you allow for one string be equal to another.