php levenshtein-distance

Using levenshtein to match target string + extra text


I'm working on a website conversion project, and I need to match inexact strings. I'm looking at using levenshtein(), but I don't know what parameters I should set for my task.

Say I have a target string "elephant". The match I would want to pull is "elephant mouse", for example:

<?php

$target = "elephant";

$data = array(
  'elephant mouse',
  'rhinoceros',
  'alligator',
  'hippopotamus',
  'rat',
);

foreach ( $data as $datum ) {
  echo "$target >> $datum == " .  levenshtein($target, $datum) . "\n";
}

And I get this result:

elephant >> elephant mouse == 6
elephant >> rhinoceros == 10
elephant >> alligator == 7
elephant >> hippopotamus == 10
elephant >> rat == 7

So while rhinoceros and hippopotamus are at 10, I can't really tell elephant mouse apart from rat and alligator, which are neck and neck at 6 and 7. This is made-up data, but in my real data set, words that merely happen to be close in length to the target get a much lower score than words that are target + extra text.

How should I configure the options of levenshtein()? I can set new integer values for the cost of insertion, replacement, and deletion. What weighting will give me what I want?

(If you can think of a better title please edit my post).


Solution

  • The weighting levenshtein($target, $datum, 1, 10, 10), that is, an insertion cost of 1 and a replacement and deletion cost of 10 each, gives me:

    elephant >> elephant mouse == 6
    elephant >> rhinoceros == 65
    elephant >> alligator == 52
    elephant >> hippopotamus == 64
    elephant >> rat == 60
    

    This works very well :) Insertion is cheap, while both replacement and deletion are expensive. As a result, target + extra text gets a low score, whereas any string that requires replacing or deleting characters of the target gets a high score, whatever its length.
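
The weighting above could be wrapped in a small helper that picks the best match. This is only a sketch: closestExtension() and its $maxCost cutoff are hypothetical names, not part of PHP, and the default cutoff of 9 is an assumption that accepts only matches reachable by insertions alone, since a single replacement or deletion already costs 10.

```php
<?php

// Hypothetical helper (not a PHP built-in): returns the candidate with the
// lowest weighted distance from $target, or null if none is within $maxCost.
function closestExtension(string $target, array $candidates, int $maxCost = 9): ?string
{
    $best = null;
    $bestCost = PHP_INT_MAX;

    foreach ($candidates as $candidate) {
        // Costs are insertion = 1, replacement = 10, deletion = 10.
        // Argument order matters: the costs describe turning $target into
        // $candidate, so text appended to the target is a cheap insertion.
        $cost = levenshtein($target, $candidate, 1, 10, 10);

        if ($cost <= $maxCost && $cost < $bestCost) {
            $best = $candidate;
            $bestCost = $cost;
        }
    }

    return $best;
}

$data = array('elephant mouse', 'rhinoceros', 'alligator', 'hippopotamus', 'rat');

var_dump(closestExtension('elephant', $data)); // string(14) "elephant mouse"
var_dump(closestExtension('giraffe', $data));  // NULL
```

Because the costs are asymmetric, swapping the arguments changes the result: levenshtein('elephant mouse', 'elephant', 1, 10, 10) needs six deletions and returns 60 rather than 6. A looser cutoff would also admit candidates that differ by a replaced or dropped character, so the right value depends on the data.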