I'm working on a website conversion project, and I need to match inexact strings. I'm looking at using levenshtein(), but I don't know what parameters I should set for my task.
Say I have a target string "elephant". The match I would want to pull is "elephant mouse", for example:
<?php
$target = "elephant";
$data = array(
    'elephant mouse',
    'rhinoceros',
    'alligator',
    'hippopotamus',
    'rat',
);
// Print the default (unweighted) edit distance from the target to each candidate.
foreach ( $data as $datum ) {
    echo "$target >> $datum == " . levenshtein($target, $datum) . "\n";
}
And I get the result:
elephant >> elephant mouse == 6
elephant >> rhinoceros == 10
elephant >> alligator == 7
elephant >> hippopotamus == 10
elephant >> rat == 7
So while rhino and hippo are at 10, in my actual data set I couldn't really tell the difference between "elephant mouse", "rat" and "alligator", which are neck-and-neck at 6 and 7. This is bogus data, but in my data set, words that are merely close in length to the target get a much lower score than words that are target + extra.
How should I configure the options of levenshtein()? I can set new integer values for the cost of insertion, replacement, and deletion. What weighting will give me what I want?
(If you can think of a better title please edit my post).
The weighting levenshtein($target, $datum, 1, 10, 10) gives me:
elephant >> elephant mouse == 6
elephant >> rhinoceros == 65
elephant >> alligator == 52
elephant >> hippopotamus == 64
elephant >> rat == 60
Which works very well :) Insertion has a low cost, while both replacement and deletion are high. This means that target + extra has a low score, whereas strings of equal or shorter length, but with different characters, have a high cost.
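For anyone who wants to pull the single best match rather than eyeball the scores, here is a minimal sketch built on the same weighting. The helper name closestMatch() and its default costs are my own assumptions for illustration, not part of PHP; only levenshtein() itself is a built-in. The target is passed as the first string so that the cheap insertion cost applies to the extra characters in the candidate.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns the candidate with the
// lowest weighted Levenshtein distance from $target, or null if there are none.
// Defaults mirror the 1 / 10 / 10 weighting above: cheap insertion,
// expensive replacement and deletion.
function closestMatch(string $target, array $candidates, int $ins = 1, int $rep = 10, int $del = 10): ?array
{
    $best = null;
    foreach ($candidates as $candidate) {
        // Argument order is insertion, replacement, deletion cost.
        $distance = levenshtein($target, $candidate, $ins, $rep, $del);
        if ($best === null || $distance < $best['distance']) {
            $best = ['match' => $candidate, 'distance' => $distance];
        }
    }
    return $best;
}

$data = ['elephant mouse', 'rhinoceros', 'alligator', 'hippopotamus', 'rat'];
$result = closestMatch('elephant', $data);
echo $result['match'] . ' (' . $result['distance'] . ")\n"; // elephant mouse (6)
```

Note that the weighting is asymmetric: swapping the two string arguments would turn the extra characters into expensive deletions, so the argument order matters here. You will probably also want a cutoff on the distance so that a target with no good candidate does not return an arbitrary "best" string.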