On the PHP manual page for levenshtein(), I am using Example #1 with the following variables:
// input misspelled word
$input = 'htc corporation';
// array of words to check against
$words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange',
'radish', 'carrot', 'pea', 'bean');
Could someone please tell me why the result is carrot rather than htc? Thanks
Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.
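To make the definition concrete, here is a small sketch that uses only PHP's built-in levenshtein() and strlen(). The kitten/sitting pair is a common textbook illustration and is not taken from the question; the second call uses the strings from the question.

```php
<?php
// Textbook pair: kitten -> sitten -> sittin -> sitting (3 single-character edits).
echo levenshtein('kitten', 'sitting'), "\n";       // 3

// Turning the input into 'htc' means deleting every character of
// ' corporation' (a space plus 11 letters), so the distance is 12.
echo levenshtein('htc corporation', 'htc'), "\n";  // 12
echo strlen(' corporation'), "\n";                 // 12
```

Note that a short candidate is penalized for every extra character in the input, even when it matches one of the input's words exactly.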
Here is a simple analysis:
$input = 'htc corporation';
// array of words to check against
$words = array(
'htc',
'Sprint Nextel',
'Sprint',
'banana',
'orange',
'radish',
'carrot',
'pea',
'bean'
);
foreach ($words as $word) {
    // Characters of $input that also occur somewhere in $word
    $ic = array_intersect(str_split($input), str_split($word));
    // l = Levenshtein distance, s = similar_text() score, c = shared-character count
    printf("%s \t l= %s , s = %s , c = %d \n",
        $word,
        levenshtein($input, $word),
        similar_text($input, $word),
        count($ic));
}
Output
htc l= 12 , s = 3 , c = 5
Sprint Nextel l= 14 , s = 3 , c = 8
Sprint l= 12 , s = 1 , c = 7
banana l= 14 , s = 2 , c = 2
orange l= 12 , s = 4 , c = 7
radish l= 12 , s = 3 , c = 5
carrot l= 11 , s = 1 , c = 10
pea l= 13 , s = 2 , c = 2
bean l= 13 , s = 2 , c = 2
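It is clear that htc has a distance of 12 while carrot has a distance of 11, so carrot is reported as the closest match. The distance counts edits over the whole string: turning 'htc corporation' into 'htc' takes 12 deletions, one for each character of ' corporation'.
If you want htc, then Levenshtein alone is not enough. You need to check for an exact word match first and then set priorities.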
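One way to put exact word matches ahead of edit distance is sketched below. closestWord() is a hypothetical helper written for this answer, not a PHP built-in, and it assumes that a case-insensitive match on a single whitespace-separated word of the input should always win.

```php
<?php
// Hypothetical helper (not part of PHP): prefer a candidate that appears as a
// whole word in the input, otherwise fall back to the smallest Levenshtein distance.
function closestWord(string $input, array $words): ?string
{
    // Split the input into lowercase words.
    $tokens = preg_split('/\s+/', strtolower(trim($input)));

    // Priority 1: a candidate that equals one of the input's words.
    foreach ($words as $word) {
        if (in_array(strtolower($word), $tokens, true)) {
            return $word;
        }
    }

    // Priority 2: the candidate with the smallest edit distance.
    $best = null;
    $shortest = PHP_INT_MAX;
    foreach ($words as $word) {
        $distance = levenshtein($input, $word);
        if ($distance < $shortest) {
            $shortest = $distance;
            $best = $word;
        }
    }
    return $best;
}

$words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange',
               'radish', 'carrot', 'pea', 'bean');

echo closestWord('htc corporation', $words), "\n"; // htc
```

With this ordering, 'htc corporation' resolves to htc because htc is one of its words, while a misspelling with no exact word match still falls through to the Levenshtein comparison. Multi-word candidates such as 'Sprint Nextel' never match a single token in this sketch, so they are only considered in the fallback step.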