php levenshtein-distance

Is the PHP levenshtein() function buggy?


I am using Example #1 from the PHP manual page for levenshtein() with the following variables:

// input misspelled word
$input = 'htc corporation';

// array of words to check against
$words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange',
        'radish', 'carrot', 'pea', 'bean');

Could someone please tell me why the result is carrot rather than htc, which is what I expected? Thanks


Solution

  • Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.

    Here is a simple analysis:

    $input = 'htc corporation';
    
    // array of words to check against
    $words = array(
        'htc',
        'Sprint Nextel',
        'Sprint',
        'banana',
        'orange',
        'radish',
        'carrot',
        'pea',
        'bean' 
    );
    
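    foreach ( $words as $word ) {

        // Characters of $input that also occur in $word (duplicates are kept)
        $ic = array_intersect(str_split($input), str_split($word));

        // l = Levenshtein distance, s = similar_text() score, c = shared-character count
        printf("%s \t l= %s , s = %s , c = %d \n",
            $word,
            levenshtein($input, $word),
            similar_text($input, $word),
            count($ic));
    }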
    

    Output

    htc      l= 12 , s = 3 , c = 5 
    Sprint Nextel    l= 14 , s = 3 , c = 8 
    Sprint   l= 12 , s = 1 , c = 7 
    banana   l= 14 , s = 2 , c = 2 
    orange   l= 12 , s = 4 , c = 7 
    radish   l= 12 , s = 3 , c = 5 
    carrot   l= 11 , s = 1 , c = 10  
    pea      l= 13 , s = 2 , c = 2 
    bean     l= 13 , s = 2 , c = 2 
    

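    It is clear that htc has a distance of 12 while carrot has 11, so levenshtein() is behaving correctly. Turning 'htc corporation' into 'htc' means deleting the 12 characters of ' corporation', whereas reaching 'carrot' takes only 11 edits. If you want htc, Levenshtein distance alone is not enough: you need to check for exact word matches first and then set priorities.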
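
One way to "compare exact words, then set priorities" is sketched below. It is only an illustration, not part of the original answer: the function name closestWord() and the three priority tiers are assumptions chosen for this example. A candidate that equals the whole input ranks first, a candidate that equals one whitespace-separated word of the input ranks second, and everything else is ranked by levenshtein() alone. Matching is case-insensitive.

```php
<?php
// Hypothetical helper (not from the PHP manual): pick the best candidate
// using a priority tier first and the Levenshtein distance as a tie-breaker.
function closestWord(string $input, array $words): ?string
{
    $inputLower = strtolower($input);
    // Whole words of the input, e.g. ['htc', 'corporation']
    $tokens = preg_split('/\s+/', $inputLower, -1, PREG_SPLIT_NO_EMPTY);

    $best = null;
    $bestPriority = null;
    $bestDistance = null;

    foreach ($words as $word) {
        $wordLower = strtolower($word);

        if ($wordLower === $inputLower) {
            $priority = 0;  // candidate equals the entire input
        } elseif (in_array($wordLower, $tokens, true)) {
            $priority = 1;  // candidate equals one whole word of the input
        } else {
            $priority = 2;  // no word match: edit distance decides
        }

        $distance = levenshtein($inputLower, $wordLower);

        // Lower priority tier wins; within a tier, the smaller distance wins.
        if ($best === null
            || $priority < $bestPriority
            || ($priority === $bestPriority && $distance < $bestDistance)) {
            $best = $word;
            $bestPriority = $priority;
            $bestDistance = $distance;
        }
    }

    return $best; // null when $words is empty
}
```

With the word list from the question, closestWord('htc corporation', $words) returns htc because it falls into the whole-word tier, even though carrot has the smaller raw distance. A misspelling with no whole-word match, such as 'carrrot', is still decided purely by edit distance.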