In PHP I am calculating the Levenshtein distance using the levenshtein() function. For simple characters it works as expected, but for characters with diacritics, as in this example:
echo levenshtein('à', 'a');
it returns "2". Only one replacement is needed here, so I expected it to return "1".
Am I missing something?
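The default PHP levenshtein(), like many PHP functions, is not multibyte aware. When it processes strings containing non-ASCII characters, it compares them byte by byte. In UTF-8, 'à' is two bytes and 'a' is one, so turning one into the other takes two byte-level edits (one substitution plus one deletion), hence the result of 2.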
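To see the byte-versus-character difference directly, here is a minimal sketch. It assumes the mbstring extension is loaded and writes 'à' as the escape "\u{E0}" (PHP 7+) so the result does not depend on how the source file is saved.

```php
<?php
// 'à' is U+00E0; in UTF-8 it is encoded as the two bytes 0xC3 0xA0.
$s = "\u{E0}";

echo strlen($s), "\n";              // 2 (bytes)
echo bin2hex($s), "\n";             // c3a0
echo mb_strlen($s, 'UTF-8'), "\n";  // 1 (characters)
echo levenshtein($s, 'a'), "\n";    // 2 (byte-wise distance)
```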
There is no multibyte version (i.e. no mb_levenshtein()), so you have two options:
1) Re-implement the function yourself, using the mb_ functions. Possible example code from a Gist:
<?php
// Multibyte-aware Levenshtein distance; both arguments are treated as UTF-8.
function levenshtein_php($str1, $str2){
    $length1 = mb_strlen($str1, 'UTF-8');
    $length2 = mb_strlen($str2, 'UTF-8');
    // Make $str1 the longer string so each row has $length2 + 1 entries.
    if ($length1 < $length2) return levenshtein_php($str2, $str1);
    if ($length1 == 0) return $length2;
    if ($str1 === $str2) return 0;
    // $prevRow[$j]: distance from the first $i characters of $str1 to the first $j of $str2.
    $prevRow = range(0, $length2);
    for ($i = 0; $i < $length1; $i++) {
        $currentRow = array();
        $currentRow[0] = $i + 1;
        $c1 = mb_substr($str1, $i, 1, 'UTF-8');
        for ($j = 0; $j < $length2; $j++) {
            $c2 = mb_substr($str2, $j, 1, 'UTF-8');
            $insertions = $prevRow[$j+1] + 1;
            $deletions = $currentRow[$j] + 1;
            // Strict comparison avoids PHP's loose string-comparison rules.
            $substitutions = $prevRow[$j] + (($c1 !== $c2) ? 1 : 0);
            $currentRow[] = min($insertions, $deletions, $substitutions);
        }
        $prevRow = $currentRow;
    }
    return $prevRow[$length2];
}
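The function above calls mb_substr() inside the inner loop, which can get slow on long strings. As a rough sketch of an alternative, the variant below splits each string into an array of UTF-8 characters once with preg_split() and the /u modifier, then runs the same row-by-row computation on the arrays. The name utf8_levenshtein is hypothetical (it is not a PHP built-in), and the example assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper name; not part of PHP or the mbstring extension.
function utf8_levenshtein(string $a, string $b): int
{
    // Split each string into an array of UTF-8 characters, once.
    $charsA = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $charsB = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($charsA === false || $charsB === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    $lenA = count($charsA);
    $lenB = count($charsB);

    // Distances from the empty prefix of $a to every prefix of $b.
    $prevRow = range(0, $lenB);
    for ($i = 0; $i < $lenA; $i++) {
        $currentRow = [$i + 1];
        for ($j = 0; $j < $lenB; $j++) {
            $cost = ($charsA[$i] === $charsB[$j]) ? 0 : 1;
            $currentRow[] = min(
                $prevRow[$j + 1] + 1,  // delete a character of $a
                $currentRow[$j] + 1,   // insert a character of $b
                $prevRow[$j] + $cost   // substitute, or keep a match
            );
        }
        $prevRow = $currentRow;
    }
    return $prevRow[$lenB];
}

echo utf8_levenshtein('à', 'a'), "\n";           // 1
echo utf8_levenshtein('kitten', 'sitting'), "\n"; // 3
```

Like the mb_ version, this counts code points, so a letter followed by a separate combining accent would still count as two characters.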
2) Convert your string's Unicode characters to ASCII. If you specifically want to measure the Levenshtein distance between characters with diacritics and their plain counterparts, though, this is probably not what you want.
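As an illustration of option 2, here is a minimal sketch that folds a few accented letters to ASCII before calling the built-in levenshtein(). The strip_diacritics function and its tiny map are hypothetical and far from complete; something like iconv() with //TRANSLIT or the intl extension's Transliterator may cover more characters, though their output can vary by platform. The example assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper with a deliberately tiny map; a real one needs many more entries.
function strip_diacritics(string $s): string
{
    return strtr($s, [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ç' => 'c', 'ñ' => 'n',
    ]);
}

echo levenshtein(strip_diacritics('à'), 'a'), "\n";        // 0
echo levenshtein(strip_diacritics('café'), 'cave'), "\n";  // 1
```

Note that 'à' versus 'a' now yields 0 rather than 1, which is why this approach does not fit if the accent itself should count as a difference.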