Tags: php, fuzzy-logic, fuzzywuzzy

Rearrange words using Levenshtein distance


Summary

I am trying to compute a name-matching percentage in PHP, but before that I need to rearrange the words of the second string so that they follow the word order of the first string.

What is the source code about?

I have two strings. I first split each string on spaces into an array, giving $arraydataBaseName and $arraybankData. I then search $arraydataBaseName for every value of $arraybankData and collect the matching keys. The keys come out in the right order, but I am unable to place the values at those positions in a new array.

$dataBaseName = "Jardine Lloyd Thompson";
$bankdata = "Thompson Thompson Jardine"; 

$replacedataBaseName = preg_replace("#[\s]+#", " ", $dataBaseName);
$replacebankData = preg_replace("#[\s]+#", " ", $bankdata); 

$arraydataBaseName = explode(" ",$replacedataBaseName);
$arraybankData = explode(" ",$replacebankData); 

echo "<br/>";
print_r($arraydataBaseName);

$arraysize = count($arraybankData);

// Collect, for each bank word, its position (key) in the database name.
$push = array();
for ($i = 0; $i < $arraysize; $i++)
{
  $key = array_search($arraybankData[$i], $arraydataBaseName);
  // array_search() returns false when nothing is found; key 0 is a valid match,
  // so compare against false instead of testing "> 0".
  if ($key !== false)
  {
    array_push($push, $key);
  }
}
print_r($push);

Case 1:

Input

DatabaseName = Jardine Lloyd Thompson

BankName = Thompson Jardine Lloyd

Output

ExpectedOutput = Jardine Lloyd Thompson
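
For Case 1, where every bank word matches a database word exactly, the missing step is to store each value under the key that array_search() returns and then sort by key. A minimal sketch, assuming exact matches only; the variable names $dbWords, $bankWords and $ordered are illustrative and not taken from the question, and a repeated bank word would simply overwrite the earlier one:

```php
<?php
declare(strict_types=1);

$dbWords   = explode(' ', 'Jardine Lloyd Thompson');
$bankWords = explode(' ', 'Thompson Jardine Lloyd');

// Place each bank word at the position of its exact match in the database name.
$ordered = [];
foreach ($bankWords as $bankWord) {
    $key = array_search($bankWord, $dbWords, true);
    if ($key !== false) {          // key 0 is a valid match, so test against false
        $ordered[$key] = $bankWord;
    }
}

// Sort by position and reindex from 0.
ksort($ordered);
$ordered = array_values($ordered);

echo implode(' ', $ordered), "\n"; // prints: Jardine Lloyd Thompson
```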

Case 2:

Input

DatabaseName = Jardine Lloyd Thompson

BankName = Thoapson Jordine Llayd

If a word is not found in DatabaseName, the match should fall back to the Levenshtein algorithm: the database word with the smallest distance to the bank word determines the key (position) for that bank word.

Output

ExpectedOutput = Jordine Llayd Thoapson
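
The fallback described in Case 2 can be illustrated for a single word with PHP's built-in levenshtein() function. This is only a sketch for one misspelled word; $bankWord, $distances and $bestKey are names chosen here for illustration:

```php
<?php
declare(strict_types=1);

$dbWords  = ['Jardine', 'Lloyd', 'Thompson'];
$bankWord = 'Thoapson';

// Distance from the misspelled bank word to every database word.
$distances = [];
foreach ($dbWords as $key => $dbWord) {
    $distances[$key] = levenshtein($bankWord, $dbWord);
}

// The key with the smallest distance is the position the bank word belongs to.
$bestKey = array_search(min($distances), $distances, true);

echo $bankWord, ' -> position ', $bestKey, ' (', $dbWords[$bestKey], ")\n";
// prints: Thoapson -> position 2 (Thompson)
```

"Thoapson" differs from "Thompson" by a single substituted letter, so its distance is 1 and it lands at key 2.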

Description of Problem

Question Update

When the user input $bankdata contains extra words that cannot be matched, I need to append those words to the end of the result.


Solution

  • This is a simple version that finds the best match for each database word in turn.

    declare (strict_types=1);
    
    $dataBaseName = 'Jardine Lloyd Thompson';
    
    $bankdataRows =
    [
      'Thompson Jardine Lloyd',
      'Blaaa  Llayd Thoapson   f***ing user input   Jordine   aso. ',
    ];
    
    // assume the "database" is already stored trimmed since it is server-side controlled
    $dbWords = preg_split("#[\s]+#", $dataBaseName);
    
    foreach ($bankdataRows as $bankdata)
    {
      // here we trim the data received from client-side.
      $bankWords = preg_split("#[\s]+#", trim($bankdata));
      $result    = [];
    
      if(!empty($bankWords))
        foreach ($dbWords as $dbWord)
        {
          $idx   = null;
          $least = PHP_INT_MAX;
    
          foreach ($bankWords as $k => $bankWord)
            if (($lv = levenshtein($bankWord, $dbWord)) < $least)
            {
              $least = $lv;
              $idx   = $k;
            }
    
          $result[] = $bankWords[$idx];
          unset($bankWords[$idx]);
        }
    
      $result = array_merge($result, $bankWords);
      var_dump($result);
    }
    

    Result:

    array(3) {
      [0] =>
      string(7) "Jardine"
      [1] =>
      string(5) "Lloyd"
      [2] =>
      string(8) "Thompson"
    }
    
    array(8) {
      [0] =>
      string(7) "Jordine"
      [1] =>
      string(5) "Llayd"
      [2] =>
      string(8) "Thoapson"
      [3] =>
      string(5) "Blaaa"
      [4] =>
      string(7) "f***ing"
      [5] =>
      string(4) "user"
      [6] =>
      string(5) "input"
      [7] =>
      string(4) "aso."
    }
    


    You might want to extend this approach by first calculating the Levenshtein distance of every possible pairing and then selecting the best overall match.
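
One possible way to sketch that extension is a brute-force search that tries every assignment of bank words to database positions and keeps the one with the lowest total distance. The function name reorderByBestTotal() is hypothetical, and the search is exponential in the number of words, so this is only reasonable for short names. If there are fewer bank words than database words, the sketch simply fills the leading positions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: reorder $bankWords to follow $dbWords by minimising
 * the total Levenshtein distance; unmatched bank words are appended in
 * their original order.
 */
function reorderByBestTotal(array $dbWords, array $bankWords): array
{
    $dbWords   = array_values($dbWords);
    $bankWords = array_values($bankWords);
    $best      = ['cost' => PHP_INT_MAX, 'picks' => []];

    // Depth-first search: $picks[$slot] is the bank key assigned to database slot $slot.
    $search = function (int $slot, array $picks, int $cost) use (&$search, &$best, $dbWords, $bankWords): void {
        if ($cost >= $best['cost']) {
            return; // cannot beat the best assignment found so far
        }
        if ($slot === count($dbWords) || count($picks) === count($bankWords)) {
            $best = ['cost' => $cost, 'picks' => $picks];
            return;
        }
        foreach ($bankWords as $k => $bankWord) {
            if (in_array($k, $picks, true)) {
                continue; // each bank word is used at most once
            }
            $next   = $picks;
            $next[] = $k;
            $search($slot + 1, $next, $cost + levenshtein($bankWord, $dbWords[$slot]));
        }
    };
    $search(0, [], 0);

    // Matched words first, in database order, then the leftovers.
    $result = [];
    foreach ($best['picks'] as $k) {
        $result[] = $bankWords[$k];
    }
    foreach ($bankWords as $k => $word) {
        if (!in_array($k, $best['picks'], true)) {
            $result[] = $word;
        }
    }
    return $result;
}

echo implode(' ', reorderByBestTotal(
    ['Jardine', 'Lloyd', 'Thompson'],
    ['Thoapson', 'Jordine', 'Llayd']
)), "\n"; // prints: Jordine Llayd Thoapson

// A case where the word-by-word version picks a worse total:
echo implode(' ', reorderByBestTotal(['Anna', 'Ann'], ['Ann', 'Anne'])), "\n";
// prints: Anne Ann
```

In the second call the word-by-word approach would assign "Ann" to "Anna" first (total distance 2), whereas the exhaustive search finds "Anne" for "Anna" and "Ann" for "Ann" (total distance 1).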