php, matching, levenshtein-distance

Match items from two sets of data by highest % of similarities


Task: I have two columns with product names. I need to find the most similar cell from Column B for Cell A1, then for A2, A3 and so on.

Input:

Col A | Col B
-------------
Red   | Blackwell
Black | Purple      
White | Whitewater     
Green | Reddit  

Output:

Red = Reddit / 66% similar

Black = Blackwell / 71% similar

White = Whitewater / 66% similar

Green = Reddit / 30% similar

I think Levenshtein distance can help with sorting, but I don't know how to apply it.

Thanks in advance, any piece of information helps.


Solution

  • Using nested loops

    <?php
    
    // Arrays of words
    $colA = ['Red', 'Black', 'White', 'Green'];
    $colB = ['Blackwell', 'Purple', 'Whitewater', 'Reddit'];
    
    // loop through words to find the closest
    foreach ($colA as $a) {
    
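        // Best candidate so far: matched character count, word and percentage
        $maxMatches = -1;
        $bestMatch = '';
        $matchPercentage = 0.0; // stays defined even if $colB is empty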
    
        foreach ($colB as $b) {
    
            // similar_text() returns the number of matching characters
            // and writes the similarity percentage into $percent
            $matches = similar_text($a, $b, $percent);
    
            if ($matches > $maxMatches) {
    
                // Found a better match, update
                $maxMatches = $matches;
                $bestMatch = $b;
                $matchPercentage = $percent;
    
            }
    
        }
    
        echo "$a = $bestMatch / " . 
            number_format($matchPercentage, 2) . 
            "% similar\n";
    }
    

    The first loop iterates through the elements of the first array. For each element, it resets the best match found so far and the number of matching characters for that match.

    The inner loop iterates through the array of possible matches, looking for the best one. For each candidate it measures the similarity with similar_text. You could use levenshtein here instead, but similar_text is convenient because it also calculates the percentage for you. If the current candidate has more matching characters than the best match so far, the best match is updated. Note that the comparison uses the number of matching characters, not the percentage; compare $percent instead if you want to rank strictly by percentage.

    For each word in the outer loop we echo the best match found and the percentage. Format as desired.
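
Since the question asks about Levenshtein distance, here is a minimal sketch of the same search using levenshtein() instead of similar_text(). The helper names levenshteinPercent and closestByLevenshtein are made up for this example, and dividing the distance by the length of the longer string is just one possible way to turn it into a percentage (levenshtein() itself only returns the number of edits).

```php
<?php

/**
 * Convert a Levenshtein distance into a 0-100 similarity score.
 * Assumption: normalise by the length of the longer string.
 */
function levenshteinPercent(string $a, string $b): float
{
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 100.0; // two empty strings are identical
    }
    return (1 - levenshtein($a, $b) / $longest) * 100;
}

/**
 * Return [bestMatch, percent] for the candidate closest to $needle.
 * Ties keep the earlier candidate.
 */
function closestByLevenshtein(string $needle, array $candidates): array
{
    $bestMatch = '';
    $bestPercent = -1.0;
    foreach ($candidates as $candidate) {
        $percent = levenshteinPercent($needle, $candidate);
        if ($percent > $bestPercent) {
            $bestPercent = $percent;
            $bestMatch = $candidate;
        }
    }
    return [$bestMatch, $bestPercent];
}

$colA = ['Red', 'Black', 'White', 'Green'];
$colB = ['Blackwell', 'Purple', 'Whitewater', 'Reddit'];

foreach ($colA as $a) {
    [$bestMatch, $percent] = closestByLevenshtein($a, $colB);
    echo "$a = $bestMatch / " . number_format($percent, 2) . "% similar\n";
}
```

The two metrics are not interchangeable. For Red and Reddit, similar_text() reports 66.67% (3 matching characters, doubled, over 9 characters in total), while this normalisation gives 50% (3 edits over 6 characters). Pick whichever scale fits your data, and note that both functions are case-sensitive, so you may want to lowercase the inputs first.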