php, mysql, data-analysis

Similar names in a huge list


I have a database of 50,000+ companies that is constantly updated (200+ per month).

There is a huge issue with repeated content because the names are not always entered exactly or correctly:
"Super 1 Store"
"Super One Store"
"Super 1 Stores"

Edit: another example, which probably needs a different approach:
"Amy's Pizza" <---> "Organic Pizza by Amy and Company"

We need a tool to scan the data for similar names. I have some experience with Levenshtein distance and LCS, but they are suited to checking whether two given strings are similar.
Here I have to scan 50,000 names, maybe each against each, and calculate some kind of overall similarity rating.

I need advice on how to attack this problem. The expected result is a list of 10-20 groups of very similar names, with the option to adjust the sensitivity later to get more results.


Solution

  • I had a similar problem a year or so ago and, if I remember correctly, I solved it (more or less) using similar_text and soundex, as other people said in the comments. Something like this:

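    <?php

    $str1 = "Store 1 for you";
    $str2 = "Store One 4 You";

    // soundex() returns a 4-character phonetic key (first letter plus three
    // digits); non-letters such as digits and spaces are ignored.
    // similar_text() stores the similarity of the two keys in $percent.
    similar_text(soundex($str1), soundex($str2), $percent);

    if ($percent >= 66) {
        echo "Equal";
        // Send an email for review
    } else {
        echo "Different";
        // Proceed to insert in database
    }
    ?>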
    

    In my case I used a threshold of 66% to decide that two companies are the same. When that happens the company is not inserted into the database; instead an email is sent to me so I can review it and check whether the match is correct.
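
    The check above can be wrapped in a small helper that compares one new name against the names already stored. This is only a sketch: the function name findSimilarNames, the $threshold parameter and the sample names are made up for illustration. A linear scan like this is reasonable for checking a single new name on insert, but it does not solve comparing all 50,000 names with each other.

    ```php
    <?php

    /**
     * Return the existing names whose soundex key is at least $threshold
     * percent similar to the soundex key of $newName.
     * Hypothetical helper for illustration, not a PHP built-in.
     */
    function findSimilarNames(string $newName, array $existingNames, float $threshold = 66.0): array
    {
        $newCode = soundex($newName);
        $matches = [];

        foreach ($existingNames as $name) {
            // similar_text() writes the similarity percentage into $percent.
            similar_text($newCode, soundex($name), $percent);
            if ($percent >= $threshold) {
                $matches[] = $name;
            }
        }

        return $matches;
    }

    // Sample data, invented for this example.
    $existing = ["Super 1 Store", "Amy's Pizza", "Blue Garden Cafe"];
    $candidates = findSimilarNames("Super One Store", $existing);

    if ($candidates) {
        // Hold the insert and send the candidates for manual review.
        echo "Review needed: " . implode(", ", $candidates) . PHP_EOL;
    } else {
        // No close match found, so the new company can be inserted.
        echo "No similar names" . PHP_EOL;
    }
    ```

    Because soundex only encodes the start of a string, names that begin differently (such as the "Amy's Pizza" / "Organic Pizza by Amy and Company" pair from the question) will not be caught by this check.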

    After some months with this solution, I decided to identify companies by a unique code instead (the CIF in my case, because it is unique per company here in Spain).
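
    A rough sketch of the unique-code idea, assuming every company comes with such a code. The function names normalizeCode and addCompany, the in-memory array standing in for the table, and the placeholder codes are all invented for this example. In MySQL the same rule would normally be enforced with a UNIQUE index on the code column.

    ```php
    <?php

    /**
     * Normalize a company code so formatting differences do not create
     * separate keys. Assumption: case, spaces and dashes carry no meaning.
     */
    function normalizeCode(string $code): string
    {
        return strtoupper(str_replace([' ', '-'], '', $code));
    }

    /**
     * Add a company unless its code is already present.
     * $companies is an in-memory stand-in for the database table.
     */
    function addCompany(array &$companies, string $code, string $name): bool
    {
        $key = normalizeCode($code);
        if (isset($companies[$key])) {
            return false; // same code, so treat it as the same company
        }
        $companies[$key] = $name;
        return true;
    }

    // Placeholder codes and names, invented for this example.
    $companies = [];
    var_dump(addCompany($companies, "X-0000001", "Super 1 Store"));   // first insert
    var_dump(addCompany($companies, "x 0000001", "Super One Store")); // same code again
    ```

    With a key like this, the spelling of the name no longer decides whether two records are the same company.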