Tags: python, python-3.x, pandas, for-loop, similarity

Iterating over 2 columns and comparing similarities in Python


I have a DataFrame that looks like this:

Row      Account_Name_HGI           company_name_Ignite
1        00150042 plc               WAGON PLC
2        01 telecom, ltd.           01 TELECOM LTD
3        0404 investments limited   0404 Investments Ltd

What I am trying to do is iterate through the Account_Name_HGI and company_name_Ignite columns, compare the two strings in each row, and get a similarity score for each pair. I already have code that computes the score:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

That gives me the similarity score I want, but I am having trouble with the logic for a for loop that iterates over the two columns and returns the score for each row. Any help would be appreciated.


Solution

  • Use a list comprehension that zips the two columns together:

    from difflib import SequenceMatcher
    
    # Pair up the two columns row by row and score each pair of strings
    df['ratio'] = [SequenceMatcher(None, a, b).ratio()
                   for a, b
                   in zip(df['Account_Name_HGI'], df['company_name_Ignite'])]
    
    print(df)
       Row          Account_Name_HGI   company_name_Ignite     ratio
    0    1              00150042 plc             WAGON PLC  0.095238
    1    2          01 telecom, ltd.        01 TELECOM LTD  0.266667
    2    3  0404 investments limited  0404 Investments Ltd  0.818182
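
To try the approach above end to end, here is a minimal, self-contained sketch that rebuilds the three sample rows from the question and scores them. The `ratio_ci` column and the lowercasing step are assumptions added for illustration and are not part of the original answer: `SequenceMatcher` compares characters case-sensitively, which is likely why `01 telecom, ltd.` and `01 TELECOM LTD` score so low.

```python
from difflib import SequenceMatcher

import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'Row': [1, 2, 3],
    'Account_Name_HGI': ['00150042 plc', '01 telecom, ltd.',
                         '0404 investments limited'],
    'company_name_Ignite': ['WAGON PLC', '01 TELECOM LTD',
                            '0404 Investments Ltd'],
})


def similar(a, b):
    """Return the difflib similarity ratio (0.0 to 1.0) of two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Case-sensitive score, as in the answer above
df['ratio'] = [similar(a, b)
               for a, b in zip(df['Account_Name_HGI'], df['company_name_Ignite'])]

# Hypothetical extra column: lowercase both names first so that case
# differences alone do not lower the score
df['ratio_ci'] = [similar(a.lower(), b.lower())
                  for a, b in zip(df['Account_Name_HGI'], df['company_name_Ignite'])]

print(df[['Row', 'ratio', 'ratio_ci']])
```

If either column may contain missing values, they would likely need to be filled or converted to strings first (for example with `fillna('')`), since both `SequenceMatcher` and `str.lower` expect actual strings.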