python pandas loops dataframe sequencematcher

How to iterate through 2 columns and match one by one


Let's say I have 2 Excel files, each containing a column of names and dates.

Excel 1:

Name
0      Bla bla bla June 04 2018 
1      Puppy Dog June 01 2017
2      Donald Duck February 24 2017
3      Bruno Venus April 24 2019

Excel 2:

                             Name
0        Pluto Feb 09 2019
1        Donald Glover Feb 22 2020
2        Dog Feb 22 2020
3        Bla Bla Feb 22 2020

I want to compare each cell in column 1 against every cell in column 2 and then find the pair with the highest similarity.

The following function returns a similarity ratio between 0 and 1 that indicates how closely two input strings match each other.

SequenceMatcher code example:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


x = "Adam Clausen a Feb 09 2019"
y = "Adam Clausen Feb 08 2019"
print(similar(x,y))

Output: 0.92


Solution

  • If you know how to load the columns as a DataFrame, this code should get the job done:

    from difflib import SequenceMatcher
    
    col_1 = ['potato','tomato', 'apple']
    col_2 = ['tomatoe','potatao','appel']
    
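    def similar(a, b):
        # Return the ratio first so that max() compares on it.
        ratio = SequenceMatcher(None, a, b).ratio()
        matches = a, b
        return ratio, matches
    
    # For each entry in col_1, print the best-scoring (ratio, (a, b)) pair.
    for i in col_1:
        print(max(similar(i, j) for j in col_2))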
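Applied to the sample data from the question, the same idea might look like the sketch below. The names `best_matches`, `excel_1` and `excel_2` are made up for illustration, and the two lists are typed in by hand rather than read from the real files, so treat this as one possible way to do it, not a definitive one.

```python
from difflib import SequenceMatcher

def similar(a, b):
    # Similarity ratio between 0 and 1, as in the question.
    return SequenceMatcher(None, a, b).ratio()

def best_matches(col_1, col_2):
    # Hypothetical helper: for each value in col_1, find the value in
    # col_2 with the highest ratio and return (value, match, ratio).
    results = []
    for a in col_1:
        best = max(col_2, key=lambda b: similar(a, b))
        results.append((a, best, similar(a, best)))
    return results

# Sample rows copied from the question. With pandas, lists like these
# could presumably come from something such as
# pd.read_excel("file_1.xlsx")["Name"].tolist()  (file name is hypothetical).
excel_1 = [
    "Bla bla bla June 04 2018",
    "Puppy Dog June 01 2017",
    "Donald Duck February 24 2017",
    "Bruno Venus April 24 2019",
]
excel_2 = [
    "Pluto Feb 09 2019",
    "Donald Glover Feb 22 2020",
    "Dog Feb 22 2020",
    "Bla Bla Feb 22 2020",
]

for name, match, ratio in best_matches(excel_1, excel_2):
    print(f"{ratio:.2f}  {name!r} -> {match!r}")
```

Every cell is compared against every other cell, so the work grows with len(col_1) * len(col_2). That is fine for small sheets but may get slow for large ones.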