Search code examples
pythonpandasstringdataframeseries

Finding similarity score between two columns using pandas


I have a dataframe like as shown below

ID,Region,Supplier,year,output
1,Test,Test1,2021,1
2,dummy,tUMMY,2022,1
3,dasho,MASHO,2022,1
4,dahp,ZYZE,2021,0
5,delphi,POQE,2021,1
6,kilby,Daasan,2021,1
7,sarby,abbas,2021,1

df = pd.read_clipboard(sep=',')

My objective is

a) To compare two column values and assign a similarity score.

So, I tried the below

import difflib
[(len(difflib.get_close_matches(x, df['Region'], cutoff=0.6))>1)*1
 for x in df['Supplier']]

However, this gives all output to be '0'. Meaning less than cut-off value of 0.6

However, I expect my output to be like as shown below

enter image description here


Solution

  • Updated answer with similarity flag and score (using difflib.SequenceMatcher)

    cutoff = 0.6
    
    df['similarity_score'] = (
        df[['Region','Supplier']]
        .apply(lambda x: difflib.SequenceMatcher(None, x[0].lower(), x[1].lower()).ratio(), axis=1)
    )
    
    df['similarity_flag'] = (df['similarity_score'] >= cutoff).astype(int)
    

    Output:

       ID  Region Supplier  year  output  similarity_score  similarity_flag
    0   1    Test    Test1  2021       1          0.888889                1
    1   2   dummy    tUMMY  2022       1          0.800000                1
    2   3   dasho    MASHO  2022       1          0.800000                1
    3   4    dahp     ZYZE  2021       0          0.000000                0
    4   5  delphi     POQE  2021       1          0.200000                0
    5   6   kilby   Daasan  2021       1          0.000000                0
    6   7   sarby    abbas  2021       1          0.200000                0
    

    Try using apply with lambda and axis=1:

    df['similarity_flag'] = (
        df[['Region','Supplier']]
        .apply(lambda x: len(difflib.get_close_matches(x[0].lower(), [x[1].lower()])), axis=1)
    )
    

    Output:

       ID  Region Supplier  year  output  similarity_flag
    0   1    Test    Test1  2021       1                1
    1   2   dummy    tUMMY  2022       1                1
    2   3   dasho    MASHO  2022       1                1
    3   4    dahp     ZYZE  2021       0                0
    4   5  delphi     POQE  2021       1                0
    5   6   kilby   Daasan  2021       1                0
    6   7   sarby    abbas  2021       1                0