Search code examples
pythonpython-3.xpandasfor-loopsimilarity

How to find best string match out of multiple possibilities in a dataframe?


I have a DF that looks like this:

    Row      Master                     Option1                  Option2
    1        00150042 plc               WAGON PLC                wegin llp
    2        01 telecom, ltd.           01 TELECOM LTD           telecom 1
    3        0404 investments limited   0404 Investments Ltd     404 Limited Investments

What I am trying to do is to compare the option1 and option2 columns to the master columns separately and obtain a similarity score for each.

I have got the code that provides the score:

    from difflib import SequenceMatcher

    def similar(a, b):
         return SequenceMatcher(None, a, b).ratio()

What I need help with is for the logic on how to implement this.

Is it a for loop that will iterate over the Option1 and the master columns, get the score saved on a new column called Option1_score, and then do the same thing with the Option2 column?

Any help is highly appreciated!


Solution

  • With the dataframe you provided:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "Row": [1, 2, 3],
            "Master": ["00150042 plc", "01 telecom, ltd.", "0404 investments limited"],
            "Option1": ["WAGON PLC", "01 TELECOM LTD", "0404 Investments Ltd"],
            "Option2": ["wegin llp", "telecom 1", "404 Limited Investments"],
        }
    )
    

    Here is one way to do it with Python f-strings and Pandas apply:

    for col in ["Option1", "Option2"]:
        df[f"{col}_score(%)"] = df.apply(
            lambda x: round(similar(x["Master"], x[col]) * 100, 1), axis=1
        )
    

    Then:

    print(df)
    # Output
       Row                    Master               Option1  \
    0    1              00150042 plc             WAGON PLC   
    1    2          01 telecom, ltd.        01 TELECOM LTD   
    2    3  0404 investments limited  0404 Investments Ltd   
    
                       Option2  Option1_score(%)  Option2_score(%)  
    0                wegin llp               9.5              19.0  
    1                telecom 1              26.7              64.0  
    2  404 Limited Investments              81.8              63.8