Search code examples
pandasstring-matchingdifflib

Python 3.12 Pandas Difflib Get_Close_Matches to compare two strings in a dataframe and return a % match


Working with irregular Excel tables, I am trying to match questions by looking at a string in a column in a dataframe and if it is a close match to my target string, score the % match.

The way I tried to work on it is to create a new column that scores a % match and outputs the score:

import difflib
import pandas as pd

df = pd.read_excel('Filename.xlsx')

# Create a string to check for % match
x = 'This is my test question?'

# Compare the string to a string in the df in the column df.questions and create a % match score
df['Match_percent'] = difflib.get_close_matches(x.lower(), df.questions.astype(str).lower())[0].ratio()

Three things are wrong - Attribute Error: string object has no attribute 'ratio.'
Series has no attribute 'lower' for the second lowercase expression.
Removing "ratio()" and the second "lower()" seems to give me the best match but repeats in the entire dataframe.

Is there a way to just evaluate the string (in lowercase) in the corresponding row and a return a match score?


Solution

  • I believe you were trying to do a a ratio of the test string against each string in each row using difflib. There was no example input so I created my own.

    import difflib
    import pandas as pd
    
    df = pd.DataFrame(
        {
            "questions": ["jiMmM?", "jAMmy?", "bobbY?", "no?"],
            "other_col": list("1111"),
        }
    )
    x = "JIMMYYY?"
    df["Match_percent"] = df.questions.apply(
        lambda _x: difflib.SequenceMatcher(None, _x.lower(), x.lower()).ratio()
    )
    

    difflib.get_close_matches() finds the closest match between a list of strings and a string; from the list of strings. SequenceMatcher is what you want.