
Compare each row in a column with every other row in the same column and remove the row if the match ratio is > 90, using fuzzy matching in Python


I want to compare each row in a column with every other row in the same column and remove a row if its fuzzy match ratio against another row is greater than 90. I tried dropping duplicates, but some rows have the same content plus some extra information, so they are not exact duplicates. The data looks like this:

print(df)

Output:

    Page no
0   Hello
2   Hey
3   Helloo
4   Heyy
5   Hellooo

I'm trying to compare each row with every other row and remove a row if its content matches another row with a ratio greater than 90, using fuzzy matching. The expected output is:

    Page no
0   Hello
2   Hey

The code I tried is:

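from fuzzywuzzy import fuzz

def func(name):
    # flag every row whose text matches `name` with a ratio of at least 90
    matches = df.apply(lambda row: (fuzz.ratio(row['Page no'], name) >= 90), axis=1)
    print(matches)
    # positions (not index labels) of the matching rows
    return [i for i, x in enumerate(matches) if x]

func("Hey")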

The above code only checks the rows against a single string, "Hey".

Can anyone please help me with the code? It would be really helpful.


Solution

    • Use itertools.combinations to get every pair of values.
    • Apply fuzz.ratio() to each pair.
    • Analyse the results and keep only the rows that are not the later member of a strongly matching pair.

    import pandas as pd
    import io
    import itertools
    from fuzzywuzzy import fuzz
    
    df = pd.read_csv(
        io.StringIO(
            """    Page_no
    0   Hello
    2   Hey
    3   Helloo
    4   Heyy
    5   Hellooo"""
        ),
        sep=r"\s+",
    )
    
    # score every pair of values and keep the pairs whose ratio is greater than 80
    # (80 rather than the 90 in the question: "Hey" vs "Heyy" only scores 86)
    dfx = pd.DataFrame(itertools.combinations(df["Page_no"].values, 2)).assign(
        ratio=lambda d: d.apply(lambda t: fuzz.ratio(t[0], t[1]), axis=1)
    ).loc[lambda d: d["ratio"].gt(80)]
    
    # drop every row whose value is the second (later) member of a matching pair
    df.loc[~df["Page_no"].isin(dfx[1])]
    
    
    Output:
    
        Page_no
    0   Hello
    2   Hey
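
If you want to check the scores without installing anything, here is a minimal standard-library sketch. It makes a few assumptions: difflib.SequenceMatcher stands in for fuzz.ratio (the scores agree on this sample, but that is not guaranteed for every input), and the helper names similarity and drop_near_duplicates are hypothetical, not part of any library. It also uses a different strategy from the pairwise code above: a greedy pass that compares each value only against the values already kept.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def drop_near_duplicates(values, threshold=80):
    """Keep a value only if it scores <= threshold against every value kept so far."""
    kept = []
    for value in values:
        if all(similarity(value, earlier) <= threshold for earlier in kept):
            kept.append(value)
    return kept


values = ["Hello", "Hey", "Helloo", "Heyy", "Hellooo"]

# Score every pair once, to see which cutoff separates the two groups.
scores = {(a, b): similarity(a, b) for a, b in combinations(values, 2)}

print(scores[("Hello", "Helloo")])  # 91
print(scores[("Hey", "Heyy")])      # 86

print(drop_near_duplicates(values, threshold=80))  # ['Hello', 'Hey']
print(drop_near_duplicates(values, threshold=90))  # ['Hello', 'Hey', 'Heyy', 'Hellooo']
```

With a cutoff of 90 the greedy pass keeps "Hellooo", because its only score above 90 is against "Helloo", which was already dropped. The pairwise approach above would still remove it, so pick whichever behaviour suits your data. Assuming the column values are unique, the kept list can then be used to filter the frame, for example with df["Page_no"].isin(kept).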