Compare each row in a column with every other row in the same column and remove the row if the match ratio is > 90 with fuzzy logic in Python

I tried removing duplicates, but some rows have the same content plus some extra information. The data looks like this:
print(df)
Output is :
Page no
0 Hello
2 Hey
3 Helloo
4 Heyy
5 Hellooo
I'm trying to compare each row with every other row and remove a row if its content matches another row with a ratio greater than 90 using fuzzy logic. The expected output is:
Page no
0 Hello
2 Hey
The code I tried is:
from fuzzywuzzy import fuzz

def func(name):
    # flag every row whose content scores at least 90 against `name`
    matches = df.apply(lambda row: (fuzz.ratio(row['Content'], name) >= 90), axis=1)
    print(matches)
    return [i for i, x in enumerate(matches) if x]

func("Hey")
The above code only checks the rows against a single sentence, "Hey".
Can anyone please help me with the code? It would be really helpful.
Use itertools.combinations to get all combinations of values,
then apply() fuzz.ratio() to each pair.
import pandas as pd
import io
import itertools
from fuzzywuzzy import fuzz
df = pd.read_csv(
    io.StringIO(
        """ Page_no
0 Hello
2 Hey
3 Helloo
4 Heyy
5 Hellooo"""
    ),
    sep=r"\s+",  # raw string avoids an invalid-escape warning
)
# find combinations that have greater than 80 match
dfx = pd.DataFrame(itertools.combinations(df["Page_no"].values, 2)).assign(
    ratio=lambda d: d.apply(lambda t: fuzz.ratio(t[0], t[1]), axis=1)
).loc[lambda d: d["ratio"].gt(80)]
# exclude rows that have big match to another row...
df.loc[~df["Page_no"].isin(dfx[1])]
| | Page_no |
|---|---|
| 0 | Hello |
| 2 | Hey |
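
For comparison, here is a minimal sketch of a "keep the first, drop later near-duplicates" loop. It is not the method above, just an alternative under some assumptions: `similarity` and `keep_first_of_near_duplicates` are hypothetical helper names, and the score comes from the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio()`, so scores may not match fuzzywuzzy exactly on other data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Score two strings from 0 to 100 (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def keep_first_of_near_duplicates(values, threshold=80):
    """Return the positions of values that do not score above `threshold`
    against any value that has already been kept."""
    kept = []
    for pos, value in enumerate(values):
        # compare only against rows kept so far, so the first occurrence wins
        if all(similarity(value, values[k]) <= threshold for k in kept):
            kept.append(pos)
    return kept


values = ["Hello", "Hey", "Helloo", "Heyy", "Hellooo"]
print(keep_first_of_near_duplicates(values))  # [0, 1]
```

With a DataFrame, the positions could be fed back in with something like `df.iloc[keep_first_of_near_duplicates(df["Page_no"].tolist())]`. Note that with this stand-in score, "Hey" and "Heyy" score 86, so a threshold of 90 would keep "Heyy"; a cutoff of 80 is what yields the two expected rows here.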