I have two dataframes of different sizes but with common columns, using ID numbers as the key column.
I want to compare the column "names" in both dataframes.
But sometimes the names do not match, because of upper-case letters or additional uncommon words.
Is there a way to match the column with some tolerance, e.g. 70% of the words matching?
Here is an example:
import pandas as pd
data1 = {'key_column': ['1457', '2356', '2268', '1497', '2298'],
         'names': ['Birds are flying', 'Flowers', 'pink bird', 'pink bird', 'Sun Beach'],
         'random_column_in_df1': ['1', '2', '2', '2', '2'],
         }
data2 = {'key_column': ['2268', '2356', '2298', '1497'],
         'names': ['bird', 'flowers here', 'Sun', 'some text'],
         'random_column_in_df2': ['1', '3', '2', '3'],
         }
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1
df2
expected output:
You can use TheFuzz, a package that implements Levenshtein distance in Python.
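If you prefer to avoid an extra dependency, the standard library's difflib produces a similar (though not identical) similarity score; a minimal sketch, using SequenceMatcher as a stand-in for fuzz.ratio:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score, comparable in spirit to fuzz.ratio / 100."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity('pink bird', 'bird'))       # roughly 0.62
print(similarity('pink bird', 'some text'))  # roughly 0.11
```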
from thefuzz import fuzz

# Inner merge keeps only the keys that exist in both dataframes
df3 = df1.merge(df2, how='inner', on='key_column', suffixes=['_df1', '_df2'])

threshold = 50  # similarity score, value range: 0-100

# Score each pair of names row by row and keep rows at or above the threshold
row_selection = df3.apply(lambda row: fuzz.ratio(row['names_df1'], row['names_df2']) >= threshold, axis=1)
df3 = df3[row_selection]
df3.head(10)
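Note that fuzz.ratio is case-sensitive, so 'Sun' vs 'sun' already costs points, and your question also mentions additional uncommon words. Two common remedies are normalizing the names (e.g. lowercasing) before scoring, or using thefuzz's fuzz.token_set_ratio, which compares word sets and is more tolerant of extra words. Here is a self-contained sketch of the same merge-and-filter approach with lowercase normalization, using only pandas and the standard library's difflib as a stand-in scorer (the 0.5 threshold is an assumption, mirroring the 50 above):

```python
from difflib import SequenceMatcher
import pandas as pd

df1 = pd.DataFrame({'key_column': ['1457', '2356', '2268', '1497', '2298'],
                    'names': ['Birds are flying', 'Flowers', 'pink bird', 'pink bird', 'Sun Beach']})
df2 = pd.DataFrame({'key_column': ['2268', '2356', '2298', '1497'],
                    'names': ['bird', 'flowers here', 'Sun', 'some text']})

def score(a: str, b: str) -> float:
    # Lowercase both strings so capitalization differences do not hurt the score
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

merged = df1.merge(df2, on='key_column', suffixes=('_df1', '_df2'))
keep = merged.apply(lambda r: score(r['names_df1'], r['names_df2']) >= 0.5, axis=1)
result = merged[keep]
print(result['key_column'].tolist())  # ['2356', '2268', '2298'] - key 1497 is dropped
```

With lowercasing, 'Sun Beach' vs 'Sun' scores exactly 0.5 and survives the cutoff, while the unrelated pair 'pink bird' vs 'some text' (key 1497) is filtered out.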