I have 2 dataframes df1 and df2 like this:
df1:
Id Name
1 Tuy Hòa
2 Kiến thụy
3 Bình Tân
df2:
code name
A1 Tuy Hoà
A2 Kiến Thụy
A3 Tân Bình
Now when I use merge:
out_df = pd.merge(df1, df2, left_on=['Name'], right_on=['name'], how='outer')
Of course the result has no matches, because for example Tuy Hòa <> Tuy Hoà (the problem is the position of the accent mark), ...
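You can see the difference with Python's unicodedata module (escapes are used here just to pin down the exact precomposed characters):

```python
import unicodedata

s1 = "Tuy H\u00f2a"   # df1 spelling: grave accent on the 'o'
s2 = "Tuy Ho\u00e0"   # df2 spelling: grave accent on the 'a'

print(s1 == s2)                  # False: plain equality (and merge) fails
print(unicodedata.name(s1[5]))   # LATIN SMALL LETTER O WITH GRAVE
print(unicodedata.name(s2[6]))   # LATIN SMALL LETTER A WITH GRAVE
```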
Then I tried a fuzzy-matching function applied with a lambda, but the result is not correct at all :(
My function:
from thefuzz import fuzz, process  # a.k.a. fuzzywuzzy

def checker(Name, names):
    correct_name = ''
    if Name in names:
        # exact match: keep the name as-is
        correct_name = Name
    else:
        # best fuzzy match above the cutoff; extractOne returns None if nothing qualifies
        x = process.extractOne(Name, names, scorer=fuzz.token_set_ratio, score_cutoff=80)
        if x is not None:
            correct_name = x[0]
    return correct_name
Then I convert:
names = df2['name'].tolist()
After that I use apply to loop through, compare and output the corrected name (df1 must be corrected based on df2):
df1['Name'] = df1['Name'].apply(lambda x: checker(x, names) or x)  # keep x when no match
The output now is:
df1
Id Name
1 Tuy Hòa
2 Kiến Thụy
3 Tân Bình
It seems Tuy Hòa and Kiến Thụy are matched correctly, but Bình Tân and Tân Bình are different names! Swapping word positions here is a real difference, not a typing error :( The expected output should be:
df1
Id Name
1 Tuy Hòa
2 Kiến Thụy
3 Bình Tân
I tried fuzz.ratio with score_cutoff = 55, but in some cases this misses pairs.
Is there a better solution than mine (maybe without a fuzzy library)? Please help me. Many thanks <3
Update: the strategy is the same as in my previous answer, but the algorithm has been changed from Levenshtein distance to Damerau–Levenshtein distance.
I slightly modified your input data to make the example clearer:
>>> df1
Id Name
11 1 Tuy Hòa # match 18 'Tuy Hoà'
12 2 Kiến thụy # match 16 'Kiến Thụy'
13 3 Bình Tân # no match
14 4 Louis Paul # no match
>>> df2
code name
16 A2 Kiến Thụy
17 A3 Tân Bình
18 A1 Tuy Hoà
19 A4 John Doe
Use the Damerau–Levenshtein distance via the pyxDamerauLevenshtein package to compute the minimum number of operations (insertions, deletions, or substitutions of a single character, or transpositions of two adjacent characters) required to change one word into the other (source: Wikipedia).
# pip install pyxDamerauLevenshtein
from pyxdameraulevenshtein import damerau_levenshtein_distance_seqs
seqs = df1['Name'].apply(lambda s: damerau_levenshtein_distance_seqs(s, df2['name']))
>>> seqs
11 [8, 6, 2, 7] # 11 <-> 18 (2 operations)
12 [1, 7, 8, 7] # 12 <-> 16 (1 operation)
13 [7, 6, 7, 6] # 13 <-> 17 or 19 (6 operations)
14 [9, 9, 8, 8] # 14 <-> 18 or 19 (8 operations)
Name: Name, dtype: object
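For intuition (and as a dependency-free check), here is a minimal pure-Python sketch of the restricted Damerau–Levenshtein distance (optimal string alignment); this is only an illustration, not the package's exact implementation:

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # a transposition of two adjacent characters counts as 1 operation
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(dl_distance("ab", "ba"))                        # 1 (one transposition)
print(dl_distance("Tuy H\u00f2a", "Tuy Ho\u00e0"))    # 2, as in the table above
```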
Apply merge with a maximum number of allowed operations:
MAXOPS = 3
df1['other'] = seqs.apply(lambda x: df2.index[x.index(min(x))]
if min(x) <= MAXOPS else pd.NA)
out = pd.merge(df1, df2, left_on='other', right_index=True, how='outer')
>>> out
Id Name other code name
11.0 1.0 Tuy Hòa 18 A1 Tuy Hoà
12.0 2.0 Kiến thụy 16 A2 Kiến Thụy
13.0 3.0 Bình Tân <NA> NaN NaN
14.0 4.0 Louis Paul <NA> NaN NaN
NaN NaN NaN 17 A3 Tân Bình
NaN NaN NaN 19 A4 John Doe
Now you are free to drop all columns that you don't need, keep one name, etc.
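For example, one possible cleanup (the small frame below is a hypothetical stand-in shaped like out above, and keeping df2's spelling is just one choice):

```python
import pandas as pd

# Hypothetical stand-in mirroring the merged frame `out` above
out = pd.DataFrame({
    'Id':   [1, 2, 3, 4, pd.NA, pd.NA],
    'Name': ['Tuy Hòa', 'Kiến thụy', 'Bình Tân', 'Louis Paul', pd.NA, pd.NA],
    'code': ['A1', 'A2', pd.NA, pd.NA, 'A3', 'A4'],
    'name': ['Tuy Hoà', 'Kiến Thụy', pd.NA, pd.NA, 'Tân Bình', 'John Doe'],
})

# Prefer df2's spelling when a match was found, else keep df1's original name
clean = out.assign(Name=out['name'].fillna(out['Name'])).drop(columns=['name'])
print(clean)
```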