I have 2 dataframes df1 and df2 like this:
df1:
Id Name
1 Tuy Hòa
2 Kiến thụy
3 Bình Tân
df2:
code name
A1 Tuy Hoà
A2 Kiến Thụy
A3 Tân Bình
Now when I use merge:
out_df = pd.merge(df1, df2, left_on=['Name'], right_on=['name'], how='outer')
Of course the result has no matches, because for example Tuy Hòa <> Tuy Hoà (the problem is the position of the accent mark), ...
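You can see the difference with Python's unicodedata module (escapes are used here just to pin down the exact precomposed characters):

```python
import unicodedata

s1 = "Tuy H\u00f2a"   # df1 spelling: grave accent on the 'o'
s2 = "Tuy Ho\u00e0"   # df2 spelling: grave accent on the 'a'

print(s1 == s2)                  # False: plain equality (and merge) fails
print(unicodedata.name(s1[5]))   # LATIN SMALL LETTER O WITH GRAVE
print(unicodedata.name(s2[6]))   # LATIN SMALL LETTER A WITH GRAVE
```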
Then I tried a fuzzy-matching function applied with a lambda, but the result is not correct at all :(
My function:
from thefuzz import fuzz, process  # a.k.a. fuzzywuzzy

def checker(Name, names):
    correct_name = ''
    if Name in names:
        # exact match: keep the name as-is
        correct_name = Name
    else:
        # best fuzzy match above the cutoff; extractOne returns None if nothing qualifies
        x = process.extractOne(Name, names, scorer=fuzz.token_set_ratio, score_cutoff=80)
        if x is not None:
            correct_name = x[0]
    return correct_name
Then I convert:
names = df2['name'].tolist()
After that I use apply to loop through, compare and output the corrected name (df1 must be corrected based on df2):
df1['Name'] = df1['Name'].apply(lambda x: checker(x, names) or x)  # keep x when no match
The output now is:
df1
Id Name
1 Tuy Hòa
2 Kiến Thụy
3 Tân Bình
It seems Tuy Hòa and Kiến Thụy are matched correctly, but Bình Tân and Tân Bình are different names! Swapping word positions here is a real difference, not a typing error :( The expected output should be:
df1
Id Name
1 Tuy Hòa
2 Kiến Thụy
3 Bình Tân
I tried fuzz.ratio with score_cutoff = 55, but in some cases this misses pairs.
Is there a better solution than mine (maybe without a fuzzy library)? Please help me. Many thanks <3
Update: the strategy is the same as in my previous answer, but the algorithm has been changed from Levenshtein distance to Damerau–Levenshtein distance.
I slightly modified your input data to make the example clearer:
>>> df1
Id Name
11 1 Tuy Hòa # match 18 'Tuy Hoà'
12 2 Kiến thụy # match 16 'Kiến Thụy'
13 3 Bình Tân # no match
14 4 Louis Paul # no match
>>> df2
code name
16 A2 Kiến Thụy
17 A3 Tân Bình
18 A1 Tuy Hoà
19 A4 John Doe
Use the Damerau–Levenshtein distance via the pyxDamerauLevenshtein package to compute the minimum number of operations (insertions, deletions, or substitutions of a single character, or transpositions of two adjacent characters) required to change one word into the other (source: Wikipedia).
# pip install pyxDamerauLevenshtein
from pyxdameraulevenshtein import damerau_levenshtein_distance_seqs
seqs = df1['Name'].apply(lambda s: damerau_levenshtein_distance_seqs(s, df2['name']))
>>> seqs
11 [8, 6, 2, 7] # 11 <-> 18 (2 operations)
12 [1, 7, 8, 7] # 12 <-> 16 (1 operation)
13 [7, 6, 7, 6] # 13 <-> 17 or 19 (6 operations)
14 [9, 9, 8, 8] # 14 <-> 18 or 19 (8 operations)
Name: Name, dtype: object
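For intuition (and as a dependency-free check), here is a minimal pure-Python sketch of the restricted Damerau–Levenshtein distance (optimal string alignment); this is only an illustration, not the package's exact implementation:

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # a transposition of two adjacent characters counts as 1 operation
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(dl_distance("ab", "ba"))                        # 1 (one transposition)
print(dl_distance("Tuy H\u00f2a", "Tuy Ho\u00e0"))    # 2, as in the table above
```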
Apply merge with a maximum number of allowed operations:
MAXOPS = 3
df1['other'] = seqs.apply(lambda x: df2.index[x.index(min(x))]
if min(x) <= MAXOPS else pd.NA)
out = pd.merge(df1, df2, left_on='other', right_index=True, how='outer')
>>> out
Id Name other code name
11.0 1.0 Tuy Hòa 18 A1 Tuy Hoà
12.0 2.0 Kiến thụy 16 A2 Kiến Thụy
13.0 3.0 Bình Tân <NA> NaN NaN
14.0 4.0 Louis Paul <NA> NaN NaN
NaN NaN NaN 17 A3 Tân Bình
NaN NaN NaN 19 A4 John Doe
Now you are free to drop all columns that you don't need, keep one name, etc.
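For example, one possible cleanup (the small frame below is a hypothetical stand-in shaped like out above, and keeping df2's spelling is just one choice):

```python
import pandas as pd

# Hypothetical stand-in mirroring the merged frame `out` above
out = pd.DataFrame({
    'Id':   [1, 2, 3, 4, pd.NA, pd.NA],
    'Name': ['Tuy Hòa', 'Kiến thụy', 'Bình Tân', 'Louis Paul', pd.NA, pd.NA],
    'code': ['A1', 'A2', pd.NA, pd.NA, 'A3', 'A4'],
    'name': ['Tuy Hoà', 'Kiến Thụy', pd.NA, pd.NA, 'Tân Bình', 'John Doe'],
})

# Prefer df2's spelling when a match was found, else keep df1's original name
clean = out.assign(Name=out['name'].fillna(out['Name'])).drop(columns=['name'])
print(clean)
```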