Tags: python, pandas, lambda, apply, fuzzywuzzy

How can I combine a fuzzy-matching function with apply(lambda x: ...)?


I have two dataframes, df1 and df2, like this:

df1:

Id    Name      
1    Tuy Hòa      
2    Kiến thụy 
3    Bình Tân

df2:

code    name      
A1     Tuy Hoà    
A2     Kiến Thụy  
A3     Tân Bình 

Now when I use merge:

out_df = pd.merge(df1, df2, left_on=['Name'], right_on=['name'], how='outer')

Of course the result is all null, because, for example, Tuy Hòa != Tuy Hoà (the grave accent sits on a different letter), and so on. So I used a fuzzy-matching function with apply and a lambda, but it's not correct at all :(
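
To see why a plain merge can never match these names, note that even Unicode normalization will not equalize them: the accent belongs to a different vowel in each spelling. A quick stdlib check (illustrative only, not part of my code):

```python
import unicodedata

a, b = 'Tuy Hòa', 'Tuy Hoà'

# Normalize both to composed form (NFC). They stay different strings,
# because in `a` the grave accent is on 'o' ('ò') while in `b` it is
# on 'a' ('à') -- these are simply two different character sequences.
na = unicodedata.normalize('NFC', a)
nb = unicodedata.normalize('NFC', b)
print(na == nb)  # False: normalization cannot fix a moved diacritic
```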

My function:

from fuzzywuzzy import fuzz, process

def checker(Name, names):
    correct_name = ''
    if Name in names:
        correct_name = Name  # exact match, keep as-is
    else:
        # extractOne returns None when no candidate reaches score_cutoff
        x = process.extractOne(Name, names, scorer=fuzz.token_set_ratio,
                               score_cutoff=80)
        if x is not None:
            correct_name = x[0]  # x is a (match, score) tuple
    return correct_name

Then I convert the df2 names to a list:

names = df2['name'].tolist()

After that I use apply to loop through df1, compare each name, and write back the corrected value (df1 must be corrected based on df2):

df1['Name'] = df1['Name'].apply(lambda x: checker(x, names) or x)

The output now is:

df1

Id    Name      
1    Tuy Hòa      
2    Kiến Thụy 
3    Tân Bình

It seems Tuy Hòa and Kiến Thụy are corrected, but Bình Tân and Tân Bình are different names! Swapping the word order here gives a genuinely different place, not a typing error :( The expected output should be:

df1

Id    Name      
1    Tuy Hòa      
2    Kiến Thụy 
3    Bình Tân

I tried fuzz.ratio with score_cutoff = 55, but in some cases that misses valid pairs.
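
The root cause is that token_set_ratio ignores word order: it compares sorted token sets, so "Bình Tân" and "Tân Bình" score a perfect 100. A rough stdlib illustration of that behaviour (using difflib instead of fuzzywuzzy, as a simplified sketch of what token-based scorers do):

```python
from difflib import SequenceMatcher

def token_sort_score(a: str, b: str) -> float:
    """Score two strings after sorting their tokens, mimicking (roughly)
    fuzz.token_sort_ratio / token_set_ratio. Word order is discarded."""
    sa = ' '.join(sorted(a.split()))
    sb = ' '.join(sorted(b.split()))
    return 100 * SequenceMatcher(None, sa, sb).ratio()

print(token_sort_score('Bình Tân', 'Tân Bình'))    # 100.0: order is lost
print(token_sort_score('Kiến thụy', 'Kiến Thụy'))  # high: one letter differs
```

So any order-insensitive scorer will treat the swapped name as a perfect match; no cutoff can separate the two cases.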

Is there a better solution than mine (maybe not using a fuzzy library)? Please help me. Many thanks <3


Solution

  • Update: the strategy is the same as in my previous answer, but the algorithm has been changed from Levenshtein distance to Damerau–Levenshtein distance.

    I slightly modified your input data for easier understanding:

    >>> df1
        Id        Name
    11   1     Tuy Hòa  # match 18 'Tuy Hoà'
    12   2   Kiến thụy  # match 16 'Kiến Thụy'
    13   3    Bình Tân  # no match
    14   4  Louis Paul  # no match
    
    >>> df2
       code       name
    16   A2  Kiến Thụy
    17   A3   Tân Bình
    18   A1    Tuy Hoà
    19   A4   John Doe
    

    Use the Damerau–Levenshtein distance, via the pyxDamerauLevenshtein package, to compute the minimum number of operations (insertions, deletions or substitutions of a single character, or transpositions of two adjacent characters) required to change one word into the other (source: Wikipedia).

    # pip install pyxDamerauLevenshtein
    from pyxdameraulevenshtein import damerau_levenshtein_distance_seqs
    
    seqs = df1['Name'].apply(lambda s: damerau_levenshtein_distance_seqs(s, df2['name']))
    
    >>> seqs
    11    [8, 6, 2, 7]  # 11 <-> 18 (2 operations)
    12    [1, 7, 8, 7]  # 12 <-> 16 (1 operation)
    13    [7, 6, 7, 6]  # 13 <-> 17 or 19 (6 operations)
    14    [9, 9, 8, 8]  # 14 <-> 18 or 19 (8 operations)
    Name: Name, dtype: object
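
    The per-row distances above can be reproduced with a small pure-Python sketch of the (restricted) Damerau–Levenshtein distance. This is illustrative only — the answer uses the pyxDamerauLevenshtein package — but it makes clear why a word swap is expensive while a moved accent is cheap:

```python
def dl_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment):
    insertions, deletions, substitutions, plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition

    return d[len(a)][len(b)]

print(dl_distance('Kiến thụy', 'Kiến Thụy'))  # 1: one substitution (t -> T)
print(dl_distance('Tuy Hòa', 'Tuy Hoà'))      # 2: accent moves to another vowel
print(dl_distance('Bình Tân', 'Tân Bình'))    # 6: a word swap costs many edits
```

    Because the distance counts single-character edits, reordering whole words ("Bình Tân" vs "Tân Bình") stays far above the cutoff, while diacritic and case slips stay well below it.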
    

    Merge, allowing a maximum number of operations:

    MAXOPS = 3
    
    df1['other'] = seqs.apply(lambda x: df2.index[x.index(min(x))]
                                  if min(x) <= MAXOPS else pd.NA)
    out = pd.merge(df1, df2, left_on='other', right_index=True, how='outer')
    
    >>> out
           Id        Name other code       name
    11.0  1.0     Tuy Hòa    18   A1    Tuy Hoà
    12.0  2.0   Kiến thụy    16   A2  Kiến Thụy
    13.0  3.0    Bình Tân  <NA>  NaN        NaN
    14.0  4.0  Louis Paul  <NA>  NaN        NaN
    NaN   NaN         NaN    17   A3   Tân Bình
    NaN   NaN         NaN    19   A4   John Doe
    

    Now you are free to drop all columns that you don't need, keep one name, etc.
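
    For instance, a possible cleanup (a hypothetical follow-up, not part of the answer's code) keeps df2's spelling where a match was found, falls back to df1's original name otherwise, and drops the helper columns. The small frame below rebuilds just the df1 rows of `out`:

```python
import pandas as pd

# Matched rows of `out` from the answer, rebuilt for a self-contained example.
out = pd.DataFrame({
    'Id':    [1, 2, 3],
    'Name':  ['Tuy Hòa', 'Kiến thụy', 'Bình Tân'],
    'other': [18, 16, None],
    'code':  ['A1', 'A2', None],
    'name':  ['Tuy Hoà', 'Kiến Thụy', None],
})

# Prefer df2's spelling ('name') when present, else keep df1's 'Name'.
out['Name'] = out['name'].combine_first(out['Name'])
out = out.drop(columns=['other', 'name'])
print(out['Name'].tolist())  # ['Tuy Hoà', 'Kiến Thụy', 'Bình Tân']
```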