Search code examples
pythonpandasmergefuzzy-searchrapidfuzz

Pandas fast fuzzy match


I have two data frames with the following format:

d = {'id2': ['1', '2'], 'name': ['paris city', 'london town']}
    df1 = pd.DataFrame(data=d)
   
print(df1)
     id2   name 
0     1  paris city              
1     1  london town   


d = {'id2': ['3', '4'], 'name': ['parid cit', 'londoon town']}
    df2 = pd.DataFrame(data=d)
   
print(df2)
     id2   name 
0     3  parid cit              
1     4  londoon town     

I am looking to do a fuzzy string merge based on the name columns. The end result would be the following:

     id2  id2    name_x          name_y      match_level
0     3    1   parid cit       paris city         0.91
1     4    2   londoon town    london town        0.93

I have tried to use the fuzzy wuzzy library, but because both the tables I am merging have 100k+ rows, the code would take days to run.

I have also seen some 'fast' implementations like rapid fuzz or others that use tfidf or k nearest neighbors: However all of the code examples show how to find matches for a single string against a list, i've have not been able to find any other libraries/implemntation that do a fuzzy merge between two dataframes.

What is a fast and efficient way to fuzzy merge pandas dataframes?


Solution

  • import fuzzymatcher
    import pandas as pd
    
    df_left =  pd.DataFrame({'id2': ['1', '2'], 'name': ['paris city', 'london town']})
    
    df_right =  pd.DataFrame({'id2': ['3', '4'], 'name': ['parid cit', 'londoon town']})
    
    fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on = "name", right_on = "name")
    

    This is the address:https://github.com/RobinL/fuzzymatcher