Search code examples
pythonpandaslevenshtein-distancefuzzywuzzy

Fuzzy Matching 2 DataFrames on multiple columns which includes one column with Float Values


I have 2 DataFrames namely 'Master_data_df' & 'My_records_df'. I am required to find out records which are missed out from 'Master_data_df' by comparing with 'My_records_df'.

Here the column 'Cleint_Name' is a string and there is no exact match in 2 dataframes. Similarly, the column 'Transaction_Value' is a float and again the values varies slightly.

Master_data_df = pd.DataFrame({'Client_Name': ['Royal DUTCH Shell', 'Royal Dutch Shell' , 
                'China State Construction Engineering', 'CHINA STATE CONSTRUCTION 
                  ENGINEERING'],
                  'Transaction_Value': [23455.25, 6782.67, 35672.76 , 1000.15]})

My_records_df = pd.DataFrame({'Client': ['Rayal Duch Shel', 'China National Petrolium', 
                'Arcellor Mittal' , 'China State Constrn Engg'],
                 'Value': [23455.98, 98426.32 , 45393.62, 35672.15})

I am looking for output as below. The Row with NaN value gives me records which are missing in My_records_df.

I am looking for answer as below

I have tried below:

import pandas as pd
import fuzzy_pandas as fpd
Final_Report_DF = fpd.fuzzy_merge(RMaster_data_df , My_records_df,
                        left_on= ['Client_Name' , 'Transaction_Value'],
                        right_on= ['Client' , 'Value'],
                        method= 'levenshtein',
                        threshold= 0.8)

This gives me a blank dataframe. (Note: fpd.fuzzy_merge does not allow me to use option 'join')

Request guidance in solving this.


Solution

  • You have to fix three things in above example:

    • Convert float to string (float comparison is a better way to find "closely matching" numbers than Levenshtein similarity score).
    • Tune threshold to a lower value like 0.6.
    • Add param join='left-outer'.

    Full example:

    Master_data_df = pd.DataFrame({
        'Client_Name': ['Royal DUTCH Shell', 'Royal Dutch Shell', 'China State Construction Engineering', 'CHINA STATE CONSTRUCTION ENGINEERING'],
        'Transaction_Value': [23455.25, 6782.67, 35672.76 , 1000.15]
        })
    
    My_records_df = pd.DataFrame({
        'Client': ['Rayal Duch Shel', 'China National Petrolium', 'Arcellor Mittal', 'China State Constrn Engg'],
        'Value': [23455.98, 98426.32 , 45393.62, 35672.15]
        })
    
    # Cast float to string to be used for "levenshtein" distance calculation.
    Master_data_df["Transaction_Value"] = Master_data_df["Transaction_Value"].astype(str)
    My_records_df["Value"] = My_records_df["Value"].astype(str)
    
    import pandas as pd
    import fuzzy_pandas as fpd
    Final_Report_DF = fpd.fuzzy_merge(Master_data_df, My_records_df,
                            left_on=['Client_Name', 'Transaction_Value'],
                            right_on=['Client', 'Value'],
                            method='levenshtein',
                            threshold=0.6,
                            join='left-outer')
    

    Output:

                                Client_Name Transaction_Value                    Client     Value
    0                     Royal DUTCH Shell          23455.25           Rayal Duch Shel  23455.98
    1  China State Construction Engineering          35672.76  China State Constrn Engg  35672.15
    2                     Royal Dutch Shell           6782.67                                    
    3  CHINA STATE CONSTRUCTION ENGINEERING           1000.15                                    ```