Search code examples
pythonpython-2.7pandasstring-matchingfuzzywuzzy

Searching one Python dataframe / dictionary for fuzzy matches in another dataframe


I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):

df1:

            PRODUCT_ID           PRODUCT_DESCRIPTION
0           165985858958         "Fish Burger with Lettuce"
1           185965653252         "Chicken Salad with Dressing"
2           165958565556         "Pork and Honey Rissoles"
3           655262522233         "Cheese, Ham and Tomato Sandwich"
4           857485966653         "Coleslaw with Yoghurt Dressing"
5           524156285551         "Lemon and Raspberry Cheesecake"

I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:

df2 (also saved as dict_2)

       PROD_ID   PROD_DESCRIPTION
0      548576    "Fish Burger"
1      156956    "Chckn Salad w/Ranch Dressing"
2      257848    "Rissoles - Lamb & Rosemary"
3      298770    "Lemn C-cake"
4      651452    "Potato Salad with Bacon"
5      100256    "Cheese Cake - Lemon Raspberry Coulis"

What I am wanting to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to help with the heavy lifting part. I would then need to manually check the matches but it would be a lot quicker The ideal outcome would look like this, e.g. with one or more part matches noted:

     PRODUCT_ID      PRODUCT_DESCRIPTION               PROD_ID   PROD_DESCRIPTION
0    165985858958    "Fish Burger with Lettuce"        548576    "Fish Burger"
1    185965653252    "Chicken Salad with Dressing"     156956    "Chckn Salad w/Ranch Dressing"
2    165958565556    "Pork and Honey Rissoles"         257848    "Rissoles - Lamb & Rosemary"     
3    655262522233    "Cheese, Ham and Tomato Sandwich" NaN       NaN
4    857485966653    "Coleslaw with Yoghurt Dressing"  NaN       NaN
5    524156285551    "Lemon and Raspberry Cheesecake"  298770    "Lemn C-cake"
6    524156285551    "Lemon and Raspberry Cheesecake"  100256    "Cheese Cake - Lemon Raspberry Coulis"

I have already completed a join which has identified the exact matches. It's not important that the index is retained as the Product ID's in each df are unique. The results can also be saved into a new dataframe as this will then be applied to a third dataframe that has around 14 million rows.

I've used the following questions and answers (amongst others):

Is it possible to do fuzzy match merge with python pandas
Fuzzy merge match with duplicates including trying jellyfish module as suggested in one of the answers
Python fuzzy matching fuzzywuzzy keep only the best match
Fuzzy match items in a column of an array

and also various loops/functions/mapping etc. but have had no success, either getting the first "fuzzy match" which has a low score or no matches being detected.

I like the idea of a matching/distance score column being generated as per here as it would then allow me to speed up the manual checking process.

I'm using Python 2.7, pandas and have fuzzywuzzy installed.


Solution

  • using fuzz.ratio as my distance metric, calculate my distance matrix like this

    df3 = pd.DataFrame(index=df.index, columns=df2.index)
    
    for i in df3.index:
        for j in df3.columns:
            vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
            vj = df2.get_value(j, 'PROD_DESCRIPTION')
            df3.set_value(
                i, j, fuzz.ratio(vi, vj))
    
    print(df3)
    
        0   1   2   3   4   5
    0  63  15  24  23  34  27
    1  26  84  19  21  52  32
    2  18  31  33  12  35  34
    3  10  31  35  10  41  42
    4  29  52  32  10  42  12
    5  15  28  21  49   8  55
    

    Set a threshold for acceptable distance. I set 50
    Find the index value (for df2) that has maximum value for every row.

    threshold = df3.max(1) > 50
    idxmax = df3.idxmax(1)
    

    Make assignments

    df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
    df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
    df
    

    enter image description here