Search code examples
pythonpandaslevenshtein-distancefuzzy

How to match two dataframes by applying fuzzy_pandas merge on different columns?


I have an input and a master file in which I first tried doing an exact match using the fpd.fuzzy_merge function, which seems to be working perfectly. Afterwards, I want to start applying the levenshtein method with different thresholds on the remaining rows from the input file, excluding the ones that actually had an exact match in the previously executed command.

Sample input:

Raw desc                          Clean desc                  Brand
CAULIFLOWER TRI COLOR 6 CT SAL    cauliflower tri color sal   SYSCO
!SYRUP BLACKBERRY 2LB             syrup blackberry            TYSON
PUREE BRKFST SAUSAGE LINK 24CT    puree brkfst sausage link   TYSON

Master file:

Master raw desc               Master clean desc     SKU
SYRUP & BLACKBERRY 50Z        syrup blackberry      2356123
VEGETABLES MXD 6-10 GCHC      vegetables mxd gchc   4412620
3 LB PAPER FOOD BOAT 500/CS   paper food boat       4551210

After I am using the following code for an exact match, the output is the following which is the desired one. The challenge is to run the remaining of the sample input that didn't have an exact match through the Levenshtein method with different thresholds.

results1 = fpd.fuzzy_merge(
    sample_df, master_df, left_on="Clean desc", right_on="Master clean desc"
)

Output 1):

Raw desc               Clean desc        Brand  Master raw desc          Master clean desc   SKU
!SYRUP BLACKBERRY 2LB  syrup blackberry  TYSON  SYRUP & BLACKBERRY 5OZ   syrup blackberry    2356123

Afterwards, I use the following code using the Levenshtein method with a specific threshold, but no matter what parameter I use it only runs on the two columns I specified. I want to be able to run and return all the columns from both dataframes.

results = fpd.fuzzy_merge(
    sample_df,
    master_df,
    left_on="Clean desc",
    right_on="Master clean desc",
    method="levenshtein",
    threshold=0.85,
    join="left-outer",
    keep="match",
)

Any recs would be much appreciated!


Solution

  • With an extended version of your dataframes (to provide more than one match):

    import pandas as pd
    
    sample_df = pd.DataFrame(
        {
            "Raw desc": [
                "CAULIFLOWER TRI COLOR 6 CT SAL",
                "!SYRUP BLACKBERRY 2LB",
                "PUREE BRKFST SAUSAGE LINK 24CT",
            ],
            "Clean desc": [
                "cauliflower tri color sal",
                "syrup blackberry",
                "puree brkfst sausage link",
            ],
            "Brand": ["SYSCO", "TYSON", "TYSON"],
        }
    )
    
    master_df = pd.DataFrame(
        {
            "Master raw desc": [
                "SYRUP & BLACKBERRY 50Z",
                "VEGETABLES MXD 6-10 GCHC",
                "3 LB PAPER FOOD BOAT 500/CS",
                "CAUFLOWER 3 COLOR SIX SAL",
            ],
            "Master clean desc": [
                "syrup blackberry",
                "vegetables mxd gchc",
                "paper food boat",
                "flwrs multi shape vet",
            ],
            "SKU": [2356123, 4412620, 4551210, 999999],
        }
    )
    

    Here is one way to it with Pandas concat and a list comprehension:

    import fuzzy_pandas as fpd
    
    results = pd.concat(
        [
            fpd.fuzzy_merge(
                sample_df,
                master_df,
                left_on=left_on,
                right_on=right_on,
                method="levenshtein",
                ignore_case=True,
            )
            for left_on, right_on in (
                ("Raw desc", "Master raw desc"),
                ("Clean desc", "Master clean desc"),
            )
        ]
    ).drop_duplicates(ignore_index=True)
    

    Then:

    print(results)
    # Output
    
        Raw desc Clean desc  Brand Master raw desc Master clean desc      SKU
    0  CAULIF...  caulif...  SYSCO  CAUFLO...       flwrs ...          999999
    1  !SYRUP...  syrup ...  TYSON  SYRUP ...       syrup ...         2356123