Search code examples
pythonpandasfuzzy-searchfuzzywuzzyfuzzy-comparison

How to apply fuzzy matching across a dataframe column with multiple lists and save results in a new column


I have a similar problem to the links provided in the following references with minor differences but want the same results:

I have on dataframe and want to get the partial ratio and token between 2 columns within the dataframe. Column 1 is just one word per row, but column 2 is a list of words with each row varying in size(I changed it to a tuple to make the functions in the references work).

The main issue I get is that in the compare it goes through column 1 and compares each element to every element in column 2 thus creating a massive dataframe when I just want it 1 to 1. How can I fix this?

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "fruits": ["apple", "apples", "orange", "apple tree", "oranges", "mango"],
        "choices": [
            ("app", "apull", "apple"),
            ("app", "apull", "apple", "appple"),
            ("orange", "org"),
            ("apple"),
            ("oranges", "orang"),
            ("mango"),
        ],
    }
)
    
   id      fruits      choices
0   1       apple      ('app', 'apull', 'apple')
1   2      apples      ('app', 'apull', 'apple', 'appple')
2   3      orange      ('orange', 'org')
3   4  apple tree      ('apple')
4   5     oranges      ('oranges', 'orang')
5   6       mango      ('mango')

What compare gives me in the variable explorer:

compare = pd.MultiIndex.from_product([df['fruits'], df['choices']]).to_series()

             fruits      choices
    0         apple      ('app', 'apull', 'apple')
    1         apple      ('app', 'apull', 'apple', 'appple')
    2         apple      ('orange', 'org')
    3         apple      ('apple')
    4         apple      ('oranges', 'orang')
    5         apple      ('mango')
    6         apples     ('app', 'apull', 'apple')
    7         apples     ('app', 'apull', 'apple', 'appple')
    8         apples     ('orange', 'org')
    ...

Is it possible to get the desired output like the first output in reference 1 but the multi-indexed elements as the choices?

Expected output like in reference #1, but I want the choices multi-indexed:

Expected output


Solution

  • Here is one way to do it with Pandas explode and groupby:

    from fuzzywuzzy import fuzz
    
    new_df = (
        df.explode("choices")
        .drop(columns="id")
        .pipe(
            lambda df_: df_.assign(
                ratio=df_.apply(lambda x: fuzz.ratio(x["fruits"], x["choices"]), axis=1),
                token=df_.apply(
                    lambda x: fuzz.token_sort_ratio(x["fruits"], x["choices"]), axis=1
                ),
            )
        )
        .groupby(["fruits", "choices"])
        .agg(list)
        .applymap(lambda x: x[0])
    )
    
    

    Then:

    print(new_df)
    # Output
                        ratio  token
    fruits     choices
    apple      app         75     75
               apple      100    100
               apull       60     60
    apple tree apple       67     67
    apples     app         67     67
               apple       91     91
               appple      83     83
               apull       55     55
    mango      mango      100    100
    orange     orange     100    100
               org         67     67
    oranges    orang       83     83
               oranges    100    100