python-3.x pandas loops naming fuzzywuzzy

Dynamically generating an object's name in a pandas column using a for loop (fuzzywuzzy)


Python beginner here (I learned programming with SAS).

I am trying to apply a series of fuzzy string matching formulas (from the fuzzywuzzy library) to pairs of strings stored in a base dataframe, and I'm not sure which way to go about it.

Should I write a loop that creates a separate dataframe for each formula and then append all these sub-dataframes into a single one? The trouble with this approach is that, since I cannot dynamically name the sub-dataframe, the result gets overwritten at each turn of the loop.

Or should I build one dataframe in a single loop, taking my formula names and functions from a dict? This approach runs into the same problem as above.

Here is my formulas dict:

from fuzzywuzzy import fuzz

# ratios dict: maps each ratio's name to its fuzzywuzzy function
ratios = {"ratio": fuzz.ratio,
          "partial ratio": fuzz.partial_ratio,
          "token sort ratio": fuzz.token_sort_ratio,
          "partial token sort ratio": fuzz.partial_token_sort_ratio,
          "token set ratio": fuzz.token_set_ratio,
          "partial token set ratio": fuzz.partial_token_set_ratio
          }

And here is the loop I am currently sweating over:

# for loop iterating over ratios
for r, rn in ratios.items():

    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])

    # new base df containing ratio data and calculations for current loop turn
    df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)

It gives me the same problem: df_out1 is rebuilt on every turn of the loop, so the 'mesure' column gets overwritten and I end up with a column full of the last value (here: 'partial token set ratio').

My overall problem is that I don't understand whether and how I can dynamically name dataframes, columns or values in a Python loop (or whether I'm even supposed to).

I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!


Solution

  • I would create an empty dataframe before the loop and concatenate each iteration's result onto it:

    final_df = pd.DataFrame()
    for r, rn in ratios.items():
        ...
        df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
        df_out1['mesure'] = r
        df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
    
        final_df = pd.concat([final_df, df_out1], axis=0)
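
A variant of the same idea, as a minimal sketch: collect each iteration's frame in a list and call pd.concat once after the loop, so the growing frame is not copied on every pass. The two scorers below are hypothetical stand-ins built on difflib so the snippet runs without fuzzywuzzy installed; the sample data and the names plain_ratio, sorted_token_ratio and frames are assumptions for illustration. The ratios dict from the question should drop in unchanged.

```python
from difflib import SequenceMatcher

import pandas as pd


def plain_ratio(a, b):
    """Stand-in scorer (hypothetical): 0-100 similarity of the raw strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sorted_token_ratio(a, b):
    """Stand-in scorer (hypothetical): same score after sorting the words."""
    return plain_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


# Same shape as the question's dict: display name -> scoring function.
ratios = {"ratio": plain_ratio, "token sort ratio": sorted_token_ratio}

# Made-up sample data; the column names are placeholders.
base_column, target_column = "base", "target"
df_out = pd.DataFrame({
    base_column: ["new york mets", "apple inc"],
    target_column: ["mets new york", "apple inc"],
})

frames = []
for name, func in ratios.items():
    # Fresh copy of the string pairs for this measure.
    part = df_out[[base_column, target_column]].copy()
    part["mesure"] = name
    # Bind func as a default argument so each lambda keeps its own scorer.
    part["valeur"] = df_out.apply(
        lambda row, f=func: f(row[base_column], row[target_column]), axis=1
    )
    frames.append(part)

# One concat at the end: one block of rows per measure, stacked vertically.
final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```

Nothing needs a dynamic variable name here: the measure's name lives in the 'mesure' column, and the list holds the intermediate frames.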
    

    I hope this can help you.
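
On the broader question of dynamic names: a column label is just a string, so it can come straight from the loop variable, and if separate dataframes are really wanted, a dict keyed by name usually replaces dynamically named variables. The sketch below shows both, assuming one column per measure is an acceptable layout. The scorers are again hypothetical difflib stand-ins for the fuzzywuzzy functions, and the data and the names lowercase_ratio and frames_by_name are made up for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


def plain_ratio(a, b):
    """Stand-in scorer (hypothetical): 0-100 similarity of the raw strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def lowercase_ratio(a, b):
    """Stand-in scorer (hypothetical): same score, ignoring case."""
    return plain_ratio(a.lower(), b.lower())


ratios = {"ratio": plain_ratio, "lowercase ratio": lowercase_ratio}

# Made-up sample data; the column names are placeholders.
df = pd.DataFrame({
    "base": ["Apple Inc", "new york"],
    "target": ["apple inc", "new york city"],
})

# Dynamic column names: the dict key becomes the column label.
for name, func in ratios.items():
    df[name] = [func(a, b) for a, b in zip(df["base"], df["target"])]

# Dynamic "dataframe names": store the frames in a dict keyed by name.
frames_by_name = {name: df[["base", "target", name]] for name in ratios}

print(df)
print(frames_by_name["lowercase ratio"])
```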