
How to make a more efficient mapping function based on a dataframe with the index and column name of another dataframe as its dataset?


Say I have a dataframe with columns 'Subtype' and 'Building Condition':

Dataframe 1:

  Subtype Building Condition
0       A               Good
1       B                Bad
2       C                Bad

I want to map dataframe 1 using another dataframe, based on the values of those two columns.

Dataframe 2:

          Good          Bad
A       Repair     Retrofit
B     Retrofit  Reconstruct
C  Reconstruct  Reconstruct

Currently I iterate over the first dataframe, use `pd.DataFrame.at` for each row, and append the results to an empty list:

# assignment of intervention based on subtype and building condition
intervention_list = []
for index, row in bldg_df.iterrows():
    # look up the intervention in the matrix by subtype (row label)
    # and building condition (column label)
    intervention = matrix_df.at[row['Subtype'], row['Building Condition']]
    intervention_list.append(intervention)

bldg_df['Intervention'] = intervention_list

So the resulting dataframe would be

  Subtype Building Condition Intervention
0       A               Good       Repair
1       B                Bad  Reconstruct
2       C                Bad  Reconstruct

This works, but I'm wondering if there's a faster, more efficient way of going about this. Maybe using `map` or `merge_asof`?


Solution

  • The documented approach in such cases is to use indexing lookup on the underlying array:

    import numpy as np
    import pandas as pd

    # factorize the condition names into integer codes (idx)
    # and their unique labels (cols)
    idx, cols = pd.factorize(df1['Building Condition'])

    # reorder df2's rows to match df1's subtypes and its columns to
    # match the factorized labels, then pick one value per row
    df1['Intervention'] = (df2.reindex(index=df1['Subtype'], columns=cols)
                              .to_numpy()[np.arange(len(df1)), idx]
                           )
    

    Output:

      Subtype Building Condition Intervention
    0       A               Good       Repair
    1       B                Bad  Reconstruct
    2       C                Bad  Reconstruct
    

    This is also the most efficient approach:

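As the question itself suggests, a `merge`-based alternative also works: reshape df2 into long form with `stack`, then merge on both key columns. A sketch under the same data as above (typically slower than the indexing lookup on large frames, but arguably more readable):

```python
import pandas as pd

df1 = pd.DataFrame({'Subtype': ['A', 'B', 'C'],
                    'Building Condition': ['Good', 'Bad', 'Bad']})
df2 = pd.DataFrame({'Good': ['Repair', 'Retrofit', 'Reconstruct'],
                    'Bad': ['Retrofit', 'Reconstruct', 'Reconstruct']},
                   index=['A', 'B', 'C'])

# reshape df2 into long form: one row per (Subtype, Condition) pair
long = (df2.stack()
           .rename('Intervention')
           .rename_axis(['Subtype', 'Building Condition'])
           .reset_index())

# left-merge on both key columns to attach the intervention
result = df1.merge(long, on=['Subtype', 'Building Condition'], how='left')
print(result)
```

Using `how='left'` preserves df1's row order and keeps rows whose (Subtype, Condition) pair is missing from the matrix as NaN instead of dropping them.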