Search code examples
pythonpandasmapping

Pandas Map Multiple Columns Based on Specific Conditions


My organization uses special codes for various employee attributes. We are migrating to a new system and I have to map these codes to a new code based on certain logic.

Here is my mappings df Mappings:

State    Old_Mgmt    New_Mgmt   Old_ID   New_ID   New_Site
   01        A001        A100     0000     0101        123
   01        A002        A100     0000     0102       
   01        A003        A105     0000     0103        123       
   02        A001        A100     0000     0101        

And here is EmployeeData:

State      Management     ID    Site
   01            A001   0000     456
   01            A002   0000     987
   02            A002   0000     987
....

The logic for the mapping is to go through each row of EmployeeData and if there is a match for State, Management, and ID, then it will update to the corresponding New_ value. However for Site, it will update the Site ID only if New_Site is not blank/NaN. This mapping will modify the original dataframe.

Based on the above mapping the new EmployeeData would be:

State      Management     ID    Site
   01            A100   0101     123 (modified this row)
   01            A100   0102     987 (modified this row)
   02            A002   0000     987
....

My initial thought process was to do something like this:

for i,r in EmployeeData.iterrows(): # For each employee row
    # Create masks for the filters we are looking for
    mask_state = Mappings['State'] == r['State']
    mask_mgmt = Mappings['Old_Mgmt'] == r['Management']
    mask_id = Mappings['Old_ID'] == r['ID']

    # Filter mappings for the above 3 conditions
    MATCH = Mappings[mask_state & mask_mgmt & mask_id]

    if MATCH.empty: # No matches found
        print("No matches found in mapping. No need to update. Skipping.")
        continue
    
    MATCH = MATCH.iloc[0] # If a match is found, it will correspond to only 1 row
    EmployeeData.at[i, 'Management'] = MATCH['New_Mgmt']
    EmployeeData.at[i, 'ID'] = MATCH['New_ID']
    if pd.notna(MATCH['New_Site']):
        EmployeeData.at[i, 'Site'] = MATCH['New_Site']

However this seems fairly inefficient because I have to filter mappings for every row. If only 1 column was being mapped, I would do something like:

# Make a dict mapping Old_Mgmt -> New_Mgmt
MGMT_MAPPING = pd.Series(Mappings['New_Mgmt'].values,index=Mappings['Old_Mgmt']).to_dict()
mask_state = Mappings['State'] = r['State']

EmployeeData.loc[mask_state, 'Management'] = EmployeeData.loc[mask_state, 'Management'].replace(MGMT_MAPPING)

But that would not work for my situation since I need to map multiple values


Solution

  • Left-merge and update:

    EmployeeData.update(EmployeeData
      .rename(columns={'Management': 'Old_Mgmt', 'ID': 'Old_ID'})
      .merge(Mappings.rename(columns={'New_Mgmt': 'Management', 'New_ID': 'ID', 'New_Site': 'Site'}),
             how='left', on=['State', 'Old_Mgmt', 'Old_ID'], suffixes=('_old', None))
      .replace('', None)[EmployeeData.columns]
     )
    

    Updated EmployeeData:

      State Management    ID Site
    0    01       A100  0101  123
    1    01       A100  0102  987
    2    02       A002  0000  987