Search code examples
pythonpandasdataframemergelines-of-code

Find matching value in column and create another column pandas dataframe


Suppose I have the following dataframe:

ID  Country Employee    Location
1   AE       Jay        AAA
2   AE       Mary       aa
3   AE       Peter      bbb
3   AE       Peter      ddd
6   DK       Donk       ddd
7   CZ       Cesar      fff
7   CZ       Cesar      GGg
7   CZ       Cesar      
8   CZ       Carlos     #

I need to use the below dataframe to confirm if the Location values are valid (according to their Country) and create an extra column named "Legacy Location Name" with the following:

  • If values match with lookup dataframe (no matter if uppercase or not), add to the "Legacy Location Name" column "CORRECT VALUE"

  • If value of Location is incorrect, please add to the "Legacy Location Name" the value previously used in the "Location" column and, in "Location", add the first value of an existing Location of the lookup dataframe

  • If the value of Location is blank (like in the second to last row), add to "Legacy Location Name" the value "LOCATION NOT PROVIDED" and, in "Location", add the first value of an exising Location of the lookup dataframe

lookup df:

Country Location
AE      bbb
AE      aaa
AE      ccc
DK      ddd
DK      eee
DK      fff
CZ      ggg
CZ      hhh

Output Expected

ID  Country Employee    Location    Legacy Location
1   AE      Jay         AAA         CORRECT VALUE
2   AE      Mary        bbb         aa
3   AE      Peter       bbb         CORRECT VALUE
3   AE      Peter       bbb         ddd
6   DK      Donk        ddd         CORRECT VALUE
7   CZ      Cesar       ggg         fff
7   CZ      Cesar       GGg         CORRECT VALUE
7   CZ      Cesar                   LOCATION NOT PROVIDED
8   CZ      Carlos      ggg         #

What is the best way to achieve it?

Thank you!


Solution

  • Not complicated, but requires many steps:

    s = (lookup_df.drop_duplicates('Country')
         .set_index('Country')['Location']
         )
    
    out = (df
     # handle location independently of case
     .assign(Location=df['Location'].str.casefold())
     # identify the correct values by merging 
     .merge(lookup_df.assign(**{'Legacy Location': 'CORRECT VALUE'}),
              how='left')
     # replace invalid locations
     .assign(**{'Location': lambda d: df['Location'].mask(d['Legacy Location'].isna()).fillna(df['Country'].map(s).mask(df['Location'].isna())),
     # add previous invalid locations
                'Legacy Location': lambda d: d['Legacy Location'].fillna(df['Location'].fillna('LOCATION NOT PROVIDED'))})
     
     )
    
    print(out)
    

    NB. Assuming all empty cells are NaNs for simplicity.

    Output:

       ID Country Employee Location        Legacy Location
    0   1      AE      Jay      AAA          CORRECT VALUE
    1   2      AE     Mary      bbb                     aa
    2   3      AE    Peter      bbb          CORRECT VALUE
    3   3      AE    Peter      bbb                    ddd
    4   6      DK     Donk      ddd          CORRECT VALUE
    5   7      CZ    Cesar      ggg                    fff
    6   7      CZ    Cesar      GGg          CORRECT VALUE
    7   7      CZ    Cesar      NaN  LOCATION NOT PROVIDED
    8   8      CZ   Carlos      ggg                      #