Search code examples
pythonpandasdataframemergelookup-tables

Replace values from one column by comparing another column to a second DataFrame


I'm trying to replace NaN values from the air_store_id column of the df DataFrame using a particular criteria that involves searching into another DataFrame:

data = { 'air_store_id':     [ 'air_a1',   np.nan, 'air_a3',   np.nan,  'air_a5' ], 
         'hpg_store_id':     [ 'hpg_a1', 'hpg_a2',   np.nan, 'hpg_a4',    np.nan ],
                 'Test':     [ 'Alpha',    'Beta',  'Gamma',  'Delta', 'Epsilon' ]
       }

df = pd.DataFrame(data)
display(df)

enter image description here

When a NaN is found in df.air_store_id I would like to use the value from df.hpg_store_id (when there is one) to compare it to the same column in another DataFrame called id_table_df, and retrieve its air_store_id.

Here is what id_table_df looks like:

ids_data = { 'air_store_id':     [ 'air_a1', 'air_a4', 'air_a3', 'air_a2' ], 
             'hpg_store_id':     [ 'hpg_a1', 'hpg_a4', 'hpg_a3', 'hpg_a2' ] }

id_table_df = pd.DataFrame(ids_data)
display(id_table_df)

enter image description here

Simply put, for every NaN in df.air_store_id replace it for the appropriate equivalent in id_table_df.air_store_id by comparing df.hpg_store_id with id_table_df.hpg_store_id.

In this scenario, id_table_df ends up working as a lookup table. The resulting DataFrame would look like this:

enter image description here

I've tried to merge them with the following instruction, but an error is thrown:

df.loc[df.air_store_id.isnull(), 'air_store_id'] = df.merge(id_table_df, on='hpg_store_id', how='left')['air_store_id']

Error message:

KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:
...
...
...
KeyError: 'air_store_id'

Question 1: how can I accomplish?

Question 2: is there a way to do this at simultaneously for both columns (air_store_id and hpg_store_id)? If it were possible I wouldn't have to run the merge separately for each column.


Solution

  • Using pd.Series.map after using set_index on id_table_df

    df.fillna(
        df.hpg_store_id.map(
            id_table_df.set_index('hpg_store_id').air_store_id
        ).to_frame('air_store_id')
    )
    
          Test air_store_id hpg_store_id
    0    Alpha       air_a1       hpg_a1
    1     Beta       air_a2       hpg_a2
    2    Gamma       air_a3          NaN
    3    Delta       air_a4       hpg_a4
    4  Epsilon       air_a5          NaN
    

    Simultaneously

    v = id_table_df.values
    a2h = dict(v)
    h2a = dict(v[:, ::-1])
    df.fillna(
        pd.concat([
            df.hpg_store_id.map(h2a),
            df.air_store_id.map(a2h),
        ], axis=1, keys=['air_store_id', 'hpg_store_id'])
    )
    
          Test air_store_id hpg_store_id
    0    Alpha       air_a1       hpg_a1
    1     Beta       air_a2       hpg_a2
    2    Gamma       air_a3       hpg_a3
    3    Delta       air_a4       hpg_a4
    4  Epsilon       air_a5          NaN
    

    Creative Solution
    requires Python 3

    v = id_table_df.values
    a2h = dict(v)
    h2a = dict(v[:, ::-1])
    col = id_table_df.columns
    swch = dict(zip(col, col[::-1]))
    df.fillna(df.applymap({**a2h, **h2a}.get).rename(columns=swch))
    
          Test air_store_id hpg_store_id
    0    Alpha       air_a1       hpg_a1
    1     Beta       air_a2       hpg_a2
    2    Gamma       air_a3       hpg_a3
    3    Delta       air_a4       hpg_a4
    4  Epsilon       air_a5         None