Search code examples
python-3.xpandasdataframegroup-bymapping

Map two dataframes, based on their group/id, in increasing order of the closer values in Python


I have two datafames as such:

#Load the required libraries
import pandas as pd
import matplotlib.pyplot as plt

#Create dataset_1
data_set_1 = {'id': [1,
                     2,
                     3, 
                     4,
                     5,
                     ],
              'Available_Salary': [10,
                                   20,
                                   30,
                                   40,
                                   50,
                                   ],
              }

#Convert to dataframe_1
df_1 = pd.DataFrame(data_set_1)
print("\n df_1 = \n",df_1)



#Create dataset_2
data_set_2 = {'id': [1, 
                     2, 
                     3, 
                     4,
                     ],
              'Expected_Salary': [16,
                                  42,
                                  28,
                                  32,
                                  ],
        }

#Convert to dataframe_2
df_2 = pd.DataFrame(data_set_2)
print("\n df_2 = \n",df_2)

Here, visually I can say, 'Expected_Salary' 16 (with id=1), is closer to 'Available_Salary' in the order as: 20 (with id=2), 10 (with id=1), 30 (with id=3), 40 (with id=4), 50 (with id=5)

Likewise, 'Expected_Salary' 42(with id=2), is closer to 'Available_Salary' in the order as: 40 (with id=4), 50 (with id=5), 30 (with id=3), 20 (with id=2), 10 (with id=1) and so on.

The same logic follows for other id's of df_2 to map with df_1.

I wish to see the result in a dataframe as such:

enter image description here

Can somebody please let me know how to achieve this task in Python?


Solution

  • Perform a cross-merge, compute the difference between the expected and available salary and use this to sort_values:

    out = (df_2
       .merge(df_1.rename(columns={'id': 'id_df1'}), how='cross')
       .eval('diff=abs(Available_Salary-Expected_Salary)')
       .sort_values(by=['id', 'diff'], ignore_index=True)
       .drop(columns='diff')
    )
    

    Output:

        id  Expected_Salary  id_df1  Available_Salary
    0    1               16       2                20
    1    1               16       1                10
    2    1               16       3                30
    3    1               16       4                40
    4    1               16       5                50
    5    2               42       4                40
    6    2               42       5                50
    7    2               42       3                30
    8    2               42       2                20
    9    2               42       1                10
    10   3               28       3                30
    11   3               28       2                20
    12   3               28       4                40
    13   3               28       1                10
    14   3               28       5                50
    15   4               32       3                30
    16   4               32       4                40
    17   4               32       2                20
    18   4               32       5                50
    19   4               32       1                10