Search code examples
pythonpandasdataframedata-preprocessing

How to get 2 data frames matched based on relation between two columns of the two dataframe


I have two dataframes with data in the form of

Date     Col1 Col2 Col3
1/1/2021  a    b    c
2/1/2021  d    e    f
3/1/2021  g    h    1

Date     Col4 Col5 Col6 
1/1/2021  a    b   c
2/1/2021  d    e   f
3/1/2021  g    h   i

I have a relation that says

Cola Colb
Col1 Col4
Col2 Col5
Col3 Col6

What I am trying to achieve is

Date     Cola Colb 
1/1/2021   a    a
2/1/2021   d    d
3/1/2021   g    g
1/1/2021   b    b
2/1/2021   e    e
3/1/2021   h    h
1/1/2021   c    c
2/1/2021   f    f
3/1/2021   l    l

My original plan was to use unpivot the initial data frame and use merge to get the data, but since the data set is huge, it is not a viable option. Can anyone suggest an efficient way to combine the data frames based on the relation between the columns?


Solution

  • You could convert the mapper_df:

    Cola    Colb
    0   Col1    Col4
    1   Col2    Col5
    2   Col3    Col6
    

    to a dictionary and modify the column names of df2. Then stack the DataFrames and join on "Date":

    d = mapper_df.set_index('Cola')['Colb'].to_dict()
    df2.columns = df2.columns.map({k: d.get(k, k) for k in df2.columns})
    out = df1.set_index('Date').stack().to_frame().join(df2.set_index('Date').stack().to_frame(), lsuffix='_Cola', rsuffix='_Colb').droplevel(1)
    out = out.rename(columns={c:c.split('_')[1] for c in out.columns})
    

    Output:

             Cola Colb
    Date              
    1/1/2021    a    a
    1/1/2021    b    b
    1/1/2021    c    c
    2/1/2021    d    d
    2/1/2021    e    e
    2/1/2021    f    f
    3/1/2021    g    g
    3/1/2021    h    h
    3/1/2021    1    i