
Use a dataframe as lookup for another dataframe


I have two dataframes, df_1 and df_2.

df_1 is my master dataframe and df_2 is a lookup dataframe.

I want to test whether the value in df_1['col_c1'] contains any of the values from df_2['col_a2'].
If it does (there can be multiple matches!):

  • add the value(s) from df_2['col_b2'] to df_1['col_d1']
  • add the value(s) from df_2['col_c2'] to df_1['col_e1']

How can I achieve this?
I really have no idea where to start, so I can't share any code of my own.

Sample df_1

col_a1 | col_b1 | col_c1         | col_d1 | col_e1
----------------------------------------------------
1_001  | aaaaaa | bbbbccccdddd   |        | 
1_002  | zzzzz  | ggggjjjjjkkkkk |        | 
1_003  | pppp   | qqqqffffgggg   |        | 
1_004  | sss    | wwwcccyyy      |        |
1_005  | eeeeee | eecccffffll    |        |
1_006  | tttt   | hhggeeuuuuu    |        |

Sample df_2

col_a2 | col_b2 | col_c2
------------------------------
ccc    | 2_001  | some_data_c
jjj    | 2_002  | some_data_j
fff    | 2_003  | some_data_f

Desired output df_1

col_a1 | col_b1 | col_c1         | col_d1       | col_e1
------------------------------------------------------------------------------
1_001  | aaaaaa | bbbbccccdddd   | 2_001        | some_data_c
1_002  | zzzzz  | ggggjjjjjkkkkk | 2_002        | some_data_j
1_003  | pppp   | qqqqffffgggg   | 2_003        | some_data_f
1_004  | sss    | wwwcccyyy      | 2_001        | some_data_c
1_005  | eeeeee | eecccffffll    | 2_001;2_003  | some_data_c; some_data_f
1_006  | tttt   | hhggeeuuuuu    |              |

df_1 has approximately 45,000 rows and df_2 approximately 16,000 rows. (I also added a non-matching row, 1_006, to the sample.)
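For reference, the sample frames above can be built roughly like this (just a transcription of the tables; making col_d1 / col_e1 start out as empty strings is my own assumption):

    import pandas as pd

    # sample data transcribed from the tables above
    df_1 = pd.DataFrame({
        'col_a1': ['1_001', '1_002', '1_003', '1_004', '1_005', '1_006'],
        'col_b1': ['aaaaaa', 'zzzzz', 'pppp', 'sss', 'eeeeee', 'tttt'],
        'col_c1': ['bbbbccccdddd', 'ggggjjjjjkkkkk', 'qqqqffffgggg',
                   'wwwcccyyy', 'eecccffffll', 'hhggeeuuuuu'],
        'col_d1': '',
        'col_e1': '',
    })
    df_2 = pd.DataFrame({
        'col_a2': ['ccc', 'jjj', 'fff'],
        'col_b2': ['2_001', '2_002', '2_003'],
        'col_c2': ['some_data_c', 'some_data_j', 'some_data_f'],
    })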

I've been struggling with this for hours, but I really have no idea how to approach it.
I don't think a plain merge is an option because there is no exact match.
Your help is greatly appreciated.


Solution

  • Use:

    # extract every occurrence of a df_2["col_a2"] value from col_c1 (one row per match)
    s = (df_1['col_c1'].str.extractall(f'({"|".join(df_2["col_a2"])})')[0].rename('new')
                       .reset_index(level=1, drop=True))
    
    # the join repeats df_1 rows that have more than one match
    df_1 = df_1.join(s)
    # add the new columns by mapping the matched value against df_2
    df_1['col_d1'] = df_1['new'].map(df_2.set_index('col_a2')['col_b2'])
    df_1['col_e1'] = df_1['new'].map(df_2.set_index('col_a2')['col_c2'])
    # aggregate the duplicated rows back into one row per original row
    cols = df_1.columns.difference(['new','col_d1','col_e1']).tolist()
    df = df_1.drop('new', axis=1).groupby(cols).agg(','.join).reset_index()
    
    print (df)
      col_a1  col_b1          col_c1       col_d1                   col_e1
    0  1_001  aaaaaa    bbbbccccdddd        2_001              some_data_c
    1  1_002   zzzzz  ggggjjjjjkkkkk        2_002              some_data_j
    2  1_003    pppp    qqqqffffgggg        2_003              some_data_f
    3  1_004     sss       wwwcccyyy        2_001              some_data_c
    4  1_005  eeeeee     eecccffffll  2_001,2_003  some_data_c,some_data_f
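
  • Two caveats, with a defensive variant below (this is my own sketch, not part of the code above): the lookup values are inserted into a regex, so they should be escaped with re.escape if they may contain special characters, and with the non-matching row 1_006 present the ','.join aggregation hits NaN and the string join fails on recent pandas versions. The sketch starts again from the original df_1; the fillna('') and the ';' separator are choices made to match the desired output:

    import re
    
    # escape the lookup values so regex metacharacters in col_a2 cannot break the pattern
    pattern = f'({"|".join(re.escape(v) for v in df_2["col_a2"])})'
    s = (df_1['col_c1'].str.extractall(pattern)[0].rename('new')
                       .reset_index(level=1, drop=True))
    
    df_1 = df_1.join(s)
    df_1['col_d1'] = df_1['new'].map(df_2.set_index('col_a2')['col_b2'])
    df_1['col_e1'] = df_1['new'].map(df_2.set_index('col_a2')['col_c2'])
    
    # fill NaN with '' so rows without any match (like 1_006) survive the string join
    cols = df_1.columns.difference(['new', 'col_d1', 'col_e1']).tolist()
    df = (df_1.drop('new', axis=1)
              .fillna('')
              .groupby(cols)
              .agg(';'.join)
              .reset_index())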