Search code examples
pythonpandasdataframeweighted

weighted sum on multiple dataframes


I have several dataframes with an ID, a time of day, and a number. I would like to weight each dataframe number and then sum them for each id/time of day. As an example:

weighted 0.2
    ID   TOD         M 
0    10   morning    1  
1    13   afternoon  3  
2    32   evening    2
3    10   evening    2

weighted 0.4
    ID   TOD         W 
0    10   morning    1  
1    13   morning    3  
2    32   afternoon  2
3    10   evening    3

weighted sum:

    ID   TOD         weighed_sum_mw
0    10   morning    (0.2*1 + 0.4*1)
1    10   evening    (0.2*2 + 0.4*3)
2    13   morning    (0.4*3)
3    13   afternoon  (0.4*2)
3    32   evening    (0.2*2)
4    32   afternoon  (0.4*2)

The following strategy works but is very memory consuming and I'm not sure if there's a way to do this without merging them. I also eventually only need the row with the time of day that has the largest sum for each ID so if that simplifies the process that works as well! (Tie breakers for equal max weighted sums will keep Afternoon, then Evening, then Morning first). I currently do this with 4 dataframes but may add more and they are approximately 10M rows each

merged_oc= pd.merge(dfs[0], dfs[3], on=['ID', 'TIME_OF_DAY'], suffixes=('_O', '_C'), how='outer')
merged_s = pd.merge(dfs[1], dfs[2], on=['ID', 'TIME_OF_DAY'], suffixes=('_W', 'M'), how='outer')

# merge and weighted sum of O and C
merged_oc['COUNTS_O_weighted_02']= merged_oc['COUNTS_O'].fillna(0).multiply(0.2)
merged_oc['COUNTS_C_weighted_04'] = merged_oc['COUNTS_C'].fillna(0).multiply(0.4)
merged_oc['COUNTS'] = merged_oc['COUNTS_O_weighted_02'] + merged_oc['COUNTS_C_weighted_04']
result_oc = merged_oc[['ID', 'TIME_OF_DAY', 'COUNTS', 'COUNTS_O_weighted_02', 'COUNTS_C_weighted_04']]

merged_s['COUNTS_W_weighted_04'] = merged_s['COUNTS_W'].fillna(0).multiply(0.4)
merged_s['COUNTS_M_weighted_04'] = merged_s['COUNTS_M'].fillna(0).multiply(0.4)
merged_s['COUNTS'] = merged_s['COUNTS_W_weighted_04'] + merged_s['COUNTS_M_weighted_04']
result_s = merged_s[['ID', 'TIME_OF_DAY', 'COUNTS', 'COUNTS_W_weighted_04', 'COUNTS_M_weighted_04']]

merged_final = pd.merge(result_oc, result_s, on=['ID', 'TIME_OF_DAY'], suffixes=('_OC', '_S'), how='outer')

merged_final['COUNTS_OC']= merged_final['COUNTS_OC'].fillna(0)
merged_final['COUNTS_S'] = merged_final['COUNTS_S'].fillna(0)
merged_final['WEIGHTED_SUM'] = merged_final['COUNTS_OC'] + merged_final['COUNTS_SESSION']
merged_final = merged_final[['ID', 'TIME_OF_DAY', 'WEIGHTED_SUM', 'COUNTS_O_weighted_02', 'COUNTS_C_weighted_04', 'COUNTS_W_weighted_04', 'COUNTS_M_weighted_04']].fillna(0)

Solution

  • IIUC, you can try pd.concat the dataframes after you set index and multiply by your weights for each dataframe, then use groupby and sum:

    df_out = pd.concat([df1_2.set_index(['ID', 'TOD']).mul(.2), 
                        df2_4.set_index(['ID', 'TOD']).mul(.4)])\
               .sum(axis=1)\
               .groupby(level=[0,1])\
               .sum()\
               .reset_index()
    
    df_out
    

    Output:

       ID        TOD    0
    0  10    evening  1.6
    1  10    morning  0.6
    2  13  afternoon  0.6
    3  13    morning  1.2
    4  32  afternoon  0.8
    5  32    evening  0.4