Search code examples
pythonpandasdataframematrix-multiplication

Calculate and add up Data from a reference dataframe


I have two pandas dataframes. The first one contains some data I want to multiplicate with the second dataframe which is a reference table.

So in my example I want to get a new column in df1 for every column in my reference table - but also add up every row in that column. Like this (Index 205368421 with R21 17): (1205 * 0.526499) + (7562* 0.003115) + (1332* 0.000267) = 658

In Excel VBA I iterated through both tables and did it that way - but it took very long. I've read that pandas is way better for this without iterating.

df1 = pd.DataFrame({'Index': ['205368421', '206321177','202574796','200212811', '204376114'], 
              'L1.09A': [1205,1253,1852,1452,1653],
              'L1.10A': [7562,7400,5700,4586,4393],
              'L1.10C': [1332, 0, 700,1180,290]})

df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'], 
              'R21 17': [0.526499,0.003115,0.000267],
              'R21 26': [0.458956,0,0.001819]})
Index      1.09A L1.10A L1.10C
205368421  1205  7562   1332
206321177  1253  7400   0
202574796  1852  5700   700
200212811  1452  4586   1180
204376114  1653  4393   290

WorkerID R21 17   R21 26
L1.09A   0.526499 0.458956
L1.10A   0.003115 0
L1.10C   0.000267 0.001819

I want this:

Index       L1.09A  L1.10A  L1.10C  R21 17  R21 26
205368421   1205    7562    1332    658     555
206321177   1253    7400    0       683     575
202574796   1852    5700    700     993     851
200212811   1452    4586    1180    779     669
204376114   1653    4393    290     884     759

I would be okay with some hints. Like someone told me this might be matrix multiplication. So .dot() would be helpful. Is this the right direction?

Edit: I have now done the following:

df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')

common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]

df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)

df_multiplied = df1_sorted @ df2_sorted

This works with my example dataframes, but not with my real dataframes. My real ones have these dimensions: df1_sorted(10429, 69) and df2_sorted(69, 18).

It should work, but my df_multiplied is full with NaN.


Solution

  • Alright, I did it!

    I had to replace all nan with 0.

    So the final solution is:

    df1 = df1.set_index('Index')
    df2 = df2.set_index('WorkerID')
    
    common_cols = list(set(df1.columns).intersection(df2.index))
    df2 = df2.loc[common_cols]
    
    df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
    df2_sorted = df2.sort_index(axis=0)
    
    df1_sorted= df1_sorted.fillna(0)
    df2_sorted= df2_sorted.fillna(0)
    
    df_multiplied = df1_sorted @ df2_sorted