python-3.x pandas dataframe time-series weighted-average

Calculating weighted sum over different time series using pd.concat


I have multiple time series with different data densities and different lengths, and I want to calculate their weighted sum. For example, df1, df2 and df3:

Out[467]: 
datetime_doy
2017-01-01    0.308632
2017-01-02    0.307647
2017-01-03    0.306493
2017-01-04    0.292955
2017-01-10    0.369009
  
2019-12-27    0.387553
2019-12-28    0.383481
2019-12-29    0.382838
2019-12-30    0.379383
2019-12-31    0.379172
Name: df1, Length: 1055, dtype: float64

datetime_doy
2017-01-01    0.310446
2017-01-02    0.309330
2017-01-03    0.308632
2017-01-04    0.306234
2017-01-10    0.317367
  
2019-12-27    0.387510
2019-12-28    0.383549
2019-12-29    0.382762
2019-12-30    0.379483
2019-12-31    0.379078
Name: df2, Length: 1042, dtype: float64


datetime_doy
2017-01-01    0.302718
2017-01-02    0.301939
2017-01-03    0.301440
2017-01-04    0.300281
2017-01-05    0.299731
  
2017-08-27    0.227604
2017-08-28    0.227431
2017-08-30    0.227167
2017-08-31    0.237400
2017-09-01    0.243424
Name: df3, Length: 227, dtype: float64

I know that to calculate the plain mean I can just use pd.concat([df1, df2, df3], axis=1).mean(axis=1), like this:

pd.concat([df1, df2, df3],axis=1).mean(axis=1)
Out[475]: 
datetime_doy
2017-01-01    0.307265
2017-01-02    0.306305
2017-01-03    0.305522
2017-01-04    0.299823
2017-01-05    0.299731
  
2019-12-27    0.387532
2019-12-28    0.383515
2019-12-29    0.382800
2019-12-30    0.379433
2019-12-31    0.379125
Length: 1065, dtype: float64

But what if I want to calculate the weighted average of df1, df2 and df3? Say the weights are 0.1, 0.2 and 0.3. If at time t only df1 and df2 have values, the new value is (0.1*df1.iloc[t] + 0.2*df2.iloc[t])/(0.1+0.2). If at time t df1, df2 and df3 all have values, it is (0.1*df1.iloc[t] + 0.2*df2.iloc[t] + 0.3*df3.iloc[t])/(0.1+0.2+0.3). If none of them has a value at time t, the result is simply np.nan (note that df3 only has data in 2017). How can I get this? Thanks!


Solution

  • One way to solve this is to create a separate pd.DataFrame for the weights. That way you get, for each day, the sum of the weighted values and the sum of the weights that apply on that day as two separate columns. Here is an example to illustrate the idea:

    import numpy as np
    import pandas as pd

    a = ["2022-12-01", "2022-12-02", "2022-12-03", "2022-12-04", "2022-12-05"]
    b = ["2022-12-03", "2022-12-04", "2022-12-05", "2022-12-06", "2022-12-07"]
    c = ["2022-12-05", "2022-12-06", "2022-12-07", "2022-12-08", "2022-12-09"]
    
    WEIGHT1 = 0.1
    WEIGHT2 = 0.2
    WEIGHT3 = 0.3
    
    df1 = pd.DataFrame(data = np.random.normal(size=5), index=a, columns=["a"])
    df2 = pd.DataFrame(data = np.random.normal(size=5), index=b, columns=["b"])
    df3 = pd.DataFrame(data = np.random.normal(size=5), index=c, columns=["c"])
    

    The dates and weights above follow your example. As in your question, some dates appear in all three dataframes, some in only two, and some in just one. The values themselves are random.

    df1_weight = pd.DataFrame(data = WEIGHT1, index=df1.index, columns=["weight1"])
    df2_weight = pd.DataFrame(data = WEIGHT2, index=df2.index, columns=["weight2"])
    df3_weight = pd.DataFrame(data = WEIGHT3, index=df3.index, columns=["weight3"])
    
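    # Per-date sum of the weighted values and sum of the weights present on that date
    sum_values = pd.concat([df1*WEIGHT1, df2*WEIGHT2, df3*WEIGHT3], axis=1).sum(axis=1).rename("sum_values")
    sum_weights = pd.concat([df1_weight, df2_weight, df3_weight], axis=1).sum(axis=1).rename("sum_weights")
    result = sum_values.to_frame().join(sum_weights)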
    

    The idea is to create three weight dataframes, one per input, and concat them the same way you did in the question. The last step concatenates all the weighted values and all the weights, then sums each across columns for every date. Dividing the sum_values column by the sum_weights column then gives the weighted average you want.

    Hope it helps!