Search code examples
pythonpandasdataframeweighted-average

Weighted average grouping by date (in index) in pandas DataFrame


I need some help trying to obtain a weighted average for a column, based on datetime-index and another 'mass' column. For example:

d = {'date': ['2021-08-01 12:00:00', '2021-08-01 13:00:00', '2021-08-01 14:00:00', '2021-08-02 15:00:00', '2021-08-02 16:00:00', '2021-08-02 17:00:00'], 
     'mass': [23,   40,  10,  12,  15,  11], 
     '%':    [0.4, 0.7, 0.9, 0.1, 0.2, 0.8]
    }
df = pd.DataFrame(data=d)
df.set_index('date')

What I need is to obtain the weighted average of '%' every 2 hours, being 'mass' the column of weights.

I need the output to be like this:

date(index)         | mass |  %
2021-08-01 13:00:00 |  43  | 0.865
2021-08-02 15:00:00 |  22  | 0.464
2021-08-02 17:00:00 |  26  | 0.454

Being '%' calculated as weighted average:

0.865 = ((23 x 0.4) + (40 x 0.7))/ (40 + 23)

Please note that I may not have continuous datetimes in index, for example, I could skip several hours because of lack of data, or even days. In that case, I need the output to be NaN if there is no data to calculate with. In case there is only one row of data, assume that row to be the weighted average for the period...

Periods may vary as well, from 2 hours to 12 months...

On a previous question, someone already helped me with this:

out = df.assign(k=df['mass'].mul(df['%']))[['mass','k']].sum(level=0)
out['%'] = out.pop('k').div(out['mass'])

Though, this works only when I have repeated indexes and this time this is not the case.


Solution

  • Try grouping every 2 hours and you will get closer -

    d = {'date': ['2021-08-01 12:00:00', '2021-08-01 13:00:00', '2021-08-01 14:00:00', '2021-08-02 15:00:00', '2021-08-02 16:00:00', '2021-08-02 17:00:00'], 
         'mass': [23,   40,  10,  12,  15,  11], 
         '%':    [0.4, 0.7, 0.9, 0.1, 0.2, 0.8]
        }
    df = pd.DataFrame(data=d)
    df['date'] = pd.to_datetime(df['date'])
    df = df.set_index('date')
    
    df['mass_wt'] = df['mass'] * df['%']
    op = df.groupby(pd.Grouper(freq='2H')).agg({'mass': 'sum', 'mass_wt': 'sum'}).query('mass > 0')
    op['op'] = op['mass_wt'] / op['mass']