df = pd.DataFrame(
    {"date": [pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-01"),
              pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-03"),
              pd.Timestamp("2022-01-03"), pd.Timestamp("2022-01-03"),
              pd.Timestamp("2022-01-05")],
     "numbers": [1, 2, 4, 4, 11, 7, 5],
     "grouper": [1, 0, 1, 0, 1, 0, 0]}
)
Given this df, how would I get the rolling mean of numbers over the dates strictly before each row's date column? E.g. the rolling average over the past 3 days, grouped by ["grouper", "date"].
I know I can do something like this, but it is not even close to a solution - only a starting point I am looking to build on:
df["av"] = df.shift(1).rolling(window=3).mean()
The problem is that the shift is not dynamic, so the window still includes rows from the current date.
My expected output for the new av column, with a 3-day window and grouping by the two columns of the sample df, would be:
date numbers grouper av
0 2022-01-01 1 1 NaN
1 2022-01-01 2 0 NaN
2 2022-01-01 4 1 NaN
3 2022-01-03 4 0 2.0
4 2022-01-03 11 1 2.5
5 2022-01-03 7 0 2.0
6 2022-01-05 5 0 5.5
You need the average computed from its definition - sum / count - so that days with different row counts are weighted correctly.
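Why sum / count rather than a rolling mean of the per-day means? Because a mean of daily means weights every day equally, regardless of how many rows it has. A minimal illustration with the grouper-0 values from the sample (2 on 2022-01-01, then 4 and 7 on 2022-01-03):

```python
# Values for grouper 0 on the two days that fall inside one window.
day1 = [2]          # 2022-01-01
day3 = [4, 7]       # 2022-01-03

# Correct rolling average: total sum divided by total count.
true_mean = sum(day1 + day3) / len(day1 + day3)      # 13 / 3 ≈ 4.33

# Naive rolling mean of daily means over-weights the single-row day.
daily_means = [sum(day1) / len(day1), sum(day3) / len(day3)]
mean_of_means = sum(daily_means) / len(daily_means)  # (2.0 + 5.5) / 2 = 3.75
```

That is why the solution rolls the sums and the sizes separately and divides only at the end - 4.333333 is exactly the 2022-01-04 value that shows up in the intermediate tables.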
df1 = (df.groupby(['date','grouper'])['numbers']
.agg(['sum','size'])
.unstack()
.asfreq('d', fill_value=0)
.rolling(window=3, min_periods=1)
.sum()
.shift()
.stack()
)
df = df.join(df1['sum'].div(df1['size']).rename('av'), on=['date','grouper'])
print (df)
date numbers grouper av
0 2022-01-01 1 1 NaN
1 2022-01-01 2 0 NaN
2 2022-01-01 4 1 NaN
3 2022-01-03 4 0 2.0
4 2022-01-03 11 1 2.5
5 2022-01-03 7 0 2.0
6 2022-01-05 5 0 5.5
Explanation:
First aggregate sum and count by GroupBy.size:
print (df.groupby(['date','grouper'])['numbers']
.agg(['sum','size'])
)
sum size
date grouper
2022-01-01 0 2 1
1 5 2
2022-01-03 0 11 2
1 11 1
2022-01-05 0 5 1
Then reshape by DataFrame.unstack to get a DatetimeIndex:
print (df.groupby(['date','grouper'])['numbers']
.agg(['sum','size'])
.unstack()
)
sum size
grouper 0 1 0 1
date
2022-01-01 2.0 5.0 1.0 2.0
2022-01-03 11.0 11.0 2.0 1.0
2022-01-05 5.0 NaN 1.0 NaN
Add missing consecutive dates by DataFrame.asfreq:
print (df.groupby(['date','grouper'])['numbers']
.agg(['sum','size'])
.unstack()
.asfreq('d', fill_value=0)
)
sum size
grouper 0 1 0 1
date
2022-01-01 2.0 5.0 1.0 2.0
2022-01-02 0.0 0.0 0.0 0.0
2022-01-03 11.0 11.0 2.0 1.0
2022-01-04 0.0 0.0 0.0 0.0
2022-01-05 5.0 NaN 1.0 NaN
Then use rolling sum (processing the sums and counts together):
print (df.groupby(['date','grouper'])['numbers']
.agg(['sum','size'])
.unstack()
.asfreq('d', fill_value=0)
.rolling(window=3, min_periods=1)
.sum()
)
sum size
grouper 0 1 0 1
date
2022-01-01 2.0 5.0 1.0 2.0
2022-01-02 2.0 5.0 1.0 2.0
2022-01-03 13.0 16.0 3.0 3.0
2022-01-04 11.0 11.0 2.0 1.0
2022-01-05 16.0 11.0 3.0 1.0
Then exclude the current day's data with DataFrame.shift:
print (df.groupby(['date','grouper'])['numbers']
.agg(['sum','size'])
.unstack()
.asfreq('d', fill_value=0)
.rolling(window=3, min_periods=1)
.sum()
.shift()
)
sum size
grouper 0 1 0 1
date
2022-01-01 NaN NaN NaN NaN
2022-01-02 2.0 5.0 1.0 2.0
2022-01-03 2.0 5.0 1.0 2.0
2022-01-04 13.0 16.0 3.0 3.0
2022-01-05 11.0 11.0 2.0 1.0
Reshape back by DataFrame.stack:
print (df.groupby(['date','grouper'])['numbers']
.agg(['sum','size'])
.unstack()
.asfreq('d', fill_value=0)
.rolling(window=3, min_periods=1)
.sum()
.shift()
.stack()
)
sum size
date grouper
2022-01-02 0 2.0 1.0
1 5.0 2.0
2022-01-03 0 2.0 1.0
1 5.0 2.0
2022-01-04 0 13.0 3.0
1 16.0 3.0
2022-01-05 0 11.0 2.0
1 11.0 1.0
For the averages, divide the sum column by the size column:
print (df1['sum'].div(df1['size']).rename('av'))
date grouper
2022-01-02 0 2.000000
1 2.500000
2022-01-03 0 2.000000
1 2.500000
2022-01-04 0 4.333333
1 5.333333
2022-01-05 0 5.500000
1 11.000000
Name: av, dtype: float64
And join back to the original DataFrame:
df = df.join(df1['sum'].div(df1['size']).rename('av'), on=['date','grouper'])
print (df)
date numbers grouper av
0 2022-01-01 1 1 NaN
1 2022-01-01 2 0 NaN
2 2022-01-01 4 1 NaN
3 2022-01-03 4 0 2.0
4 2022-01-03 11 1 2.5
5 2022-01-03 7 0 2.0
6 2022-01-05 5 0 5.5