I have a pandas dataframe for which I'm trying to compute an expanding windowed aggregation after grouping by columns. The data structure is something like this:
import pandas as pd

df = pd.DataFrame([['A',1,2015,4],['A',1,2016,5],['A',1,2017,6],['B',1,2015,10],['B',1,2016,11],['B',1,2017,12],
['A',1,2015,24],['A',1,2016,25],['A',1,2017,26],['B',1,2015,30],['B',1,2016,31],['B',1,2017,32],
['A',2,2015,4],['A',2,2016,5],['A',2,2017,6],['B',2,2015,10],['B',2,2016,11],['B',2,2017,12]],columns=['Typ','ID','Year','dat'])\
.sort_values(by=['Typ','ID','Year'])
i.e.
Typ ID Year dat
0 A 1 2015 4
6 A 1 2015 24
1 A 1 2016 5
7 A 1 2016 25
2 A 1 2017 6
8 A 1 2017 26
12 A 2 2015 4
13 A 2 2016 5
14 A 2 2017 6
3 B 1 2015 10
9 B 1 2015 30
4 B 1 2016 11
10 B 1 2016 31
5 B 1 2017 12
11 B 1 2017 32
15 B 2 2015 10
16 B 2 2016 11
17 B 2 2017 12
I need to group this dataframe by the columns Typ and ID, then compute an expanding mean of all observations by Year. The code I've written is
df.groupby(by=['Typ','ID','Year']).expanding().mean().reset_index()
from which I expect output like this (ignoring level_3):
Typ ID Year level_3 dat
0 A 1 2015 6 14.0
1 A 1 2016 7 14.5
2 A 1 2017 8 15.0
3 A 2 2015 12 4.0
4 A 2 2016 13 4.5
5 A 2 2017 14 5.0
6 B 1 2015 9 20.0
7 B 1 2016 10 20.5
8 B 1 2017 11 21.0
9 B 2 2015 15 10.0
10 B 2 2016 16 10.5
11 B 2 2017 17 11.0
Grouping by ['Typ','ID','Year'] should result in a single row for each unique combination of these columns. Instead, the code is giving this:
Typ ID Year level_3 dat
0 A 1 2015 0 4.0
1 A 1 2015 6 14.0
2 A 1 2016 1 5.0
3 A 1 2016 7 15.0
4 A 1 2017 2 6.0
5 A 1 2017 8 16.0
6 A 2 2015 12 4.0
7 A 2 2016 13 5.0
8 A 2 2017 14 6.0
9 B 1 2015 3 10.0
10 B 1 2015 9 20.0
11 B 1 2016 4 11.0
12 B 1 2016 10 21.0
13 B 1 2017 5 12.0
14 B 1 2017 11 22.0
15 B 2 2015 15 10.0
16 B 2 2016 16 11.0
17 B 2 2017 17 12.0
The expanding() windowing function does not seem to work correctly with groupby, or at least it is not behaving as I expect. What am I doing wrong?
Edit: I see now what I was doing wrong: I expected groupby and expanding to integrate differently. So now my question is: how can I use pandas to get the output I want, without any manual iteration?
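The behaviour described above can be reproduced on a smaller frame. The sketch below is only an illustration, assuming a four-row subset of the data (the names per_row and per_group are made up for this example): expanding() returns one value per input row inside each group, whereas a plain aggregation returns one value per group.

```python
import pandas as pd

# Illustrative four-row subset of the question's data (Typ A, ID 1 only).
df = pd.DataFrame(
    {
        "Typ": ["A", "A", "A", "A"],
        "ID": [1, 1, 1, 1],
        "Year": [2015, 2015, 2016, 2016],
        "dat": [4, 24, 5, 25],
    }
)

keys = ["Typ", "ID", "Year"]

# Expanding mean: one result for every row, restarted in each group.
per_row = df.groupby(keys)["dat"].expanding().mean()

# Plain mean: one result for every group.
per_group = df.groupby(keys)["dat"].mean()

print(len(per_row), len(per_group))
```

Because Year is one of the grouping keys, the window restarts every year, so nothing accumulates from one year to the next.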
As far as I know, the expanding mean is calculated differently: it returns one value per row rather than one per group. For the output you want, I would combine groupby and cumsum, then divide the cumulative sum by the cumulative count:
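# Per-year sum and count, then running totals within each (Typ, ID) group
newDf = df.groupby(['Typ','ID','Year'])['dat'].agg(['sum', 'count']).groupby(['Typ','ID']).cumsum()
# Expanding mean = cumulative sum / cumulative count
newDf['dat'] = newDf['sum']/newDf['count']
newDf = newDf.reset_index().drop(['count', 'sum'], axis = 1)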
Output:
Typ ID Year dat
0 A 1 2015 14.0
1 A 1 2016 14.5
2 A 1 2017 15.0
3 A 2 2015 4.0
4 A 2 2016 4.5
5 A 2 2017 5.0
6 B 1 2015 20.0
7 B 1 2016 20.5
8 B 1 2017 21.0
9 B 2 2015 10.0
10 B 2 2016 10.5
11 B 2 2017 11.0
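As a cross-check, the same numbers can probably also be reached with expanding() itself. This is an alternative sketch, not the approach above: it assumes the frame is sorted by Typ, ID and Year, computes the expanding mean within each (Typ, ID) group, and keeps only the last row of each year. The names running and result are made up for this example.

```python
import pandas as pd

rows = [
    ['A', 1, 2015, 4], ['A', 1, 2016, 5], ['A', 1, 2017, 6],
    ['B', 1, 2015, 10], ['B', 1, 2016, 11], ['B', 1, 2017, 12],
    ['A', 1, 2015, 24], ['A', 1, 2016, 25], ['A', 1, 2017, 26],
    ['B', 1, 2015, 30], ['B', 1, 2016, 31], ['B', 1, 2017, 32],
    ['A', 2, 2015, 4], ['A', 2, 2016, 5], ['A', 2, 2017, 6],
    ['B', 2, 2015, 10], ['B', 2, 2016, 11], ['B', 2, 2017, 12],
]
df = (pd.DataFrame(rows, columns=['Typ', 'ID', 'Year', 'dat'])
        .sort_values(by=['Typ', 'ID', 'Year']))

# Expanding mean over all observations seen so far within each (Typ, ID) group.
# The rows must already be in Year order for this to be meaningful.
running = df.groupby(['Typ', 'ID'])['dat'].transform(lambda s: s.expanding().mean())

# The last row of each year has seen every observation up to and including
# that year, so keeping it gives one row per (Typ, ID, Year).
result = (df.assign(dat=running)
            .drop_duplicates(subset=['Typ', 'ID', 'Year'], keep='last')
            .reset_index(drop=True))

print(result)
```

The order of rows within a single year does not matter here, since only the last value of the year is kept and it already includes every row of that year.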