I need to group by multiple columns on a dataframe and calculate the rolling mean in the group. But the original index needs to be preserved.
Simple python code below :
data = {'values': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'type':['A','B','A','B','A','B','A','B','A','B','A','B','A','B','A'],
'type2':['C','D','C','D','C','D','C','D','C','D','C','D','C','D','C']}
df = pd.DataFrame(data)
window_size = 3
df['mean'] = (df.groupby(['type','type2'])['values'].rolling(window=3).mean().reset_index(drop=True))
print(df)
Output :
values type type2 mean
0 1 A C NaN
1 2 B D NaN
2 3 A C 3.0
3 4 B D 5.0
4 5 A C 7.0
5 6 B D 9.0
6 7 A C 11.0
7 8 B D 13.0
8 9 A C NaN
9 10 B D NaN
10 11 A C 4.0
11 12 B D 6.0
12 13 A C 8.0
13 14 B D 10.0
14 15 A C 12.0
What I need :
values type type2 mean
0 1 A C 1
1 2 B D 2
2 3 A C 2
3 4 B D 3
4 5 A C 3
5 6 B D 4
6 7 A C 5
7 8 B D 6
8 9 A C 7
9 10 B D 8
10 11 A C 9
11 12 B D 10
12 13 A C 11
13 14 B D 12
14 15 A C 13
The requirement is very simple. The mean has to be calcluated in the groups. So last row is group (A,C) . So it is 15+ 13(previous) + 11 (previous to previous because window is 3) = 39 /3 =13
Same for other rows.
But when I do this with level = 0 with below code I get
df['mean'] = (df.groupby(['type','type2'])['values'].rolling(window=3).mean().reset_index(level=0,drop=True))
raised in MultiIndex.from_tuples, see test_insert_error_msmgs
12690 if not value.index.is_unique:
12691 # duplicate axis
12692 raise err
12693
> 12694 raise TypeError(
12695 "incompatible index of inserted column with frame index"
12696 ) from err
12697 return reindexed_value, None
TypeError: incompatible index of inserted column with frame index
How to go about this simple requirement ?
You should use droplevel
:
cols = ['type', 'type2']
df['mean'] = (df.groupby(cols)['values']
.rolling(window=3).mean()
.droplevel(cols)
)
Output:
values type type2 mean
0 1 A C NaN
1 2 B D NaN
2 3 A C NaN
3 4 B D NaN
4 5 A C 3.0
5 6 B D 4.0
6 7 A C 5.0
7 8 B D 6.0
8 9 A C 7.0
9 10 B D 8.0
10 11 A C 9.0
11 12 B D 10.0
12 13 A C 11.0
13 14 B D 12.0
14 15 A C 13.0
And to avoid the NaNs, add min_periods=1
:
cols = ['type', 'type2']
df['mean'] = (df.groupby(cols)['values']
.rolling(window=3, min_periods=1).mean()
.droplevel(cols)
)
Output:
values type type2 mean
0 1 A C 1.0
1 2 B D 2.0
2 3 A C 2.0
3 4 B D 3.0
4 5 A C 3.0
5 6 B D 4.0
6 7 A C 5.0
7 8 B D 6.0
8 9 A C 7.0
9 10 B D 8.0
10 11 A C 9.0
11 12 B D 10.0
12 13 A C 11.0
13 14 B D 12.0
14 15 A C 13.0