
Pandas Windowing with Groupby not Working as Expected


I have a pandas dataframe for which I'm trying to compute an expanding windowed aggregation after grouping by columns. The data structure is something like this:

import pandas as pd

df = pd.DataFrame([['A',1,2015,4],['A',1,2016,5],['A',1,2017,6],['B',1,2015,10],['B',1,2016,11],['B',1,2017,12],
                   ['A',1,2015,24],['A',1,2016,25],['A',1,2017,26],['B',1,2015,30],['B',1,2016,31],['B',1,2017,32],
                   ['A',2,2015,4],['A',2,2016,5],['A',2,2017,6],['B',2,2015,10],['B',2,2016,11],['B',2,2017,12]],
                  columns=['Typ','ID','Year','dat']).sort_values(by=['Typ','ID','Year'])

i.e.

    Typ ID  Year    dat
0   A   1   2015    4
6   A   1   2015    24
1   A   1   2016    5
7   A   1   2016    25
2   A   1   2017    6
8   A   1   2017    26
12  A   2   2015    4
13  A   2   2016    5
14  A   2   2017    6
3   B   1   2015    10
9   B   1   2015    30
4   B   1   2016    11
10  B   1   2016    31
5   B   1   2017    12
11  B   1   2017    32
15  B   2   2015    10
16  B   2   2016    11
17  B   2   2017    12

I need to group this dataframe by the columns Typ and ID, then compute an expanding mean of all observations by Year, i.e. for each Year, the mean of every observation up to and including that Year. The code I've written is

df.groupby(by=['Typ','ID','Year']).expanding().mean().reset_index()

from which I expect output results like this (ignoring level_3):

    Typ ID  Year    level_3 dat
0   A   1   2015    6   14.0
1   A   1   2016    7   14.5
2   A   1   2017    8   15.0
3   A   2   2015    12  4.0
4   A   2   2016    13  4.5
5   A   2   2017    14  5.0
6   B   1   2015    9   20.0
7   B   1   2016    10  20.5
8   B   1   2017    11  21.0
9   B   2   2015    15  10.0
10  B   2   2016    16  10.5
11  B   2   2017    17  11.0

Grouping by ['Typ','ID','Year'] should result in a single row for each unique combination of these columns. Instead, the code is giving this:

    Typ ID  Year    level_3 dat
0   A   1   2015    0   4.0
1   A   1   2015    6   14.0
2   A   1   2016    1   5.0
3   A   1   2016    7   15.0
4   A   1   2017    2   6.0
5   A   1   2017    8   16.0
6   A   2   2015    12  4.0
7   A   2   2016    13  5.0
8   A   2   2017    14  6.0
9   B   1   2015    3   10.0
10  B   1   2015    9   20.0
11  B   1   2016    4   11.0
12  B   1   2016    10  21.0
13  B   1   2017    5   12.0
14  B   1   2017    11  22.0
15  B   2   2015    15  10.0
16  B   2   2016    16  11.0
17  B   2   2017    17  12.0

The expanding() windowing function does not seem to be working with the groupby correctly, or at least it is not behaving as I expect, given the logic. What am I doing wrong?
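The behaviour can be reproduced on a smaller frame. The sketch below uses a four-row subset of the data above (the variable names `per_year` and `across_years` are made up for illustration) and shows that the expanding window restarts inside every group, so including Year in the group keys confines each window to a single year.

```python
import pandas as pd

# Four rows taken from the question's data: one (Typ, ID) pair, two years.
df = pd.DataFrame(
    {
        "Typ": ["A", "A", "A", "A"],
        "ID": [1, 1, 1, 1],
        "Year": [2015, 2015, 2016, 2016],
        "dat": [4, 24, 5, 25],
    }
)

# The expanding window restarts inside every (Typ, ID, Year) group,
# so each Year only ever sees its own rows.
per_year = df.groupby(["Typ", "ID", "Year"])["dat"].expanding().mean()
print(per_year.tolist())  # [4.0, 14.0, 5.0, 15.0]

# Dropping Year from the keys lets the window run across years,
# but it still emits one value per input row, not one per Year.
across_years = df.groupby(["Typ", "ID"])["dat"].expanding().mean()
print(across_years.tolist())  # [4.0, 14.0, 11.0, 14.5]
```

In the second result, the last value of each year (14.0 and 14.5) matches the expected output, which suggests the desired figure is a cumulative mean across years rather than a per-year expanding mean.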

Edit: I see now what I'm doing wrong: I was expecting groupby and expanding to interact differently. So now my question is: how can I use pandas to get the output I want, without any manual iteration?


Solution

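  • As far as I know, the expanding mean is calculated differently: after a groupby it runs inside each group, so with Year among the keys it only ever sees the rows of a single year. For the output you want, I would combine groupby and cumsum: aggregate the sum and count per (Typ, ID, Year), accumulate both across years within each (Typ, ID), and then divide the cumulative sum by the cumulative count: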

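    # Per-year sum and count, then running totals across years within each (Typ, ID)
    newDf = df.groupby(['Typ','ID','Year'])['dat'].agg(['sum', 'count']).groupby(level=['Typ','ID']).cumsum()
    # Cumulative mean = cumulative sum / cumulative count
    newDf['dat'] = newDf['sum'] / newDf['count']
    newDf = newDf.reset_index().drop(columns=['sum', 'count'])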

    Output:

       Typ  ID  Year   dat
    0    A   1  2015  14.0
    1    A   1  2016  14.5
    2    A   1  2017  15.0
    3    A   2  2015   4.0
    4    A   2  2016   4.5
    5    A   2  2017   5.0
    6    B   1  2015  20.0
    7    B   1  2016  20.5
    8    B   1  2017  21.0
    9    B   2  2015  10.0
    10   B   2  2016  10.5
    11   B   2  2017  11.0
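An alternative that keeps `expanding()` is sketched below. This is not the method from the answer above. It assumes the window may run over every row of a (Typ, ID) pair in Year order, and that the last row of each Year then carries the mean of all observations up to that Year. The helper name `cumulative_mean_by_year` is hypothetical.

```python
import pandas as pd

rows = [
    ["A", 1, 2015, 4], ["A", 1, 2016, 5], ["A", 1, 2017, 6],
    ["B", 1, 2015, 10], ["B", 1, 2016, 11], ["B", 1, 2017, 12],
    ["A", 1, 2015, 24], ["A", 1, 2016, 25], ["A", 1, 2017, 26],
    ["B", 1, 2015, 30], ["B", 1, 2016, 31], ["B", 1, 2017, 32],
    ["A", 2, 2015, 4], ["A", 2, 2016, 5], ["A", 2, 2017, 6],
    ["B", 2, 2015, 10], ["B", 2, 2016, 11], ["B", 2, 2017, 12],
]
df = pd.DataFrame(rows, columns=["Typ", "ID", "Year", "dat"])


def cumulative_mean_by_year(frame):
    """Hypothetical helper: mean of all 'dat' values up to and including each Year."""
    ordered = frame.sort_values(["Typ", "ID", "Year"])
    # Running mean over every row of a (Typ, ID) pair, in Year order.
    running = ordered.groupby(["Typ", "ID"])["dat"].expanding().mean()
    # Drop the group-key levels that groupby prepended, restoring the row index.
    running = running.reset_index(level=["Typ", "ID"], drop=True)
    # The last row of each Year has seen every observation up to that Year.
    return (
        ordered.assign(dat=running)
        .groupby(["Typ", "ID", "Year"], as_index=False)["dat"]
        .last()
    )


result = cumulative_mean_by_year(df)
print(result)
```

On the sample data this should give the same twelve rows as the cumsum approach. The cumsum version touches fewer rows after the first aggregation, so it is probably the better choice for large frames.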