Search code examples
pythonpandaswindowing

Pandas dataframe groupby multiple years rolling stat


I have a pandas dataframe for which I'm trying to compute an expanding windowed aggregation after grouping by columns. The data structure is something like this:

df = pd.DataFrame([['A',1,2015,4],['A',1,2016,5],['A',1,2017,6],['B',1,2015,10],['B',1,2016,11],['B',1,2017,12],
               ['A',1,2015,24],['A',1,2016,25],['A',1,2017,26],['B',1,2015,30],['B',1,2016,31],['B',1,2017,32],
              ['A',2,2015,4],['A',2,2016,5],['A',2,2017,6],['B',2,2015,10],['B',2,2016,11],['B',2,2017,12]],columns=['Typ','ID','Year','dat'])\
.sort_values(by=['Typ','ID','Year'])

i.e.

    Typ ID  Year    dat
0   A   1   2015    4
6   A   1   2015    24
1   A   1   2016    5
7   A   1   2016    25
2   A   1   2017    6
8   A   1   2017    26
12  A   2   2015    4
13  A   2   2016    5
14  A   2   2017    6
3   B   1   2015    10
9   B   1   2015    30
4   B   1   2016    11
10  B   1   2016    31
5   B   1   2017    12
11  B   1   2017    32
15  B   2   2015    10
16  B   2   2016    11
17  B   2   2017    12

In general, there is a completely varying number of years per Type-ID and rows per Type-ID-Year. I need to group this dataframe by the columns Type and ID, then compute an expanding windowed median & std of all observations by Year. I would like to get output results like this:

    Typ ID  Year    median  std
0   A   1   2015    14.0    14.14
1   A   1   2016    14.5    11.56
2   A   1   2017    15.0    10.99
3   A   2   2015    4.0     0
4   A   2   2016    4.5     0
5   A   2   2017    5.0     0
6   B   1   2015    20.0    14.14
7   B   1   2016    20.5    11.56
8   B   1   2017    21.0    10.99
9   B   2   2015    10.0    0
10  B   2   2016    10.5    0
11  B   2   2017    11.0    0

Hence, I want something like a groupby by ['Type','ID','Year'], with the median & std for each Type-ID-Year computed for all data with the same Type-ID and cumulative inclusive that Year.

How can I do this without manual iteration?


Solution

  • There's been no activity on this question, so I'll post the solution I found.

    mn = df.groupby(by=['Typ','ID']).dat.expanding().median().reset_index().set_index('level_2')
    mylast = lambda x: x.iloc[-1]
    mn = mn.join(df['Year'])
    mn = mn.groupby(by=['Typ','ID','Year']).agg(mylast).reset_index()
    

    My solution follows this algorithm:

    1. group the data, compute the windowed median, and get the original index back
    2. with the original index back, get the year back from the original dataframe
    3. group by the grouping columns, taking the last (in order) value for each

    This gives the output desired. The same process can be followed for the standard deviation (or any other statistic desired).