
Pandas rolling averages for dates after group by


df = pd.DataFrame(
    {"date": [pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-03"), pd.Timestamp("2022-01-03"), pd.Timestamp("2022-01-03"), pd.Timestamp("2022-01-05")],
    "numbers": [1,2,4,4,11,7,5],
    "grouper": [1, 0, 1, 0,1, 0, 0]
    }
)

Given the df above, how can I get the rolling mean of the numbers values that fall before each row's date? E.g. the rolling average over the past 3 days, grouped by ["grouper", "date"].

I know I can do something like this, but it is not even close to a solution; I am looking to build on it:

df["av"] = df["numbers"].shift(1).rolling(window=3).mean()

but this shifts by a fixed number of rows, not by date, so it still picks up values from the current day (and ignores the grouping).
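
For example, on the sample df this row-based attempt produces the following (mixing dates and groupers in every window):

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2022-01-01"] * 3 + ["2022-01-03"] * 3 + ["2022-01-05"]),
     "numbers": [1, 2, 4, 4, 11, 7, 5],
     "grouper": [1, 0, 1, 0, 1, 0, 0]}
)

# each window is just the previous 3 rows, regardless of date or grouper
print(df["numbers"].shift(1).rolling(window=3).mean().round(3).tolist())
# [nan, nan, nan, 2.333, 3.333, 6.333, 7.333]
```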

My expected output for the new av column, for a 3-day window grouped by the two columns of the sample df, would be:

    date    numbers grouper av
0   2022-01-01  1   1   NaN
1   2022-01-01  2   0   NaN
2   2022-01-01  4   1   NaN
3   2022-01-03  4   0   2.0
4   2022-01-03  11  1   2.5
5   2022-01-03  7   0   2.0
6   2022-01-05  5   0   5.5

Solution

  • You need the average by definition: sum / count.

    df1 = (df.groupby(['date','grouper'])['numbers']
             .agg(['sum','size'])
             .unstack()
             .asfreq('d', fill_value=0)
             .rolling(window=3, min_periods=1)
             .sum()
             .shift()
             .stack()
             )
    
    df = df.join(df1['sum'].div(df1['size']).rename('aw'), on=['date','grouper'])
    print (df)
            date  numbers  grouper   aw
    0 2022-01-01        1        1  NaN
    1 2022-01-01        2        0  NaN
    2 2022-01-01        4        1  NaN
    3 2022-01-03        4        0  2.0
    4 2022-01-03       11        1  2.5
    5 2022-01-03        7        0  2.0
    6 2022-01-05        5        0  5.5
    

    Explanation:

    First aggregate the sum and the row count per date and grouper (GroupBy.agg with 'sum' and 'size'):

    print (df.groupby(['date','grouper'])['numbers']
                 .agg(['sum','size'])
                 )
                        sum  size
    date       grouper           
    2022-01-01 0          2     1
               1          5     2
    2022-01-03 0         11     2
               1         11     1
    2022-01-05 0          5     1
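
    Aggregating sum and size separately, rather than mean directly, matters: a rolling mean of per-day means would weight every day equally no matter how many rows it holds. A quick sketch with the grouper 1 values from the sample df:

```python
# grouper 1 has two rows on 2022-01-01 (values 1 and 4) and one on 2022-01-03 (11)
day1 = [1, 4]
day2 = [11]

# pooled mean over all values in the window - what the sum/size columns give
pooled = (sum(day1) + sum(day2)) / (len(day1) + len(day2))

# naive mean of the per-day means - weights both days equally
naive = (sum(day1) / len(day1) + sum(day2) / len(day2)) / 2

print(pooled)  # 5.333... (16 / 3)
print(naive)   # 6.75
```

    The pooled value 16 / 3 is exactly what the rolling sums produce for grouper 1 on 2022-01-04.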
    

    Then reshape by DataFrame.unstack, so the result has a DatetimeIndex:

    print (df.groupby(['date','grouper'])['numbers']
                 .agg(['sum','size'])
                 .unstack()
                 )
                 sum       size     
    grouper        0     1    0    1
    date                            
    2022-01-01   2.0   5.0  1.0  2.0
    2022-01-03  11.0  11.0  2.0  1.0
    2022-01-05   5.0   NaN  1.0  NaN
    

    Add the missing consecutive dates by DataFrame.asfreq:

    print (df.groupby(['date','grouper'])['numbers']
                 .agg(['sum','size'])
                 .unstack()
                 .asfreq('d', fill_value=0)
                 )
                 sum       size     
    grouper        0     1    0    1
    date                            
    2022-01-01   2.0   5.0  1.0  2.0
    2022-01-02   0.0   0.0  0.0  0.0
    2022-01-03  11.0  11.0  2.0  1.0
    2022-01-04   0.0   0.0  0.0  0.0
    2022-01-05   5.0   NaN  1.0  NaN
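
    The asfreq step is what makes window=3 mean "3 calendar days" rather than "3 observed dates"; a small sketch with the grouper 0 sums from the table above:

```python
import pandas as pd

# the grouper 0 sums, indexed by the sparse dates they occur on
s = pd.Series([2.0, 11.0, 5.0],
              index=pd.to_datetime(["2022-01-01", "2022-01-03", "2022-01-05"]))

# without asfreq, window=3 counts rows, so the whole 5-day span fits one window
print(s.rolling(window=3, min_periods=1).sum().tolist())
# [2.0, 13.0, 18.0]

# with asfreq('d', fill_value=0), each window really spans 3 calendar days
print(s.asfreq('d', fill_value=0).rolling(window=3, min_periods=1).sum().tolist())
# [2.0, 2.0, 13.0, 11.0, 16.0]
```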
    

    Then use rolling with sum (applied to both the sums and the counts):

    print (df.groupby(['date','grouper'])['numbers']
                 .agg(['sum','size'])
                 .unstack()
                 .asfreq('d', fill_value=0)
                 .rolling(window=3, min_periods=1)
                 .sum()
                 )
                 sum       size     
    grouper        0     1    0    1
    date                            
    2022-01-01   2.0   5.0  1.0  2.0
    2022-01-02   2.0   5.0  1.0  2.0
    2022-01-03  13.0  16.0  3.0  3.0
    2022-01-04  11.0  11.0  2.0  1.0
    2022-01-05  16.0  11.0  3.0  1.0
    

    Shift one day with DataFrame.shift, so the current day is excluded from each window:

    print (df.groupby(['date','grouper'])['numbers']
                 .agg(['sum','size'])
                 .unstack()
                 .asfreq('d', fill_value=0)
                 .rolling(window=3, min_periods=1)
                 .sum()
                 .shift()
                 )
                 sum       size     
    grouper        0     1    0    1
    date                            
    2022-01-01   NaN   NaN  NaN  NaN
    2022-01-02   2.0   5.0  1.0  2.0
    2022-01-03   2.0   5.0  1.0  2.0
    2022-01-04  13.0  16.0  3.0  3.0
    2022-01-05  11.0  11.0  2.0  1.0
    

    Reshape back by DataFrame.stack:

    print (df.groupby(['date','grouper'])['numbers']
                 .agg(['sum','size'])
                 .unstack()
                 .asfreq('d', fill_value=0)
                 .rolling(window=3, min_periods=1)
                 .sum()
                 .shift()
                 .stack()
                 )
                         sum  size
    date       grouper            
    2022-01-02 0         2.0   1.0
               1         5.0   2.0
    2022-01-03 0         2.0   1.0
               1         5.0   2.0
    2022-01-04 0        13.0   3.0
               1        16.0   3.0
    2022-01-05 0        11.0   2.0
               1        11.0   1.0
    

    For the averages, divide the sum column by the size column:

    print (df1['sum'].div(df1['size']).rename('aw'))
    date        grouper
    2022-01-02  0           2.000000
                1           2.500000
    2022-01-03  0           2.000000
                1           2.500000
    2022-01-04  0           4.333333
                1           5.333333
    2022-01-05  0           5.500000
                1          11.000000
    Name: aw, dtype: float64
    

    And join back to the original DataFrame:

    df = df.join(df1['sum'].div(df1['size']).rename('aw'), on=['date','grouper'])
    print (df)
            date  numbers  grouper   aw
    0 2022-01-01        1        1  NaN
    1 2022-01-01        2        0  NaN
    2 2022-01-01        4        1  NaN
    3 2022-01-03        4        0  2.0
    4 2022-01-03       11        1  2.5
    5 2022-01-03        7        0  2.0
    6 2022-01-05        5        0  5.5