Search code examples
pythonpandastime-seriesaggregation

Pandas: Average value for the past n days


I have a Pandas data frame like this:

test = pd.DataFrame({ 'Date' : ['2016-04-01','2016-04-01','2016-04-02',
                             '2016-04-02','2016-04-03','2016-04-04',
                             '2016-04-05','2016-04-06','2016-04-06'],
                      'User' : ['Mike','John','Mike','John','Mike','Mike',
                             'Mike','Mike','John'],
                      'Value' : [1,2,1,3,4.5,1,2,3,6]
                })

As you can see below, the data set does not have observations for every day necessarily:

         Date  User  Value
0  2016-04-01  Mike    1.0
1  2016-04-01  John    2.0
2  2016-04-02  Mike    1.0
3  2016-04-02  John    3.0
4  2016-04-03  Mike    4.5
5  2016-04-04  Mike    1.0
6  2016-04-05  Mike    2.0
7  2016-04-06  Mike    3.0
8  2016-04-06  John    6.0

I'd like to add a new column which shows the average value for each user for the past n days (in this case n = 2) if at least one day is available, else it would have nan value. For example, on 2016-04-06 John gets a nan because he has no data for 2016-04-05 and 2016-04-04. So the result will be something like this:

         Date  User  Value  Value_Average_Past_2_days
0  2016-04-01  Mike    1.0                        NaN
1  2016-04-01  John    2.0                        NaN
2  2016-04-02  Mike    1.0                       1.00
3  2016-04-02  John    3.0                       2.00
4  2016-04-03  Mike    4.5                       1.00
5  2016-04-04  Mike    1.0                       2.75
6  2016-04-05  Mike    2.0                       2.75
7  2016-04-06  Mike    3.0                       1.50
8  2016-04-06  John    6.0                        NaN

It seems that I should a combination of group_by and customized rolling_mean after reading several posts in the forum, but I couldn't quite figure out how to do it.


Solution

  • I think you can use first convert column Date to_datetime, then find missing Days by groupby with resample and last apply rolling

    test['Date'] = pd.to_datetime(test['Date'])
    
    df = test.groupby('User').apply(lambda x: x.set_index('Date').resample('1D').first())
    print df
                     User  Value
    User Date                   
    John 2016-04-01  John    2.0
         2016-04-02  John    3.0
         2016-04-03   NaN    NaN
         2016-04-04   NaN    NaN
         2016-04-05   NaN    NaN
         2016-04-06  John    6.0
    Mike 2016-04-01  Mike    1.0
         2016-04-02  Mike    1.0
         2016-04-03  Mike    4.5
         2016-04-04  Mike    1.0
         2016-04-05  Mike    2.0
    
    df1 = df.groupby(level=0)['Value']
            .apply(lambda x: x.shift().rolling(min_periods=1,window=2).mean())
            .reset_index(name='Value_Average_Past_2_days')
    
    print df1
        User       Date  Value_Average_Past_2_days
    0   John 2016-04-01                        NaN
    1   John 2016-04-02                       2.00
    2   John 2016-04-03                       2.50
    3   John 2016-04-04                       3.00
    4   John 2016-04-05                        NaN
    5   John 2016-04-06                        NaN
    6   Mike 2016-04-01                        NaN
    7   Mike 2016-04-02                       1.00
    8   Mike 2016-04-03                       1.00
    9   Mike 2016-04-04                       2.75
    10  Mike 2016-04-05                       2.75
    11  Mike 2016-04-06                       1.50
    
    print pd.merge(test, df1, on=['Date', 'User'], how='left')
            Date  User  Value  Value_Average_Past_2_days
    0 2016-04-01  Mike    1.0                        NaN
    1 2016-04-01  John    2.0                        NaN
    2 2016-04-02  Mike    1.0                       1.00
    3 2016-04-02  John    3.0                       2.00
    4 2016-04-03  Mike    4.5                       1.00
    5 2016-04-04  Mike    1.0                       2.75
    6 2016-04-05  Mike    2.0                       2.75
    7 2016-04-06  Mike    3.0                       1.50
    8 2016-04-06  John    6.0                        NaN