Search code examples
pythonpandasmoving-average

Pandas - moving averages - use values of previous X entries for current row


So my dataset looks like this:

date,site,iso,id,hits
2017-08-25,google,1,7012,14225.0
2017-08-26,google,1,7012,14565.0
2017-08-27,google,1,7012,14580.0
2017-08-28,google,1,7012,14227.0
2017-08-29,google,1,7012,14568.0
2017-08-30,google,1,7012,14582.0
2017-08-31,google,1,7012,14214.0
2017-09-01,google,1,7012,14053.0
2017-08-25,facebook,2,7019,21225.0
2017-08-26,facebook,2,7019,21565.0
2017-08-27,facebook,2,7019,31580.0
2017-08-28,facebook,2,7019,13227.0
2017-08-29,facebook,2,7019,22568.0
2017-08-30,facebook,2,7019,44582.0
2017-08-31,facebook,2,7019,32214.0
2017-09-01,facebook,2,7019,44053.0

I need to find the 3 day moving average of hits using the previous 3 days of data. What I've done so far is this:

#sort values
df_sorted = df.sort_values(['site', 'iso', 'id', 'date'], ascending=[1, 1, 1, 1])
#group
df_grouped_sorted = df_sorted.groupby(['site', 'iso', 'id'], as_index=False)
df_sorted['mov_av_hits'] = df_grouped_sorted[['hits']].rolling(3, min_periods=3).mean().fillna(0).reset_index(
    0, drop=True)

And now my dataset looks like this:

date,site,iso,id,hits,hits_avg
2017-08-25,google,1,7012,14225.0,14146.0
2017-08-26,google,1,7012,14565.0,14338.6666667
2017-08-27,google,1,7012,14580.0,14456.6666667
2017-08-28,google,1,7012,14227.0,14457.3333333
2017-08-29,google,1,7012,14568.0,14458.3333333
2017-08-30,google,1,7012,14582.0,14459.0
2017-08-31,google,1,7012,14214.0,14454.6666667
2017-09-01,google,1,7012,14053.0,14283.0

Now the problem with this approach is that the current value of hits is included in the moving average.

Take for example, 2017-09-01. The value I need for hits_avg is (14568+14582+14214)/3 = 14454.6 but what I get is (14582+14214+14053)/3 = 14283.0

How can I get the moving average to take values from the previous 3 days?


Solution

  • You can try this ..

    df.hits.shift().rolling(3,min_periods=1).mean().fillna(df.hits)
    Out[692]: 
    0    14225.000000
    1    14225.000000
    2    14395.000000
    3    14456.666667
    4    14457.333333
    5    14458.333333
    6    14459.000000
    7    14454.666667
    Name: hits, dtype: float64
    

    Update

    df['new']=df.groupby('site').hits.apply(lambda x : x.shift().rolling(3,min_periods=1).mean().fillna(x))
    
    
    df
    Out[712]: 
              date      site  iso    id     hits           new
    0   2017-08-25    google    1  7012  14225.0  14225.000000
    1   2017-08-26    google    1  7012  14565.0  14225.000000
    2   2017-08-27    google    1  7012  14580.0  14395.000000
    3   2017-08-28    google    1  7012  14227.0  14456.666667
    4   2017-08-29    google    1  7012  14568.0  14457.333333
    5   2017-08-30    google    1  7012  14582.0  14458.333333
    6   2017-08-31    google    1  7012  14214.0  14459.000000
    7   2017-09-01    google    1  7012  14053.0  14454.666667
    8   2017-08-25  facebook    2  7019  21225.0  21225.000000
    9   2017-08-26  facebook    2  7019  21565.0  21225.000000
    10  2017-08-27  facebook    2  7019  31580.0  21395.000000
    11  2017-08-28  facebook    2  7019  13227.0  24790.000000
    12  2017-08-29  facebook    2  7019  22568.0  22124.000000
    13  2017-08-30  facebook    2  7019  44582.0  22458.333333
    14  2017-08-31  facebook    2  7019  32214.0  26792.333333
    15  2017-09-01  facebook    2  7019  44053.0  33121.333333