So my dataset looks like this:
date,site,iso,id,hits
2017-08-25,google,1,7012,14225.0
2017-08-26,google,1,7012,14565.0
2017-08-27,google,1,7012,14580.0
2017-08-28,google,1,7012,14227.0
2017-08-29,google,1,7012,14568.0
2017-08-30,google,1,7012,14582.0
2017-08-31,google,1,7012,14214.0
2017-09-01,google,1,7012,14053.0
2017-08-25,facebook,2,7019,21225.0
2017-08-26,facebook,2,7019,21565.0
2017-08-27,facebook,2,7019,31580.0
2017-08-28,facebook,2,7019,13227.0
2017-08-29,facebook,2,7019,22568.0
2017-08-30,facebook,2,7019,44582.0
2017-08-31,facebook,2,7019,32214.0
2017-09-01,facebook,2,7019,44053.0
I need to find the 3 day moving average of hits using the previous 3 days of data. What I've done so far is this:
#sort values
df_sorted = df.sort_values(['site', 'iso', 'id', 'date'], ascending=[1, 1, 1, 1])
#group
df_grouped_sorted = df_sorted.groupby(['site', 'iso', 'id'], as_index=False)
df_sorted['mov_av_hits'] = df_grouped_sorted[['hits']].rolling(3, min_periods=3).mean().fillna(0).reset_index(
0, drop=True)
And now my dataset looks like this:
date,site,iso,id,hits,hits_avg
2017-08-25,google,1,7012,14225.0,14146.0
2017-08-26,google,1,7012,14565.0,14338.6666667
2017-08-27,google,1,7012,14580.0,14456.6666667
2017-08-28,google,1,7012,14227.0,14457.3333333
2017-08-29,google,1,7012,14568.0,14458.3333333
2017-08-30,google,1,7012,14582.0,14459.0
2017-08-31,google,1,7012,14214.0,14454.6666667
2017-09-01,google,1,7012,14053.0,14283.0
Now the problem with this approach is that the current value of hits
is included in the moving average.
Take for example, 2017-09-01. The value I need for hits_avg
is (14568+14582+14214)/3 = 14454.6
but what I get is (14582+14214+14053)/3 = 14283.0
How can I get the moving average to take values from the previous 3 days?
You can try this ..
df.hits.shift().rolling(3,min_periods=1).mean().fillna(df.hits)
Out[692]:
0 14225.000000
1 14225.000000
2 14395.000000
3 14456.666667
4 14457.333333
5 14458.333333
6 14459.000000
7 14454.666667
Name: hits, dtype: float64
Update
df['new']=df.groupby('site').hits.apply(lambda x : x.shift().rolling(3,min_periods=1).mean().fillna(x))
df
Out[712]:
date site iso id hits new
0 2017-08-25 google 1 7012 14225.0 14225.000000
1 2017-08-26 google 1 7012 14565.0 14225.000000
2 2017-08-27 google 1 7012 14580.0 14395.000000
3 2017-08-28 google 1 7012 14227.0 14456.666667
4 2017-08-29 google 1 7012 14568.0 14457.333333
5 2017-08-30 google 1 7012 14582.0 14458.333333
6 2017-08-31 google 1 7012 14214.0 14459.000000
7 2017-09-01 google 1 7012 14053.0 14454.666667
8 2017-08-25 facebook 2 7019 21225.0 21225.000000
9 2017-08-26 facebook 2 7019 21565.0 21225.000000
10 2017-08-27 facebook 2 7019 31580.0 21395.000000
11 2017-08-28 facebook 2 7019 13227.0 24790.000000
12 2017-08-29 facebook 2 7019 22568.0 22124.000000
13 2017-08-30 facebook 2 7019 44582.0 22458.333333
14 2017-08-31 facebook 2 7019 32214.0 26792.333333
15 2017-09-01 facebook 2 7019 44053.0 33121.333333