I have a large dataset for which I need to calculate statistics over a rolling time window of N days. There are multiple entries per day, and I need the statistics computed over all rows in the N-day timeframe, i.e. on a day-by-day basis at both ends of the window.
The pandas.DataFrame.rolling() method with a time offset almost works: looking backwards, the offset window includes all rows in the N-day timeframe, but looking forward the window advances row by row, i.e. it does not extend to the end of the current day.
I have written a loop to do this, but it runs quite slowly on my large dataset. Is there an efficient way to do this with rolling(), using some kind of indexer, or some other way?
The test statistics I need to calculate are count, mean, and std. My revised example:
# Create the test DataFrame:
import pandas as pd
l1=[1,2,3,4,5,6,7,8,9,10,11,12]
ts=[pd.Timestamp('2017-01-01'),
pd.Timestamp('2017-01-02'),
pd.Timestamp('2017-01-03'),
pd.Timestamp('2017-01-04'),
pd.Timestamp('2017-01-05'),
pd.Timestamp('2017-01-05'),
pd.Timestamp('2017-01-05'),
pd.Timestamp('2017-01-06'),
pd.Timestamp('2017-01-06'),
pd.Timestamp('2017-01-07'),
pd.Timestamp('2017-01-07'),
pd.Timestamp('2017-01-08')]
In [58]: df=pd.DataFrame({'t':ts, 'c':l1, 'm':l1, 's':l1}).set_index('t')
In [59]: df
Out[59]:
c m s
t
2017-01-01 1 1 1
2017-01-02 2 2 2
2017-01-03 3 3 3
2017-01-04 4 4 4
2017-01-05 5 5 5
2017-01-05 6 6 6
2017-01-05 7 7 7
2017-01-06 8 8 8
2017-01-06 9 9 9
2017-01-07 10 10 10
2017-01-07 11 11 11
2017-01-08 12 12 12
In [60]: df.rolling('3D').agg({'c':'count', 'm':'mean', 's':'std'})
Out[60]:
c m s
t
2017-01-01 1.0 1.0 NaN
2017-01-02 2.0 1.5 0.707107
2017-01-03 3.0 2.0 1.000000
2017-01-04 3.0 3.0 1.000000
2017-01-05 3.0 4.0 1.000000 #incorrect for day-end window boundary
2017-01-05 4.0 4.5 1.290994 #incorrect for day-end window boundary
2017-01-05 5.0 5.0 1.581139
2017-01-06 5.0 6.0 1.581139 #incorrect for day-end window boundary
2017-01-06 6.0 6.5 1.870829
2017-01-07 6.0 7.5 1.870829 #incorrect for day-end window boundary
2017-01-07 7.0 8.0 2.160247
2017-01-08 5.0 10.0 1.581139
But if each window is extended to include all rows of its last day, the result would be:
c m s
t
2017-01-01 1.0 1.0 NaN
2017-01-02 2.0 1.5 0.707107
2017-01-03 3.0 2.0 1.000000
2017-01-04 3.0 3.0 1.000000
2017-01-05 5.0 5.0 1.581139 #This is what it should be
2017-01-05 5.0 5.0 1.581139 #This is what it should be
2017-01-05 5.0 5.0 1.581139
2017-01-06 6.0 6.5 1.870829 #This is what it should be
2017-01-06 6.0 6.5 1.870829
2017-01-07 7.0 8.0 2.160247 #This is what it should be
2017-01-07 7.0 8.0 2.160247
2017-01-08 5.0 10.0 1.581139
Note that the last row of each day is already correct: it holds the value that all other rows of the same day should have.
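For reference, the per-day loop I am currently using is roughly the following (a sketch; the helper name is illustrative):

```python
import pandas as pd

# Sketch of the slow approach: for each calendar day, aggregate all
# rows whose timestamp falls in the N-day window ending at that
# day's end, then broadcast the result to every row of the day.
def slow_rolling_full_days(df, days=3):
    results = {}
    for day in df.index.normalize().unique():
        start = day - pd.Timedelta(days=days - 1)
        end = day + pd.Timedelta(days=1)
        window = df[(df.index >= start) & (df.index < end)]
        results[day] = window.agg({'c': 'count', 'm': 'mean', 's': 'std'})
    daily = pd.DataFrame(results).T
    # one aggregated row per day, repeated for each row of that day
    return daily.reindex(df.index.normalize()).set_index(df.index)
```

This produces the desired output above, but the Python-level loop over days is what makes it slow on a large dataset.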
If your data is always positive, you can take the rolling sum and then transform per day: within each day, the last row's window sum is complete, and for non-negative data it is also the day's maximum, so broadcasting the per-day max corrects the earlier rows. (The output below uses a single-column frame a, with a value of 1 per row, so the rolling sum acts as a count.)
# if your index is not always on the day, e.g. 2017-01-01 01:00:00
# use `pd.Grouper(freq='D')` instead of `level`
df.rolling('3D').sum().groupby(level='t').transform('max')
Output:
a
t
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 3.0
2017-01-05 5.0
2017-01-05 5.0
2017-01-05 5.0
2017-01-06 6.0
2017-01-06 6.0
2017-01-07 7.0
2017-01-07 7.0
2017-01-08 5.0
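To make this reproducible, here is a self-contained version (an illustrative sketch: it builds a single indicator column a of ones, so the rolling sum acts as a row count, matching the output shown):

```python
import pandas as pd

# Single-column frame: one row per timestamp, a = 1 everywhere,
# so the 3-day rolling sum counts the rows in each window.
ts = pd.to_datetime(
    ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
     '2017-01-05', '2017-01-05', '2017-01-05', '2017-01-06',
     '2017-01-06', '2017-01-07', '2017-01-07', '2017-01-08'])
df = pd.DataFrame({'a': 1}, index=pd.Index(ts, name='t'))

# Within a day, the last row's window already covers the whole day;
# for non-negative data that running sum is also the day's maximum,
# so transform('max') broadcasts it back to every row of the day.
out = df.rolling('3D').sum().groupby(level='t').transform('max')
```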
Edit: In the general case, aggregate by the day and map back:
s = df.groupby(pd.Grouper(freq='D')).sum().rolling('3D').sum()
df.index.floor('D').to_series().map(s['a'])
Output:
t
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 3.0
2017-01-05 5.0
2017-01-05 5.0
2017-01-05 5.0
2017-01-06 6.0
2017-01-06 6.0
2017-01-07 7.0
2017-01-07 7.0
2017-01-08 5.0
Name: t, dtype: float64
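The same aggregate-and-map idea can be extended to the count/mean/std statistics from the question: roll daily sufficient statistics (count, sum, sum of squares) and reconstruct the moments from them. This is a sketch under that approach, not part of the original answer; names like stats and result are illustrative.

```python
import numpy as np
import pandas as pd

l1 = list(range(1, 13))
ts = pd.to_datetime(
    ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
     '2017-01-05', '2017-01-05', '2017-01-05', '2017-01-06',
     '2017-01-06', '2017-01-07', '2017-01-07', '2017-01-08'])
df = pd.DataFrame({'t': ts, 'c': l1, 'm': l1, 's': l1}).set_index('t')

# Daily sufficient statistics: count, sum, sum of squares.
daily = df.groupby(pd.Grouper(freq='D')).agg(
    n=('c', 'count'),
    total=('m', 'sum'),
    sq=('s', lambda x: (x ** 2).sum()))

# Rolling 3-day sums of the sufficient statistics.
r = daily.rolling('3D').sum()

# Recover count, mean, and sample std from the rolled sums.
stats = pd.DataFrame({
    'c': r['n'],
    'm': r['total'] / r['n'],
    's': np.sqrt((r['sq'] - r['total'] ** 2 / r['n']) / (r['n'] - 1)),
})

# Map each row's day back to its window statistics.
days = df.index.floor('D')
result = stats.reindex(days).set_index(df.index)
```

Since everything here is a vectorized groupby/rolling over one row per day, it avoids the slow per-day Python loop while reproducing the desired day-end-window output.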