Search code examples
pythonpandascumsumrolling-computation

Apply cumxxx (sum, min...) to a window of varying size in a DataFrame


I would like to apply a cumxxx operation on a window of varying size in a DataFrame. Considering following inputs:

import pandas as pd
from random import seed, randint
from collections import OrderedDict

p5h = pd.period_range(start='2020-02-01 00:00', end='2020-02-04 00:00', freq='5h', name='p5h')
p1h = pd.period_range(start='2020-02-01 00:00', end='2020-02-04 00:00', freq='1h', name='p1h')

seed(1)
values = [randint(0,10) for p in p1h]
df = pd.DataFrame({'Values' : values}, index=p1h)

p5h_st_as_series = p5h.start_time.to_series()

df['OpeneningPeriod'] = df.apply(
              lambda x: p5h.to_series().loc[p5h_st_as_series.index <=
                                            x.name.start_time].index[-1],
                                 axis=1)

Result

df.head(20)
                  Values   OpeneningPeriod
p1h                                       
2020-02-01 00:00       2  2020-02-01 00:00
2020-02-01 01:00       9  2020-02-01 00:00
2020-02-01 02:00       1  2020-02-01 00:00
2020-02-01 03:00       4  2020-02-01 00:00
2020-02-01 04:00       1  2020-02-01 00:00
2020-02-01 05:00       7  2020-02-01 05:00
2020-02-01 06:00       7  2020-02-01 05:00
2020-02-01 07:00       7  2020-02-01 05:00
2020-02-01 08:00      10  2020-02-01 05:00
2020-02-01 09:00       6  2020-02-01 05:00
2020-02-01 10:00       3  2020-02-01 10:00
2020-02-01 11:00       1  2020-02-01 10:00
2020-02-01 12:00       7  2020-02-01 10:00
2020-02-01 13:00       0  2020-02-01 10:00
2020-02-01 14:00       6  2020-02-01 10:00
2020-02-01 15:00       6  2020-02-01 15:00
2020-02-01 16:00       9  2020-02-01 15:00
2020-02-01 17:00       0  2020-02-01 15:00
2020-02-01 18:00       7  2020-02-01 15:00
2020-02-01 19:00       4  2020-02-01 15:00

Here, the cumxxx is to be applied on the 5-hour period defined. It can be of varying length because windows can be a day period (some with DST), or a month period (not constant number of hours in a month).

Result I am looking for is:

df_result.head(11)
                  Values   OpeneningPeriod   Cumsum
p1h                                       
2020-02-01 00:00       2  2020-02-01 00:00        2  <- cumsum starts with a new period
2020-02-01 01:00       9  2020-02-01 00:00       11
2020-02-01 02:00       1  2020-02-01 00:00       12
2020-02-01 03:00       4  2020-02-01 00:00       16
2020-02-01 04:00       1  2020-02-01 00:00       17
2020-02-01 05:00       7  2020-02-01 05:00        7  <- cumsum starts with a new period
2020-02-01 06:00       7  2020-02-01 05:00       14
2020-02-01 07:00       7  2020-02-01 05:00       21
2020-02-01 08:00      10  2020-02-01 05:00       31
2020-02-01 09:00       6  2020-02-01 05:00       37
2020-02-01 10:00       3  2020-02-01 10:00        3  <- cumsum starts with a new period

It would be the same logic with cummin & cummax. Does anyone has any idea?

Thanks for any help! Bests,


Solution

  • If need grouping by 5H window by DatetimeIndex use DataFrame.to_period with cumsum:

    df['Cumsum'] = df.resample('5H')['Values'].cumsum()
    

    Or Grouper:

    df['Cumsum'] = df.groupby(pd.Grouper(freq='5H'))['Values'].cumsum()
    

    print (df.head(11))
                      Values   OpeneningPeriod  Cumsum
    p1h                                               
    2020-02-01 00:00       2  2020-02-01 00:00       2
    2020-02-01 01:00       9  2020-02-01 00:00      11
    2020-02-01 02:00       1  2020-02-01 00:00      12
    2020-02-01 03:00       4  2020-02-01 00:00      16
    2020-02-01 04:00       1  2020-02-01 00:00      17
    2020-02-01 05:00       7  2020-02-01 05:00       7
    2020-02-01 06:00       7  2020-02-01 05:00      14
    2020-02-01 07:00       7  2020-02-01 05:00      21
    2020-02-01 08:00      10  2020-02-01 05:00      31
    2020-02-01 09:00       6  2020-02-01 05:00      37
    2020-02-01 10:00       3  2020-02-01 10:00       3