Tags: python, pandas, group-by, offset, rolling-computation

Pandas monthly rolling window


I want to compute a 'monthly' rolling window on daily data grouped by a category. The code below does not work as is; it raises the following error:

ValueError: <DateOffset: months=1> is a non-fixed frequency

I know that I could use a '30D' offset, but then the window start would drift relative to the calendar date over time.

I'm looking for the sum over a window that spans from the x-th day of a month to that same x-th day J months later. E.g. with J=1: 4th of July to 4th of August, 5th of July to 5th of August, 6th of July to 6th of August, etc.
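
To make the difference between the two window definitions concrete, here is a minimal sketch. The end date is picked for illustration only and is not part of the data below; `pd.Timedelta` and `pd.DateOffset` are the standard pandas objects for fixed and calendar-aware offsets.

```python
import pandas as pd

# Illustrative window end: the 4th of August
end = pd.Timestamp("2014-08-04")

# Fixed 30-day offset: the window start does not fall on the 4th of July
start_30d = end - pd.Timedelta(days=30)

# Calendar-aware offset: the window start is the same day of the previous month
start_month = end - pd.DateOffset(months=1)

print(start_30d.date())    # 2014-07-05
print(start_month.date())  # 2014-07-04
```

The calendar-aware start is what I am after, but `rolling` rejects `pd.DateOffset(months=...)` because its length varies from month to month.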

I've been trying to figure this out for a few days now. Any suggestions or tips would be much appreciated. Happy New Year.

MRE:

import pandas as pd
from io import StringIO

data = StringIO(
"""\
date          logret       category
2014-03-25    -0.01           A
2014-04-05    -0.02           A
2014-04-15    -0.03           A
2014-04-25    0.01            B
2014-05-05    0.03            B
2014-05-15    -0.01           A
2014-05-25    0.04            B
"""
)

df = pd.read_csv(data,sep=r"\s+",parse_dates=True,index_col="date")

J=1

df.groupby(['category'])['logret'].rolling(pd.DateOffset(months=J),min_periods=J*20).sum() 

Solution

  • As an intermediate step, 'normalize' your timestamps so that every month has 31 days, then aggregate, and finally drop the inserted rows from your result.

    That works as long as your aggregation has a neutral element (such as 0 for a sum), because the inserted rows are filled with that value.

    1. create an index from the original df with all timestamps as strings
    2. create another index of strings representing timestamps where every month has 31 days
    3. join the original data onto the padded index, fill the gaps with the neutral element, and aggregate
    4. select from the aggregation by the index derived from the original df
    5. add the result as a new column to the original df

    Full example:

    import pandas as pd
    from io import StringIO
    
    data = StringIO(
    """\
    date          logret       category
    2014-03-25    -0.01           A
    2014-04-05    -0.02           A
    2014-04-15    -0.03           A
    2014-04-25    0.01            B
    2014-05-05    0.03            B
    2014-05-15    -0.01           A
    2014-05-25    0.04            B
    """
    )
    
    df = pd.read_csv(data,sep=r"\s+",parse_dates=True,index_col="date")
    idx = df.index.strftime('%Y-%m-%d')
    
    y0 = df.index[0].year
    y1 = df.index[-1].year
    
    padded = pd.DataFrame(index=[f'{y}-{m:02}-{d:02}' 
                                 for y in range(y0,y1+1) 
                                 for m in range(1, 13)
                                 for d in range(1, 32)])[idx[0]:idx[-1]]
    
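    # Note that the rolling interval is exclusive at start
    # Roll over the numeric column only; the padded rows contribute 0 to the sum
    rolled = padded.join(df.set_index(idx)['logret']).fillna(0).rolling(31).sum()
    # Assign by position, since `rolled` is indexed by strings and `df` by timestamps
    df.assign(rolling_aggregate=rolled.loc[idx, 'logret'].to_numpy())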
    

    yields:

                logret category  rolling_aggregate
    date                                          
    2014-03-25   -0.01        A                NaN
    2014-04-05   -0.02        A                NaN
    2014-04-15   -0.03        A                NaN
    2014-04-25    0.01        B              -0.04
    2014-05-05    0.03        B               0.01
    2014-05-15   -0.01        A               0.03
    2014-05-25    0.04        B               0.06
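
The result above rolls over all rows regardless of category. If the sum is needed per category, as in the question, one alternative is to compute each window explicitly with `pd.DateOffset`. The following is only a minimal sketch and not the padded-calendar method: the helper name `monthly_rolling_sum` is made up for illustration, the loop is quadratic in group size, and there is no `min_periods` handling, so incomplete windows return a partial sum instead of NaN.

```python
import pandas as pd
from io import StringIO

data = StringIO(
"""\
date          logret       category
2014-03-25    -0.01           A
2014-04-05    -0.02           A
2014-04-15    -0.03           A
2014-04-25    0.01            B
2014-05-05    0.03            B
2014-05-15    -0.01           A
2014-05-25    0.04            B
"""
)

df = pd.read_csv(data, sep=r"\s+", parse_dates=True, index_col="date")


def monthly_rolling_sum(s, months=1):
    """Hypothetical helper: sum of s over (t - months, t] for each timestamp t."""
    s = s.sort_index()
    sums = []
    for t in s.index:
        # Calendar-aware window start, exclusive like the rolling window above
        start = t - pd.DateOffset(months=months)
        in_window = (s.index > start) & (s.index <= t)
        sums.append(s[in_window].sum())
    return pd.Series(sums, index=s.index)


# Apply the helper to each category separately and stack the pieces
parts = {cat: monthly_rolling_sum(g) for cat, g in df.groupby("category")["logret"]}
result = pd.concat(parts, names=["category", "date"])
print(result)
```

For the sample data, the window ending on 2014-05-15 in category A contains only that day's value, because 2014-04-15 sits exactly on the excluded window start.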