
Pandas Rolling Aggregations that take into account missing values


This is an extension of my previous question.

There I asked about combining groupby with rolling aggregations, and @mozway kindly provided an elegant solution. However, I forgot to mention that I need to account for gaps in the data.

This is monthly data for each customer across several years.

The snippet below contains the same data, except that for customer 2 the data for period '200102' (February 2001) is missing. In the original question I simply denoted the periods as 1, 2, 3, … and the data for all periods was present.

import pandas as pd

df = pd.DataFrame({'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'period': [200010, 200011, 200012, 200101, 200102, 200103, 200010, 200011, 200012, 200101, 200103],
                   'volume': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12],
                   'num_transactions': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})

The desired output would be:

out = pd.DataFrame({'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'period': [200010, 200011, 200012, 200101, 200102, 200103, 200010, 200011, 200012, 200101, 200103],
                    'max_vol_3': [None, None, 3, 4, 5, 6, None, None, 9, 10, None],
                    'max_vol_6': [None, None, None, None, None, 6, None, None, None, None, None],
                    'sum_trans_3': [None, None, 12, 15, 18, 21, None, None, 30, 33, None]})

How can I do that?

The solution in the linked post computes rolling aggregations by simply taking the previous rows into account, but unfortunately there are gaps in the data for some of the periods.

EDIT:

I realise that my wording was vague.

Essentially, I would like to apply a minimum threshold: if, say, 1 value is missing from the 3-month window, then the statistics for that 3-month period are set to None.

I would be enforcing that at least 3 months are used for 3-month summaries and at least 5 months for 6-month summaries.
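
For gap-free data, the built-in `min_periods` argument of a count-based rolling window would already enforce such a threshold. A minimal sketch on a hypothetical complete monthly series (the names here are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical gap-free monthly series: with no missing periods,
# min_periods alone enforces the "at least 3 of 3" rule.
s = pd.Series([1, 2, 3, 4],
              index=pd.period_range('2000-10', periods=4, freq='M'))

out = s.rolling(3, min_periods=3).max()
print(out)  # first two rows NaN, then 3.0 and 4.0
```

The difficulty in the question is precisely that with gaps, a count-based window of 3 rows may span more than 3 calendar months, so `min_periods` alone is not enough.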


Solution

  • I gave it a try; it's actually far from simple. You can't use monthly periods in rolling, as those are not of fixed length (29 to 31 days). So I cheated, using seconds as months instead.

    I wrapped everything in a function for ease of use:

    
    def cust_rolling(df, values_col, period_col, group_col, windows, functions, format='%Y%m', freq='M'):
        # we can't use monthly periods in rolling, fake it using seconds:
        # each month becomes an integer offset reinterpreted as a second
        p = pd.to_datetime(df[period_col], format=format).dt.to_period(freq)
        df['tmp_period'] = pd.to_timedelta(p.sub(p.min()).apply(lambda x: x.n), unit='s')
    
        # set up rolling on the fake-seconds index, per group
        r = (df.set_index('tmp_period')
               .groupby(group_col)[values_col]
               .rolling
             )
    
        # compute for each function and each window;
        # tail(1-x) drops the first x-1 rows of each group, so that after
        # the left merge, windows shorter than x months come back as NaN
        tmp = pd.concat({f'{f}_{values_col}_{x}': r(f'{x}s').agg(f).groupby(group_col).tail(1-x)
                         for f in functions for x in windows}, axis=1)
        # merge back to the original data
        return df.merge(tmp.reset_index(), how='left').drop(columns='tmp_period')
    
    out = (df
       .pipe(cust_rolling, 'volume', 'period', 'cust_id', [3, 6], ['max'])
       .pipe(cust_rolling, 'num_transactions', 'period', 'cust_id', [3], ['sum'])
    )
    

    Output:

        cust_id  period  volume  num_transactions  max_volume_3  max_volume_6  sum_num_transactions_3
    0         1  200010       1                 3           NaN           NaN                     NaN
    1         1  200011       2                 4           NaN           NaN                     NaN
    2         1  200012       3                 5           3.0           NaN                    12.0
    3         1  200101       4                 6           4.0           NaN                    15.0
    4         1  200102       5                 7           5.0           NaN                    18.0
    5         1  200103       6                 8           6.0           6.0                    21.0
    6         2  200010       7                 9           NaN           NaN                     NaN
    7         2  200011       8                10           NaN           NaN                     NaN
    8         2  200012       9                11           9.0           NaN                    30.0
    9         2  200101      10                12          10.0           NaN                    33.0
    10        2  200103      12                13          12.0           NaN                    25.0
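
    The months-as-seconds trick can be sketched in isolation. This hypothetical snippet (mirroring the conversion inside `cust_rolling`) shows how monthly periods become integer offsets reinterpreted as seconds, so that a `'3s'` window spans exactly three consecutive months while preserving the gap:

```python
import pandas as pd

# sample periods with February 2001 ('200102') missing
period = pd.Series(['200010', '200011', '200012', '200101', '200103'])

# convert to monthly periods, then to integer month offsets from the minimum
p = pd.to_datetime(period, format='%Y%m').dt.to_period('M')
offsets = p.sub(p.min()).apply(lambda x: x.n)  # 0, 1, 2, 3, 5 (note the gap)

# reinterpret the month offsets as seconds for time-based rolling
tmp = pd.to_timedelta(offsets, unit='s')
print(tmp)
```
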
    

    Forcing NaN when periods are missing:

    def cust_rolling(df, values_col, period_col, group_col, windows, functions, format='%Y%m', freq='M'):
        # we can't use monthly periods in rolling, fake it using seconds
        p = pd.to_datetime(df[period_col], format=format).dt.to_period(freq)
        df['tmp_period'] = pd.to_timedelta(p.sub(p.min()).apply(lambda x: x.n), unit='s')
    
        # set up rolling
        r = (df.set_index('tmp_period')
               .groupby(group_col)[values_col]
               .rolling
             )
        # functions are now callables, so use f.__name__ in the column label
        tmp = pd.concat({f'{f.__name__}_{values_col}_{x}': r(f'{x}s').agg(f).groupby(group_col).tail(1-x)
                         for f in functions for x in windows}, axis=1)
        return df.merge(tmp.reset_index(), how='left').drop(columns='tmp_period')
    
    import numpy as np
    
    def max2(s):
        # a gap larger than one fake second means a month is missing
        # from the window: invalidate the whole window
        if s.index.to_series().diff().gt('1s').any():
            return np.nan
        return s.max()
    
    def sum2(s):
        if s.index.to_series().diff().gt('1s').any():
            return np.nan
        return s.sum()
    
    out = (df
       .pipe(cust_rolling, 'volume', 'period', 'cust_id', [3, 6], [max2])
       .pipe(cust_rolling, 'num_transactions', 'period', 'cust_id', [3], [sum2])
    )
    

    Output:

        cust_id  period  volume  num_transactions  max2_volume_3  max2_volume_6  sum2_num_transactions_3
    0         1  200010       1                 3            NaN            NaN                      NaN
    1         1  200011       2                 4            NaN            NaN                      NaN
    2         1  200012       3                 5            3.0            NaN                     12.0
    3         1  200101       4                 6            4.0            NaN                     15.0
    4         1  200102       5                 7            5.0            NaN                     18.0
    5         1  200103       6                 8            6.0            6.0                     21.0
    6         2  200010       7                 9            NaN            NaN                      NaN
    7         2  200011       8                10            NaN            NaN                      NaN
    8         2  200012       9                11            9.0            NaN                     30.0
    9         2  200101      10                12           10.0            NaN                     33.0
    10        2  200103      12                13            NaN            NaN                      NaN
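
    The gap check inside `max2`/`sum2` can be sketched in isolation. This hypothetical two-row window corresponds to the last row for customer 2, where February 2001 is missing between the months mapped to offsets 3 and 5:

```python
import pandas as pd

# fake-seconds index for a rolling window covering months 200101 and 200103;
# the missing 200102 shows up as a 2-second jump instead of 1
idx = pd.to_timedelta([3, 5], unit='s')
window = pd.Series([10, 12], index=idx)

# consecutive complete months differ by exactly 1s, so any larger
# difference flags an incomplete window
has_gap = window.index.to_series().diff().gt('1s').any()
print(has_gap)  # True -> the aggregation returns NaN for this window
```
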