This is an extension of my previous question.
There I asked about combining groupby with rolling aggregations, and @mozway was kind enough to provide an elegant solution. However, I forgot to mention that I need to account for gaps in the data.
This is monthly data for each customer across several years.
In the snippet below the data is the same, except that for customer 2 we don't have data for period '200102' (February 2001). In the original question I simply denoted the periods as 1, 2, 3, ... and the data for all periods was present.
import pandas as pd

df = pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                   'period': [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                   'volume': [1,2,3,4,5,6,7,8,9,10,12],
                   'num_transactions': [3,4,5,6,7,8,9,10,11,12,13]})
The output would be:
out = pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                    'period': [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                    'max_vol_3': [None,None,3,4,5,6,None,None,9,10,None],
                    'max_vol_6': [None,None,None,None,None,6,None,None,None,None,None],
                    'sum_trans_3': [None,None,12,15,18,21,None,None,30,33,None]})
How can I do that?
The solution in the linked post computes rolling aggregations by simply taking the previous rows into account; unfortunately, some periods are missing from the data.
EDIT:
I realise that my wording was vague.
Essentially, I would like to apply a minimum threshold: if, say, 1 value is missing from a 3-month window, then the statistic for that window is set to None.
I would be enforcing that at least 3 months go into the 3-month summaries and at least 5 months into the 6-month summaries.
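For reference, here is a minimal sketch of what I mean (not the approach from the linked answer): reindex each customer to a complete monthly range so that missing months become explicit NaN rows, then use a plain rolling with min_periods, which only counts non-NaN values. Note that I require the full window here, which is what the sample out above shows; per the stated threshold, the 6-month one could instead use min_periods=5.

```python
import pandas as pd

df = pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                   'period': [200010,200011,200012,200101,200102,200103,
                              200010,200011,200012,200101,200103],
                   'volume': [1,2,3,4,5,6,7,8,9,10,12],
                   'num_transactions': [3,4,5,6,7,8,9,10,11,12,13]})

# make the month a proper PeriodIndex so the gaps can be materialised
df['p'] = pd.to_datetime(df['period'], format='%Y%m').dt.to_period('M')

parts = []
for _, g in df.groupby('cust_id'):
    g = g.set_index('p')
    # insert the missing months as all-NaN rows
    g = g.reindex(pd.period_range(g.index.min(), g.index.max(), freq='M'))
    # min_periods counts non-NaN values, so a gap inside the window -> NaN
    g['max_vol_3'] = g['volume'].rolling(3, min_periods=3).max()
    g['max_vol_6'] = g['volume'].rolling(6, min_periods=6).max()
    g['sum_trans_3'] = g['num_transactions'].rolling(3, min_periods=3).sum()
    parts.append(g.dropna(subset=['volume']))  # drop the filler rows again

res = pd.concat(parts, ignore_index=True)
```

This reproduces the out DataFrame above, but I'd still like a solution that keeps the groupby/rolling shape of the original answer.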
I gave it a try; it's actually far from simple. You can't use monthly periods in rolling, as those are not fixed-width (28 to 31 days). So I cheated and used seconds as months instead.
I wrapped everything in a function for ease of use:
def cust_rolling(df, values_col, period_col, group_col, windows, functions, format='%Y%m', freq='M'):
    # we can't use monthly periods in rolling, fake it using seconds
    p = pd.to_datetime(df[period_col], format=format).dt.to_period(freq)
    df['tmp_period'] = pd.to_timedelta(p.sub(p.min()).apply(lambda x: x.n), unit='s')
    # set up rolling
    r = (df.set_index('tmp_period')
           .groupby(group_col)[values_col]
           .rolling
         )
    # compute for each function and each window
    # tail(1-x) drops the first x-1 rows of each group (incomplete windows)
    tmp = pd.concat({f'{f}_{values_col}_{x}': r(f'{x}s').agg(f).groupby(group_col).tail(1-x)
                     for f in functions for x in windows}, axis=1)
    # merge to original data
    return df.merge(tmp.reset_index(), how='left').drop(columns='tmp_period')
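As an aside, the tail(1-x) bit may look odd: for a window of x, tail(1-x) equals tail(-(x-1)), which keeps everything except the first x-1 rows of each group, i.e. it blanks out the positions that cannot yet have a full history. A toy illustration (made-up series, not part of the solution itself):

```python
import pandas as pd

# index plays the role of the group label
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8],
              index=[1, 1, 1, 1, 2, 2, 2, 2])

# for a window x=3, tail(1-3) == tail(-2): drop the first 2 rows per group
print(s.groupby(level=0).tail(-2).tolist())  # [3, 4, 7, 8]
```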
out = (df
       .pipe(cust_rolling, 'volume', 'period', 'cust_id', [3, 6], ['max'])
       .pipe(cust_rolling, 'num_transactions', 'period', 'cust_id', [3], ['sum'])
)
Output:
    cust_id  period  volume  num_transactions  max_volume_3  max_volume_6  sum_num_transactions_3
0         1  200010       1                 3           NaN           NaN                     NaN
1         1  200011       2                 4           NaN           NaN                     NaN
2         1  200012       3                 5           3.0           NaN                    12.0
3         1  200101       4                 6           4.0           NaN                    15.0
4         1  200102       5                 7           5.0           NaN                    18.0
5         1  200103       6                 8           6.0           6.0                    21.0
6         2  200010       7                 9           NaN           NaN                     NaN
7         2  200011       8                10           NaN           NaN                     NaN
8         2  200012       9                11           9.0           NaN                    30.0
9         2  200101      10                12          10.0           NaN                    33.0
10        2  200103      12                13          12.0           NaN                    25.0
However, this doesn't yet enforce the threshold: in row 10 above, sum_num_transactions_3 is 25.0 even though only two months (200101 and 200103) fall in the window. To fix that, pass custom aggregation functions that return NaN whenever the window's fake-seconds index has a jump, and name the columns after the function:

def cust_rolling(df, values_col, period_col, group_col, windows, functions, format='%Y%m', freq='M'):
    # we can't use monthly periods in rolling, fake it using seconds
    p = pd.to_datetime(df[period_col], format=format).dt.to_period(freq)
    df['tmp_period'] = pd.to_timedelta(p.sub(p.min()).apply(lambda x: x.n), unit='s')
    # set up rolling
    r = (df.set_index('tmp_period')
           .groupby(group_col)[values_col]
           .rolling
         )
    # compute for each function and each window
    tmp = pd.concat({f'{f.__name__}_{values_col}_{x}': r(f'{x}s').agg(f).groupby(group_col).tail(1-x)
                     for f in functions for x in windows}, axis=1)
    # merge to original data
    return df.merge(tmp.reset_index(), how='left').drop(columns='tmp_period')
import numpy as np

def max2(s):
    # a diff greater than 1 fake-second means a month is missing in the window
    if s.index.to_series().diff().gt('1s').any():
        return np.nan
    return s.max()

def sum2(s):
    if s.index.to_series().diff().gt('1s').any():
        return np.nan
    return s.sum()
out = (df
       .pipe(cust_rolling, 'volume', 'period', 'cust_id', [3, 6], [max2])
       .pipe(cust_rolling, 'num_transactions', 'period', 'cust_id', [3], [sum2])
)
Output:
    cust_id  period  volume  num_transactions  max2_volume_3  max2_volume_6  sum2_num_transactions_3
0         1  200010       1                 3            NaN            NaN                      NaN
1         1  200011       2                 4            NaN            NaN                      NaN
2         1  200012       3                 5            3.0            NaN                     12.0
3         1  200101       4                 6            4.0            NaN                     15.0
4         1  200102       5                 7            5.0            NaN                     18.0
5         1  200103       6                 8            6.0            6.0                     21.0
6         2  200010       7                 9            NaN            NaN                      NaN
7         2  200011       8                10            NaN            NaN                      NaN
8         2  200012       9                11            9.0            NaN                     30.0
9         2  200101      10                12           10.0            NaN                     33.0
10        2  200103      12                13            NaN            NaN                      NaN
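To make the gap check explicit: since consecutive months were mapped to consecutive fake seconds, any index difference greater than 1s inside a window means a month is missing. A small standalone demo of the exact test used in max2/sum2 (made-up values):

```python
import pandas as pd

# months mapped to seconds 2, 3, 5 -> the '4s' month is missing
s = pd.Series([9, 10, 12], index=pd.to_timedelta([2, 3, 5], unit='s'))

gap = s.index.to_series().diff().gt('1s').any()
print(gap)  # True -> the window spans a gap, so the stat becomes NaN
```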