I have customer data for individual periods. For some customers for some periods there are missing values. I would like to impute these values according to the following rule:
If the values for the preceding 2 periods and the following 2 periods exist, replace the missing value with the average of these 4 values. Also, I would like to set an indicator flag that this value has been imputed.
Here is the starting point:
df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2,2],
'period' : [1,2,3,4,5,6,1,2,3,4,5,6],
'volume' : [1,2,NaN,4,5,6,7,NaN,9,Nan,11,Nan]})
In the dataframe above for customer1 there is a missing value for period 3. According to the above rule it can be filled with the average of the 2 values before it and the two values that follow it.
For customer2 this rule won't change anything - there are 3 missing values in his case but there are not enough values on either side of each NaN value to compute the average.
So the end result will be:
df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2,2],
'period' : [1,2,3,4,5,6,1,2,3,4,5,6],
'volume' : [1,2,6,4,5,6,7,NaN,9,Nan,11,Nan],
'imputed' : [0,0,1,0,0,0,0,0,0,0,0,0]})
How to do that?
The following function does what is needed:
def impute_means (df, feat, min_per):
imp_feat = f'imputed_{feat}'
imputed = df.groupby('cust_id')[feat].rolling(5, min_periods = min_per, center = True).mean().droplevel(0)
df[imp_feat] = (imputed.notna() & df[feat].isna()).astype(int)
df[feat] = df[feat].fillna(imputed)
print(f'Imputed {df[imp_feat].sum()} values for {feat}')
The min_periods
parameter sets the minimal number of non null values within the adjacent window. The windows is set to 5 and is centered around the current row.