Search code examples
pandasdataframeimputation

Pandas: Imputing values per groups if there is sufficient adjacent data


I have customer data for individual periods. For some customers for some periods there are missing values. I would like to impute these values according to the following rule:

If the values for the preceding 2 periods and the following 2 periods exist, replace the missing value with the average of these 4 values. Also, I would like to set an indicator flag that this value has been imputed.

Here is the starting point:

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2,2],
                 'period' : [1,2,3,4,5,6,1,2,3,4,5,6],
                 'volume' : [1,2,NaN,4,5,6,7,NaN,9,Nan,11,Nan]})

In the dataframe above for customer1 there is a missing value for period 3. According to the above rule it can be filled with the average of the 2 values before it and the two values that follow it.

For customer2 this rule won't change anything - there are 3 missing values in his case but there are not enough values on either side of each NaN value to compute the average.

So the end result will be:

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2,2],
                 'period' : [1,2,3,4,5,6,1,2,3,4,5,6],
                 'volume' : [1,2,6,4,5,6,7,NaN,9,Nan,11,Nan],
                 'imputed' : [0,0,1,0,0,0,0,0,0,0,0,0]})

How to do that?


Solution

  • The following function does what is needed:

    def impute_means (df, feat, min_per):
        imp_feat = f'imputed_{feat}'
        imputed = df.groupby('cust_id')[feat].rolling(5, min_periods = min_per, center = True).mean().droplevel(0)
        df[imp_feat] = (imputed.notna() & df[feat].isna()).astype(int)
        df[feat] = df[feat].fillna(imputed)
        print(f'Imputed {df[imp_feat].sum()} values for {feat}')
    

    The min_periods parameter sets the minimal number of non null values within the adjacent window. The windows is set to 5 and is centered around the current row.