Search code examples
pythonpandasdataframeimputation

replace specific value in pandas dataframes using the mean between 10 previous and next values


Let's say I have the following dataframe

df.Consumption

0        16.208
1        11.193
2         9.845
3         9.348
4         9.091
          ...  
19611     0.000
19612     0.000
19613     0.000
19614     0.000
19615     0.000
Name: Consumption, Length: 19616, dtype: float64

I want to replace the 0 values with the mean of the 10 previous and next values that are not 0.00

What is a good way to do it? I was thinking about using the replace and interpolate methods but I can't see how to write it efficiently


Solution

  • You can use Series.rolling() with center=True together with Rolling.mean() to get the mean of previous and next values.

    Replace 0 by NaN if you want to exclude 0 from the mean calculation.

    Set center=True so that the rolling windows look for both previous and next entries.

    Finally, set those entries with value 0 with the mean values by using .loc, as follows:

    n = 10     # check previous and next 10 entries
    
    # rolling window size is (2n + 1)
    Consumption_mean = (df['Consumption'].replace(0, np.nan)
                                         .rolling(n * 2 + 1, min_periods=1, center=True)
                                         .mean())
    
    df.loc[df['Consumption'] == 0, 'Consumption'] = Consumption_mean
    

    Demo

    Using smaller window size n = 3 to demonstrate:

    df
    
    
        Consumption
    0        16.208
    1        11.193
    2         9.845
    3         9.348
    4         9.091
    5         8.010
    6         0.000              <====   target entry
    7         7.100
    8         0.000              <====   target entry
    9         6.800
    10        6.500
    11        6.300
    12        5.900
    13        5.800
    14        5.600
    
    #n = 10     # check previous and next 10 entries
    n = 3     # smaller window size for demo
    
    # rolling window size is (2n + 1)
    Consumption_mean = (df['Consumption'].replace(0, np.nan)
                                         .rolling(n * 2 + 1, min_periods=1, center=True)
                                         .mean())
    
    # Update into a new column `Consumption_New` for demo purpose
    df['Consumption_New'] = df['Consumption']    
    df.loc[df['Consumption'] == 0, 'Consumption_New'] = Consumption_mean
    
    

    Demo Result:

    print(df)
    
        Consumption  Consumption_New
    0        16.208          16.2080
    1        11.193          11.1930
    2         9.845           9.8450
    3         9.348           9.3480
    4         9.091           9.0910
    5         8.010           8.0100
    6         0.000           8.0698   # 8.0698 = (9.348 + 9.091 + 8.01 + 7.1 + 6.8) / 5 with skipping 0.000 between 7.100 and 6.800
    7         7.100           7.1000
    8         0.000           6.9420   # 6.942 = (8.01 + 7.1 + 6.8 + 6.5 + 6.3) / 5 with skipping 0.000 between 8.010 and 7.100
    9         6.800           6.8000
    10        6.500           6.5000
    11        6.300           6.3000
    12        5.900           5.9000
    13        5.800           5.8000
    14        5.600           5.6000