Search code examples
pythonpandas

Dynamically replace NaN values with the average of previous and next non-missing values


I have a dataframe df with NaN values and I want to dynamically replace them with the average values of previous and next non-missing values.

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.

Therefore, the result dataframe should look like:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325  1.533582
4 -0.621108 -1.319834  0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858  1.033826 -2.551718
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

Is this a good way to deal with the missing values? I can't simply replace them by the average values of each column because my data is time-series and tends to increase over time. (The initial value may be $0 and final value might be $100000, so the average is $50000 which can be much bigger/smaller than the NaN values).


Solution

  • You can try to understand your logic behind the average that is Geometric progression

    s=df.isnull().cumsum()
    t1=df[(s==1).shift(-1).fillna(False)].stack().reset_index(level=0,drop=True)
    t2=df.lookup(s.idxmax()+1,s.idxmax().index)
    df.fillna(t1/(2**s)+t2*(1-0.5**s)*2/2)
    Out[212]: 
              A         B         C
    0 -0.166919  0.979728 -0.632955
    1 -0.297953 -0.912674 -1.365463
    2 -0.120211 -0.540679 -0.680481
    3 -0.454142 -2.027325  1.533582
    4 -0.621107 -1.319834  0.461821
    5 -0.788073 -0.966089 -1.260201
    6 -0.916080 -0.612343 -2.121213
    7 -0.887858  1.033826 -2.551718
    8  1.948430  1.025011 -2.982224
    9  0.019698 -0.795876 -0.046431
    

    Explanation:

    1st NaN x/2+y/2=1st

    2nd NaN 1st/2+y/2=2nd

    3rd NaN 2nd/2+y/2+3rd

    Then x/(2**n)+y(1-(1/2)**n)/(1-1/2), this is the key