
How to pad/ffill NaN values in a Pandas dataframe using "inside" as limit_area?


I have a Pandas dataframe containing several NaNs like the following one

import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, np.nan, np.nan, 2, 3, np.nan], [np.nan, np.nan, 0, 0, np.nan, 2], [np.nan, 4, np.nan, np.nan, 5, np.nan]]).transpose())
print(df)
       0      1      2
0    1.0    NaN    NaN
1    NaN    NaN    4.0
2    NaN    0.0    NaN
3    2.0    0.0    NaN
4    3.0    NaN    5.0
5    NaN    2.0    NaN

Now, I would like to fill the NaN values subject to two restrictions:

  1. Only fill the NaNs surrounded by valid values (= don't replace leading or trailing NaNs)
  2. Use method "pad" (= ffill) to replace each NaN with the preceding valid number in that column

Desired solution:

       0      1      2
0    1.0    NaN    NaN
1    1.0    NaN    4.0
2    1.0    0.0    4.0
3    2.0    0.0    4.0
4    3.0    0.0    5.0
5    NaN    2.0    NaN

I found the Pandas interpolate function, which sounded quite promising, but unfortunately I can only satisfy one of the two restrictions at a time.

When I use

df_padded = df.interpolate(method='pad')

the right values are used (the preceding number in the respective column), but the trailing NaNs at the end of columns 0 and 2 are also replaced, which I do not want.

       0      1      2
0    1.0    NaN    NaN
1    1.0    NaN    4.0
2    1.0    0.0    4.0
3    2.0    0.0    4.0
4    3.0    0.0    5.0
5    3.0    2.0    5.0

When I use

df_padded = df.interpolate(limit_area='inside')

it replaces the right NaNs, but it interpolates linearly, which is also not what I want.

         0      1        2
0    1.000    NaN      NaN
1    1.333    NaN    4.000
2    1.667    0.0    4.333
3    2.000    0.0    4.667
4    3.000    1.0    5.000
5      NaN    2.0      NaN

So I thought that using both parameters in the function call would give me the right output

df_padded = df.interpolate(method='pad', limit_area='inside')

but instead nothing happens at all.

       0      1      2
0    1.0    NaN    NaN
1    NaN    NaN    4.0
2    NaN    0.0    NaN
3    2.0    0.0    NaN
4    3.0    NaN    5.0
5    NaN    2.0    NaN

What am I doing wrong?


Solution

  • Ideally the interpolate solution should have worked; I believe this might be a bug. Nevertheless, here is an alternative approach that you can use:

    s1, s2 = df.ffill(), df.bfill()
    result = s1.mask(s1.isna() | s2.isna())
    

    Result

         0    1    2
    0  1.0  NaN  NaN
    1  1.0  NaN  4.0
    2  1.0  0.0  4.0
    3  2.0  0.0  4.0
    4  3.0  0.0  5.0
    5  NaN  2.0  NaN
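
To spell out why the mask works: a forward fill leaves only the leading NaNs of each column unfilled, and a backward fill leaves only the trailing ones. A cell therefore lies "inside" exactly when both fills produce a value there. The following is a minimal, self-contained sketch of that idea on the question's data; the helper name `ffill_inside` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

# Same data as in the question, written column by column.
df = pd.DataFrame({
    0: [1, np.nan, np.nan, 2, 3, np.nan],
    1: [np.nan, np.nan, 0, 0, np.nan, 2],
    2: [np.nan, 4, np.nan, np.nan, 5, np.nan],
})


def ffill_inside(frame):
    """Forward-fill NaNs that have a valid value both above and below.

    Hypothetical helper for illustration; not a pandas function.
    """
    forward = frame.ffill()    # still NaN only where a column starts with NaN
    backward = frame.bfill()   # still NaN only where a column ends with NaN
    # A cell is "inside" when both directions found a valid neighbour.
    inside = forward.notna() & backward.notna()
    # Keep the forward-filled value inside, put NaN back everywhere else.
    return forward.where(inside)


result = ffill_inside(df)
print(result)
```

As far as I know, pandas 2.2 and later also accept `limit_area` directly on `ffill`, so `df.ffill(limit_area='inside')` may do the same in a single call; treat that as an assumption to check against your installed version.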