Search code examples
pandasnanseriesfillna

Fill Consecutive NaNs in Pandas Series


I want to fill missing values in my pandas series, if there are less than 3 consecutive NANs.

Original series with missing values:

s=pd.Series(pd.np.random.randn(20))
s[[1,3,5,7,12,13,14,15, 18]]=pd.np.nan

Gives:

0     0.444025
1          NaN
2     0.631753
3          NaN
4    -0.577121
5          NaN
6     1.299953
7          NaN
8    -0.252173
9     0.287641
10    0.941953
11   -1.624728
12         NaN
13         NaN
14         NaN
15         NaN
16    0.998952
17    0.195698
18         NaN
19   -0.788995

BUT, using pandas.fillna() with a limit only fills the # of values specified (not number of CONSECUTIVE NANs, as expected):

s.fillna(value=0, limit=3) #Fails to fill values at position 7 and forward

Desired output would fill NANs with 0 at positions 1,3,5,7, and 18. It would leave series of 4 NaNs in place in position 12-15.

The documentation and other posts on SO have not resolved this issue (e.g. here). Documentation seems to imply that this limit will work on consecutive NANs, not the overall # in entire dataset that will be filled. Thanks!


Solution

  • We start with finding where the nan values are via pd.Series.notna.

    As we use cumsum, whenever we encounter a non-null value, we increment the cumulative sum this generating convenient groups for contiguous nan values.

    However, for all but the first group (and maybe the first group) we begin with a non-null value. So, I take the negation of mask and sum the total number of null values within each group.

    Now I fillna and use pd.DataFrame.where to mask the spots where the sum of nan values was too much.

    mask = s.notna()
    c_na = (~mask).groupby(mask.cumsum()).transform('sum')
    filled = s.fillna(0).where(c_na.le(3))
    s.fillna(filled)
    
    0     1.418895
    1     0.000000
    2    -0.553732
    3     0.000000
    4    -0.101532
    5     0.000000
    6    -1.334803
    7     0.000000
    8     1.159115
    9     0.309093
    10   -0.047970
    11    0.051567
    12         NaN
    13         NaN
    14         NaN
    15         NaN
    16    0.623673
    17   -0.786857
    18    0.000000
    19    0.310688
    dtype: float64
    

    Here is a fancy Numpy/Pandas way using np.bincount and pd.factorize

    v = s.values
    m = np.isnan(v)
    f, u = pd.factorize((~m).cumsum())
    filled = np.where(
        ~m, v,
        np.where(np.bincount(f, weights=mask)[f] <= 3, 0, np.nan)
    )
    
    pd.Series(filled, s.index)
    
    0     1.418895
    1     0.000000
    2    -0.553732
    3     0.000000
    4    -0.101532
    5     0.000000
    6    -1.334803
    7     0.000000
    8     1.159115
    9     0.309093
    10   -0.047970
    11    0.051567
    12         NaN
    13         NaN
    14         NaN
    15         NaN
    16    0.623673
    17   -0.786857
    18    0.000000
    19    0.310688
    dtype: float64