
python pandas: stop fillna at the last non-NaN value


I have a dataframe whose index is a series of increasing dates and whose columns are observations of variables. The data is sparse. My goal is to propagate each known value forward in time to fill the NaNs, but I want to stop at the last non-NaN value of each column, as that last value signifies the "death" of the variable.

e.g. for the dataset

              a    b    c
2020-01-01  NaN   11  NaN
2020-02-01    1  NaN  NaN
2020-03-01  NaN  NaN   14
2020-04-01    2  NaN  NaN
2020-05-01  NaN  NaN  NaN
2020-06-01  NaN  NaN   15
2020-07-01    3  NaN  NaN
2020-08-01  NaN  NaN  NaN

I want to output

              a    b    c
2020-01-01  NaN   11  NaN
2020-02-01    1  NaN  NaN
2020-03-01    1  NaN   14
2020-04-01    2  NaN   14
2020-05-01    2  NaN   14
2020-06-01    2  NaN   15
2020-07-01    3  NaN  NaN
2020-08-01  NaN  NaN  NaN

I can identify the index of the last observation in each column using df.notna()[::-1].idxmax(), but I can't figure out how to use it to limit the fillna function.
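
For reference, a minimal sketch of what that expression returns. The toy frame below is hypothetical (smaller than the dataset above), and the DatetimeIndex is an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the full dataset from the question.
df = pd.DataFrame(
    {
        "a": [np.nan, 1, np.nan, 3, np.nan],
        "b": [11, np.nan, np.nan, np.nan, np.nan],
    },
    index=pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01"]
    ),
)

# Reversing the rows makes idxmax (the first True it meets) land on the
# last non-NaN row of each column.
last_obs = df.notna()[::-1].idxmax()
print(last_obs)
```

The result is a Series indexed by column name, holding the date of each column's last observation.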

I'd be grateful for any suggestions. Many thanks


Solution

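  • Use DataFrame.where with a mask built by back filling: df.bfill().isna() is True only for the NaNs that come after a column's last non-NaN value, so those cells are left as they are and every other cell takes its forward-filled value: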

    # keep cells with no later value (trailing NaNs), forward fill the rest
    df = df.where(df.bfill().isna(), df.ffill())
    print(df)
                  a     b     c
    2020-01-01  NaN  11.0   NaN
    2020-02-01  1.0   NaN   NaN
    2020-03-01  1.0   NaN  14.0
    2020-04-01  2.0   NaN  14.0
    2020-05-01  2.0   NaN  14.0
    2020-06-01  2.0   NaN  15.0
    2020-07-01  3.0   NaN   NaN
    2020-08-01  NaN   NaN   NaN
    

    Your idxmax approach works as well if you convert the resulting Series to a NumPy array and compare it against the index using broadcasting:

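    # last observed date per column (n_cols,) compared with every index value (n_rows, 1)
    mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
    # mask is True for rows after a column's last observation; those cells stay NaN
    df = df.where(mask, df.ffill())
    print(df)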
                  a     b     c
    2020-01-01  NaN  11.0   NaN
    2020-02-01  1.0   NaN   NaN
    2020-03-01  1.0   NaN  14.0
    2020-04-01  2.0   NaN  14.0
    2020-05-01  2.0   NaN  14.0
    2020-06-01  2.0   NaN  15.0
    2020-07-01  3.0   NaN   NaN
    2020-08-01  NaN   NaN   NaN
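
To try both variants end to end, here is a self-contained sketch that rebuilds the sample data and checks that the two masks agree. The DatetimeIndex built with pd.date_range and the variable names (after_last, filled, filled_idxmax) are assumptions made for illustration; the question only says the index holds increasing dates.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the monthly DatetimeIndex is an assumption.
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
df = pd.DataFrame(
    {
        "a": [np.nan, 1, np.nan, 2, np.nan, np.nan, 3, np.nan],
        "b": [11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
        "c": [np.nan, np.nan, 14, np.nan, np.nan, 15, np.nan, np.nan],
    },
    index=idx,
)

# Variant 1: after back filling, only the trailing NaNs of each column remain.
after_last = df.bfill().isna()
filled = df.where(after_last, df.ffill())

# Variant 2: compare each column's last observed date with every row date.
last_seen = df.notna()[::-1].idxmax().to_numpy()   # shape (n_cols,)
mask = last_seen < df.index.to_numpy()[:, None]    # shape (n_rows, n_cols)
filled_idxmax = df.where(mask, df.ffill())

print(filled)
print(filled.equals(filled_idxmax))
```

Both variants leave the leading NaNs untouched as well, because forward filling has nothing to propagate before a column's first observation. The second variant relies on the index being sorted in increasing order, as described in the question.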