I have a DataFrame whose index is a series of increasing dates and whose columns are observations of variables. The data is sparse. My goal is to propagate each known value forward in time to fill NaN, but I want to stop at the last non-NaN value in each column, since that last value signifies the "death" of the variable.
e.g. for the dataset
| | a | b | c |
|---|---|---|---|
| 2020-01-01 | NaN | 11 | NaN |
| 2020-02-01 | 1 | NaN | NaN |
| 2020-03-01 | NaN | NaN | 14 |
| 2020-04-01 | 2 | NaN | NaN |
| 2020-05-01 | NaN | NaN | NaN |
| 2020-06-01 | NaN | NaN | 15 |
| 2020-07-01 | 3 | NaN | NaN |
| 2020-08-01 | NaN | NaN | NaN |
I want the output to be
| | a | b | c |
|---|---|---|---|
| 2020-01-01 | NaN | 11 | NaN |
| 2020-02-01 | 1 | NaN | NaN |
| 2020-03-01 | 1 | NaN | 14 |
| 2020-04-01 | 2 | NaN | 14 |
| 2020-05-01 | 2 | NaN | 14 |
| 2020-06-01 | 2 | NaN | 15 |
| 2020-07-01 | 3 | NaN | NaN |
| 2020-08-01 | NaN | NaN | NaN |
I can identify the index of the last observation in each column using df.notna()[::-1].idxmax(), but I can't figure out how to use this to limit the fillna function.
I'd be grateful for any suggestions. Many thanks
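Use DataFrame.where to forward fill through a mask. The mask comes from back filling: after a back fill, the only cells still missing are those after each column's last non-missing value, so those cells are kept as NaN and every other cell takes the forward-filled value: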
df = df.where(df.bfill().isna(), df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
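To make the one-liner above reproducible, here is a minimal sketch that rebuilds the sample frame from the question and splits the expression into its two steps. The names `idx`, `after_last` and `result` are only illustrative, and the sketch assumes the index is a monthly DatetimeIndex, which the question does not state explicitly.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (assumed monthly DatetimeIndex).
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
df = pd.DataFrame(
    {
        "a": [np.nan, 1, np.nan, 2, np.nan, np.nan, 3, np.nan],
        "b": [11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
        "c": [np.nan, np.nan, 14, np.nan, np.nan, 15, np.nan, np.nan],
    },
    index=idx,
)

# True only on rows after a column's last observation: a back fill
# finds no later value there, so the cell stays NaN.
after_last = df.bfill().isna()

# Keep the original (NaN) cells after the last observation and take
# the forward-filled value everywhere else.
result = df.where(after_last, df.ffill())
print(result)
```

Leading NaNs (before a column's first observation) are also left alone: the mask is False there, but the forward fill has nothing to carry into those rows yet.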
Your own approach can also be used: convert the Series of last-observation labels to a NumPy array and compare it with the index using broadcasting:
mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
df = df.where(mask, df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
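The broadcasting step is easier to follow when the shapes are spelled out. The sketch below is an illustration only: it rebuilds the sample frame (assuming a monthly DatetimeIndex) and uses the made-up names `last_seen` and `rows` for the two arrays being compared. The `<` comparison relies on the index being sorted in increasing order, as described in the question.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (assumed monthly DatetimeIndex).
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
df = pd.DataFrame(
    {
        "a": [np.nan, 1, np.nan, 2, np.nan, np.nan, 3, np.nan],
        "b": [11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
        "c": [np.nan, np.nan, 14, np.nan, np.nan, 15, np.nan, np.nan],
    },
    index=idx,
)

# Label of the last non-NaN row in each column, shape (n_columns,).
last_seen = df.notna()[::-1].idxmax().to_numpy()

# The index as a column vector, shape (n_rows, 1).
rows = df.index.to_numpy()[:, None]

# Broadcasting (n_columns,) against (n_rows, 1) gives an
# (n_rows, n_columns) boolean array: True after the last observation.
mask = last_seen < rows

# Where the mask is True keep the original NaN, otherwise forward fill.
result = df.where(mask, df.ffill())
print(mask.shape)
print(result)
```

On this sample the result matches the one from the bfill-based mask; the difference is only in how the "after the last observation" mask is built.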