I've asked a similar question to this and it got no reply, so I thought I'd take a different approach and see if anyone knows how to do this;
First I'll tell you my goal and what I already know:
I am currently cleaning a dataset and need to backward fill the dataset to get rid of some of the NaN
values.
From the image below
I would like to backward fill the Na columns of the same X column value, and fill the Na cell with a Y value that has a row value of 1
This image shows what outcome I would like
I already know I can use
df.loc[df['Y'] == 1] = df.loc[:,].bfill(limit=1)
to get it to only fill cells that are matching with a Y value row of 1 (hence the bottom Na cell is not filled).
Here is my question: Using the code above, it fills the middle Na because the Y value to the left is 1, this is fine for the top cell because the source cell and Na cell both have the X value of 1, although for the middle Na there is an X value of 2 and 3. So, is there a way to fill cells that share the same X value down the row? (the X values need to be the same between the source and the Na, if not, nothing happens.)
Thanks!
We can try with loc
+ groupby bfill
:
df.loc[df['Y'] == 1, 'Z'] = df.groupby('X')['Z'].bfill()
groupby
will ensure that each group of X
values is treated independently, bfill
will backfill per group. df['Y'] == 1
ensures that only rows with Y
value of 1 will be updated.
df
:
X Y Z
0 1 1 2.0
1 1 2 2.0
2 2 1 NaN
3 3 1 3.0
4 3 2 NaN
5 4 1 4.0
Initial Frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': [1, 1, 2, 3, 3, 4],
'Y': [1, 2, 1, 1, 2, 1],
'Z': [np.nan, 2, np.nan, 3, np.nan, 4]})
df
:
X Y Z
0 1 1 NaN
1 1 2 2.0
2 2 1 NaN
3 3 1 3.0
4 3 2 NaN
5 4 1 4.0
Edit to bfill all columns except X and Y use:
df.loc[df['Y'] == 1, df.columns.difference(['X', 'Y'])] = df.groupby('X').bfill()