Search code examples
pythonpandasdataframedata-cleaning

Filling NaN cells based on a value in the same row


I've asked a similar question to this and it got no reply, so I thought I'd take a different approach and see if anyone knows how to do this;

First I'll tell you my goal and what I already know:

I am currently cleaning a dataset and need to backward fill the dataset to get rid of some of the NaN values.

From the image below

Data

I would like to backward fill the Na columns of the same X column value, and fill the Na cell with a Y value that has a row value of 1

This image shows what outcome I would like

Data

I already know I can use

 df.loc[df['Y'] == 1] = df.loc[:,].bfill(limit=1)

to get it to only fill cells that are matching with a Y value row of 1 (hence the bottom Na cell is not filled).

Here is my question: Using the code above, it fills the middle Na because the Y value to the left is 1, this is fine for the top cell because the source cell and Na cell both have the X value of 1, although for the middle Na there is an X value of 2 and 3. So, is there a way to fill cells that share the same X value down the row? (the X values need to be the same between the source and the Na, if not, nothing happens.)

Thanks!


Solution

  • We can try with loc + groupby bfill:

    df.loc[df['Y'] == 1, 'Z'] = df.groupby('X')['Z'].bfill()
    

    groupby will ensure that each group of X values is treated independently, bfill will backfill per group. df['Y'] == 1 ensures that only rows with Y value of 1 will be updated.

    df:

       X  Y    Z
    0  1  1  2.0
    1  1  2  2.0
    2  2  1  NaN
    3  3  1  3.0
    4  3  2  NaN
    5  4  1  4.0
    

    Initial Frame:

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({'X': [1, 1, 2, 3, 3, 4],
                       'Y': [1, 2, 1, 1, 2, 1],
                       'Z': [np.nan, 2, np.nan, 3, np.nan, 4]})
    

    df:

       X  Y    Z
    0  1  1  NaN
    1  1  2  2.0
    2  2  1  NaN
    3  3  1  3.0
    4  3  2  NaN
    5  4  1  4.0
    

    Edit to bfill all columns except X and Y use:

    df.loc[df['Y'] == 1, df.columns.difference(['X', 'Y'])] = df.groupby('X').bfill()