Search code examples
pythonpandasdataframe

Pandas: How to fill NaN within a group, only if a certain column contains NaN


I am trying to use Pandas to transform a df from this:

Account Value OtherColumn Another Column
0 A 1.0 1.0 1.0
1 A NaN NaN NaN
2 A NaN NaN NaN
3 A 6.0 NaN NaN
4 A NaN NaN NaN
5 B 3.0 2.0 NaN
6 B NaN NaN NaN
7 B 4.0 NaN NaN
8 B NaN NaN NaN
9 C NaN NaN NaN
10 C 5.0 5.0 NaN

To:

Account Value OtherColumn Another Column
0 A 1.0 1.0 1.0
1 A 1.0 1.0 1.0
2 A 1.0 1.0 1.0
3 A 6.0 NaN NaN
4 A 6.0 NaN NaN
5 B 3.0 2.0 NaN
6 B 3.0 2.0 NaN
7 B 4.0 NaN NaN
8 B 4.0 NaN NaN
9 C 5.0 5.0 NaN
10 C 5.0 5.0 NaN

What I would like to do is, within a group, is to fill missing data though a combination of ffill() and bfill(). However, I have certain requirements:

If 'Value' is not NaN, then I don't want any columns in that row filled.

If 'Value' is NaN it should be filled with the first row above in which 'Value' is not NaN, within the group 'Account'. Then the process should be repeated by filling backward.

What I tried was this:

data = {'Account': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
        'Value': [1.0, np.nan, np.nan, 6.0, np.nan, 3.0, np.nan, 4.0, np.nan, np.nan, 5.0],
        'OtherColumn': [1.0, np.nan, np.nan, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan, 5.0],
        'Another Column': [1.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}

df = pd.DataFrame(data)
df_filled = df.groupby('Account').ffill().bfill()
print(df_filled)

However, this returns a dataframe that looks like:

Value OtherColumn Another Column
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
3 6.0 1.0 1.0
4 6.0 1.0 1.0
5 3.0 2.0 NaN
6 3.0 2.0 NaN
7 4.0 2.0 NaN
8 4.0 2.0 NaN
9 5.0 5.0 NaN
10 5.0 5.0 NaN

As you can see, on rows 3 and 4, the OtherColumn and Another Column have been filled with 1.0. I want these rows to remain as NaN. Same with rows 7 and 8 which are filled with the value of 2.0.

I cannot use ffill(1) because the number of rows that need to be forward (or back) filled varies)


Solution

  • Here I'm doing ffill and bfill on Value column first because it looks like your business logic is to ffill and bfill OtherColumn and Another Column based on group of Account and Value or at least I think it should be based on what you have described.

    Once we do that, then we can group the dataframe based on Account and Value and ffill and bfill the group. I'm using here apply to do the ffill and bfill since chaining the two methods .ffill().bfill() results in a behavior not expected (since result of ffill would be a series and we are applying bfill to a series instead of dataframe). See this link for more info

    import pandas as pd
    import numpy as np
    
    data = {'Account': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
            'Value': [1.0, np.nan, np.nan, 6.0, np.nan, 3.0, np.nan, 4.0, np.nan, np.nan, 5.0],
            'OtherColumn': [1.0, np.nan, np.nan, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan, 5.0],
            'Another Column': [1.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
    
    df = pd.DataFrame(data)
    df['Value'] = df.groupby('Account')['Value'].ffill().bfill()
    df_filled = df.groupby(['Account', 'Value']).apply(lambda x: x.ffill().bfill())
    print(df_filled)
    

    Output

                     Account  Value  OtherColumn  Another Column
    Account Value                                               
    A       1.0   0        A    1.0          1.0             1.0
                  1        A    1.0          1.0             1.0
                  2        A    1.0          1.0             1.0
            6.0   3        A    6.0          NaN             NaN
                  4        A    6.0          NaN             NaN
    B       3.0   5        B    3.0          2.0             NaN
                  6        B    3.0          2.0             NaN
            4.0   7        B    4.0          NaN             NaN
                  8        B    4.0          NaN             NaN
    C       5.0   9        C    5.0          5.0             NaN
                  10       C    5.0          5.0             NaN
    

    EDIT: You can drop level to get the dataframe without multilevel index

    df_filled = df_filled.droplevel(['Account', 'Value'])
    

    Output

       Account  Value  OtherColumn  Another Column
    0        A    1.0          1.0             1.0
    1        A    1.0          1.0             1.0
    2        A    1.0          1.0             1.0
    3        A    6.0          NaN             NaN
    4        A    6.0          NaN             NaN
    5        B    3.0          2.0             NaN
    6        B    3.0          2.0             NaN
    7        B    4.0          NaN             NaN
    8        B    4.0          NaN             NaN
    9        C    5.0          5.0             NaN
    10       C    5.0          5.0             NaN