Search code examples
pythonpandasreplacemissing-data

Fill missing values with either previous or subsequent values by key


I have this pandas dataframe: import pandas as pd import numpy as np

ds1 = {'col1':[1,1,1,1,1,1,1, 2,2,2,2,2,2,2], "col2" : [1,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN, np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,3]}

df1 = pd.DataFrame(data=ds1)
print(df1)

    col1  col2
0      1   1.0
1      1   NaN
2      1   NaN
3      1   NaN
4      1   NaN
5      1   NaN
6      1   NaN
7      2   NaN
8      2   NaN
9      2   NaN
10     2   NaN
11     2   NaN
12     2   NaN
13     2   3.0

I need to fill the missing values for col2 with the non-missing value present in col1 , for the same value of col1.

In this case, the resulting dataframe would look like this:

    col1  col2
0      1   1.0
1      1   1.0
2      1   1.0
3      1   1.0
4      1   1.0
5      1   1.0
6      1   1.0
7      2   3.0
8      2   3.0
9      2   3.0
10     2   3.0
11     2   3.0
12     2   3.0
13     2   3.0

Does anyone know how to do it in Python?


Solution

  • Couple of ways of doing this. The simpler approach using groupby and transform works if you only have one non-null value per group and you want that as the fill value for all NaNs:

    df1.groupby('col1')['col2'].transform('first')
    
    0     1.0
    1     1.0
    2     1.0
    3     1.0
    4     1.0
    5     1.0
    6     1.0
    7     3.0
    8     3.0
    9     3.0
    10    3.0
    11    3.0
    12    3.0
    13    3.0
    Name: col2, dtype: float64
    
    df1['col2'] = df1.groupby('col1')['col2'].transform('first')
    

    Alternatively, groupby and forwardfill/backfill per group.

    (df1.groupby('col1', as_index=False, group_keys=False)['col2']
        .apply(lambda c: c.ffill().bfill()))
    
    0     1.0
    1     1.0
    2     1.0
    3     1.0
    4     1.0
    5     1.0
    6     1.0
    7     3.0
    8     3.0
    9     3.0
    10    3.0
    11    3.0
    12    3.0
    13    3.0
    Name: col2, dtype: float64
    

    I recommend this option if you have data that looks like this: 1, NaN, NaN, 2, NaN, NaN, 3, NaN, NaN. Because these fill only backfill missing values based on the last non-null value.

    The trailing bfill call is useful if you have leading NaNs.