I have this pandas dataframe: import pandas as pd import numpy as np
ds1 = {'col1':[1,1,1,1,1,1,1, 2,2,2,2,2,2,2], "col2" : [1,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN, np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,3]}
df1 = pd.DataFrame(data=ds1)
print(df1)
col1 col2
0 1 1.0
1 1 NaN
2 1 NaN
3 1 NaN
4 1 NaN
5 1 NaN
6 1 NaN
7 2 NaN
8 2 NaN
9 2 NaN
10 2 NaN
11 2 NaN
12 2 NaN
13 2 3.0
I need to fill the missing values for col2
with the non-missing value present in col1
, for the same value of col1
.
In this case, the resulting dataframe would look like this:
col1 col2
0 1 1.0
1 1 1.0
2 1 1.0
3 1 1.0
4 1 1.0
5 1 1.0
6 1 1.0
7 2 3.0
8 2 3.0
9 2 3.0
10 2 3.0
11 2 3.0
12 2 3.0
13 2 3.0
Does anyone know how to do it in Python?
Couple of ways of doing this. The simpler approach using groupby
and transform
works if you only have one non-null value per group and you want that as the fill value for all NaNs:
df1.groupby('col1')['col2'].transform('first')
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 1.0
7 3.0
8 3.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
Name: col2, dtype: float64
df1['col2'] = df1.groupby('col1')['col2'].transform('first')
Alternatively, groupby
and forwardfill/backfill per group.
(df1.groupby('col1', as_index=False, group_keys=False)['col2']
.apply(lambda c: c.ffill().bfill()))
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 1.0
7 3.0
8 3.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
Name: col2, dtype: float64
I recommend this option if you have data that looks like this: 1, NaN, NaN, 2, NaN, NaN, 3, NaN, NaN. Because these fill only backfill missing values based on the last non-null value.
The trailing bfill
call is useful if you have leading NaNs.