My data contains several sub categories coded in the column "RID", I'm filling by the mean of each sub category. The code I've been using is very slow. Looking for a better method that gets rid of the for loop.
filled = mergedf.copy()
for c,v in enumerate(mergedf.RID.unique()):
filled.loc[filled.RID == v, :] = filled.loc[filled.RID == v, :].fillna(filled.loc[filled.RID == v, :].mean())
filled.info()
I've been trying the following to speed it up as someone suggested groupby, but I can't get the merges to work properly.
pts_mean = mergedf.groupby("RID").mean()
fill2 = merge.combine_first(pts_mean)
fill3 = pd.merge(mergedf, pts_mean, on="RID", how="left")
I've experimented with how = "inner" as well as how = "outer"
looking at my test data, before:
print(mergedf.loc[mergedf.RID==2,"FDG"])
0 1.36926
1 1.21655
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
after the slow method (this is the desired result, I just don't want it to take so long)
print(filled.loc[filled.RID==2,"FDG"])
0 1.369260
1 1.216550
2 1.292905
3 1.292905
4 1.292905
5 1.292905
6 1.292905
7 1.292905
8 1.292905
9 1.292905
10 1.292905
11 1.292905
12 1.292905
after the combine_first method
print(fill2.loc[fill2.RID==2,"FDG"])
0 1.369260
1 1.216550
2 1.292905
3 1.074235
4 NaN
5 1.319690
6 NaN
7 NaN
8 1.264300
9 NaN
10 1.042469
11 NaN
12 NaN
after the pd.merge
print(fill3.loc[fill3.RID==2,["FDG_x","FDG_y"]])
FDG_x FDG_y
0 1.36926 1.292905
1 1.21655 1.292905
2 NaN 1.292905
3 NaN 1.292905
4 NaN 1.292905
5 NaN 1.292905
6 NaN 1.292905
7 NaN 1.292905
8 NaN 1.292905
9 NaN 1.292905
10 NaN 1.292905
11 NaN 1.292905
12 NaN 1.292905
Let's try the following, using groupby
with transform
:
filled['FDG'].fillna(filled.groupby('RID')['FDG'].transform('mean'))
or
fill4 = filled.fillna(filled.groupby('RID').transform('mean'))