Search code examples
pandasdataframepandas-groupbyfillna

speed up a Pandas fillna by subcatagory mean (how to replace a for loop)


My data contains several sub categories coded in the column "RID", I'm filling by the mean of each sub category. The code I've been using is very slow. Looking for a better method that gets rid of the for loop.

filled = mergedf.copy()
for c,v in enumerate(mergedf.RID.unique()):
    filled.loc[filled.RID == v, :] = filled.loc[filled.RID == v, :].fillna(filled.loc[filled.RID == v, :].mean())

filled.info()

I've been trying the following to speed it up as someone suggested groupby, but I can't get the merges to work properly.

pts_mean = mergedf.groupby("RID").mean()
fill2 = merge.combine_first(pts_mean)

fill3 = pd.merge(mergedf, pts_mean, on="RID", how="left")

I've experimented with how = "inner" as well as how = "outer"

looking at my test data, before:

print(mergedf.loc[mergedf.RID==2,"FDG"])
0     1.36926
1     1.21655
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8         NaN
9         NaN
10        NaN
11        NaN
12        NaN

after the slow method (this is the desired result, I just don't want it to take so long)

print(filled.loc[filled.RID==2,"FDG"])
0     1.369260
1     1.216550
2     1.292905
3     1.292905
4     1.292905
5     1.292905
6     1.292905
7     1.292905
8     1.292905
9     1.292905
10    1.292905
11    1.292905
12    1.292905

after the combine_first method

print(fill2.loc[fill2.RID==2,"FDG"])
0     1.369260
1     1.216550
2     1.292905
3     1.074235
4          NaN
5     1.319690
6          NaN
7          NaN
8     1.264300
9          NaN
10    1.042469
11         NaN
12         NaN

after the pd.merge

print(fill3.loc[fill3.RID==2,["FDG_x","FDG_y"]])
      FDG_x     FDG_y
0   1.36926  1.292905
1   1.21655  1.292905
2       NaN  1.292905
3       NaN  1.292905
4       NaN  1.292905
5       NaN  1.292905
6       NaN  1.292905
7       NaN  1.292905
8       NaN  1.292905
9       NaN  1.292905
10      NaN  1.292905
11      NaN  1.292905
12      NaN  1.292905

Solution

  • Let's try the following, using groupby with transform:

    filled['FDG'].fillna(filled.groupby('RID')['FDG'].transform('mean'))
    

    or

    fill4 = filled.fillna(filled.groupby('RID').transform('mean'))