Apologies if this is a repeat, I didn't find a similar answer.
Big picture: I have a df with NaN values which I would like to replace with an imputed median value for that column. However, the built-in imputers in sklearn that I found use the median (or whatever metric) from the entire column. My data has labels and I would like to replace each NaN value with the median value for that column from other samples belonging to that label only.
I can do this by splitting the df into one df for each label, imputing over each of those dfs, and combining, but this logic doesn't scale well. I could have up to 20 classes, and I fundamentally don't believe this is the 'right' way to do it.
I would like to do this without copying my df, by using a groupby object in a split-apply-combine technique (or another technique you think would work). I appreciate your help.
Example df:
r1 r2 r3 label
0 12 NaN 58 0
1 34 52 24 1
2 32 4 NaN 1
3 7 89 2 0
4 22 19 12 1
Here, I would like the NaN value at (0, r2) to equal the median of that column for label 0, which is the value 89 (from 3, r2).
I would like the NaN value at (2,r3) to equal the median of that column for label 1, which is median(24, 12), or 18.
Example successful result:
r1 r2 r3 label
0 12 89 58 0
1 34 52 24 1
2 32 4 18 1
3 7 89 2 0
4 22 19 12 1
In [158]: df.groupby('label', group_keys=False) \
.apply(lambda x: x.fillna(x.median()).astype(int))
Out[158]:
r1 r2 r3 label
0 12 89 58 0
3 7 89 2 0
1 34 52 24 1
2 32 4 18 1
4 22 19 12 1
or using transform
:
In [149]: df[['label']].join(df.groupby('label')
.transform(lambda x: x.fillna(x.median())).astype(int))
Out[149]:
label r1 r2 r3
0 0 12 89 58
1 1 34 52 24
2 1 32 4 18
3 0 7 89 2
4 1 22 19 12