Search code examples
pythonpandassplit-apply-combine

Replacing dataframe values by median value of group


Apologies if this is a repeat, I didn't find a similar answer.

Big picture: I have a df with NaN values which I would like to replace with an imputed median value for that column. However, the built-in imputers in sklearn that I found use the median (or whatever metric) from the entire column. My data has labels and I would like to replace each NaN value with the median value for that column from other samples belonging to that label only.

I can do this by splitting the df into one df for each label, imputing over each of those dfs, and combining, but this logic doesn't scale well. I could have up to 20 classes, and I fundamentally don't believe this is the 'right' way to do it.

I would like to do this without copying my df, by using a groupby object in a split-apply-combine technique (or another technique you think would work). I appreciate your help.

Example df:

   r1  r2   r3   label
0  12  NaN  58       0
1  34  52   24       1
2  32   4  NaN       1
3   7  89    2       0
4  22  19   12       1

Here, I would like the NaN value at (0, r2) to equal the median of that column for label 0, which is the value 89 (from 3, r2).

I would like the NaN value at (2,r3) to equal the median of that column for label 1, which is median(24, 12), or 18.

Example successful result:

   r1  r2   r3   label
0  12   89  58       0
1  34  52   24       1
2  32   4   18       1
3   7  89    2       0
4  22  19   12       1

Solution

  • In [158]: df.groupby('label', group_keys=False) \
                .apply(lambda x: x.fillna(x.median()).astype(int))
    Out[158]:
       r1  r2  r3  label
    0  12  89  58      0
    3   7  89   2      0
    1  34  52  24      1
    2  32   4  18      1
    4  22  19  12      1
    

    or using transform:

    In [149]: df[['label']].join(df.groupby('label')
                                   .transform(lambda x: x.fillna(x.median())).astype(int))
    Out[149]:
       label  r1  r2  r3
    0      0  12  89  58
    1      1  34  52  24
    2      1  32   4  18
    3      0   7  89   2
    4      1  22  19  12