
Fill NaN values of column X with the median value of X for each categorical variable in another column Y


This was very difficult to phrase. But let me show you what I'm trying to accomplish.

df

Y  X
a  10
a  5
a  NaN
b  12
b  13
b  NaN
c  5
c  NaN
c  5
c  6


Y: 10 non-null object
X: 7 non-null float64
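
To reproduce this, here is a minimal sketch that builds the frame shown above. It assumes the missing values are NumPy NaNs (which pandas stores in a float column); the construction itself is not part of the original post.

```python
import numpy as np
import pandas as pd

# Sample data from the question: three categories in Y, three missing values in X.
df = pd.DataFrame({
    "Y": ["a", "a", "a", "b", "b", "b", "c", "c", "c", "c"],
    "X": [10, 5, np.nan, 12, 13, np.nan, 5, np.nan, 5, 6],
})

# Count the non-missing entries per column.
non_null_counts = df.notna().sum()
```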

Take category 'a' from column Y: the median of its non-missing X values is (10 + 5) / 2 = 7.5, so the missing value for 'a' must be filled with this median.

Similarly, for category 'b', the median of the non-missing X values is (12 + 13) / 2 = 12.5.

For category 'c', the median of the non-missing X values is 5 (the middle value of 5, 5, 6).

I used very long, repetitive code, as follows.

    grouped = df.groupby(['Y'])[['X']]
    grouped.agg([np.median])

                X
                median
    Y
    a           7.5
    b           12.5
    c           5

    df.X = df.X.fillna(-1)

    df.loc[(df['Y'] == 'a') & (df['X'] == -1), 'X'] = 7.5
    df.loc[(df['Y'] == 'b') & (df['X'] == -1), 'X'] = 12.5
    df.loc[(df['Y'] == 'c') & (df['X'] == -1), 'X'] = 5

I was told that there is not only repetition but also the use of magic numbers, which should be avoided.

I want to write a function that does this filling efficiently.


Solution

  • Use groupby and transform
    The transform looks like

    df.groupby('Y').X.transform('median')
    
    0     7.5
    1     7.5
    2     7.5
    3    12.5
    4    12.5
    5    12.5
    6     5.0
    7     5.0
    8     5.0
    9     5.0
    Name: X, dtype: float64
    

    The result has the same index as df, so it can be passed directly to fillna

    df.X.fillna(df.groupby('Y').X.transform('median'))
    
    0    10.0
    1     5.0
    2     7.5
    3    12.0
    4    13.0
    5    12.5
    6     5.0
    7     5.0
    8     5.0
    9     6.0
    Name: X, dtype: float64
    

    You can either make a new copy of the dataframe

    df.assign(X=df.X.fillna(df.groupby('Y').X.transform('median')))
    
       Y     X
    0  a  10.0
    1  a   5.0
    2  a   7.5
    3  b  12.0
    4  b  13.0
    5  b  12.5
    6  c   5.0
    7  c   5.0
    8  c   5.0
    9  c   6.0
    

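    Or assign the filled values back to the column (calling fillna(..., inplace=True) on df.X is a chained operation and does not reliably update df in recent pandas versions)

    df['X'] = df['X'].fillna(df.groupby('Y')['X'].transform('median'))
    df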
    
       Y     X
    0  a  10.0
    1  a   5.0
    2  a   7.5
    3  b  12.0
    4  b  13.0
    5  b  12.5
    6  c   5.0
    7  c   5.0
    8  c   5.0
    9  c   6.0
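
Since the question asks for a reusable function, the groupby/transform approach above could be wrapped as follows. This is only a sketch: the name fill_with_group_median and its parameters are hypothetical, not part of pandas, and the sample frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def fill_with_group_median(df, group_col, value_col):
    """Return a copy of df where NaNs in value_col are replaced by the
    median of value_col within each group_col category.

    Hypothetical helper for illustration; not a pandas API.
    """
    # transform('median') returns one value per row, aligned to df's index;
    # NaNs are skipped when the median of each group is computed.
    group_medians = df.groupby(group_col)[value_col].transform("median")
    # assign returns a new frame, so the input is left unmodified.
    return df.assign(**{value_col: df[value_col].fillna(group_medians)})


df = pd.DataFrame({
    "Y": ["a", "a", "a", "b", "b", "b", "c", "c", "c", "c"],
    "X": [10, 5, np.nan, 12, 13, np.nan, 5, np.nan, 5, 6],
})

filled = fill_with_group_median(df, "Y", "X")
```

This avoids both the repetition and the -1 placeholder. Note that a category whose X values are all missing has no median, so its rows would remain NaN.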