Search code examples
pythonpandasdataframenumpyfillna

Fill missing values based on the condition of other columns


I have this large dataframe, illustrated below is for simplicity purposes.

pd.DataFrame(df.groupby(['Pclass', 'Sex'])['Age'].median())

Groupby results: enter image description here


And it have this data that needs to be imputed

Missing Data: enter image description here

How can I impute these values based on the median of the grouped statistic

The result that I want is:

enter image description here

# You can use this for reference

import numpy as np
import pandas as pd
  
mldx_arrays = [np.array([1, 1,
                         2, 2,
                         3, 3]),
                 
               np.array(['male', 'female', 
                         'male', 'female',
                         'male', 'female'])]
  
multiindex_df = pd.DataFrame(
  [34,29,24,40,18,25], index=mldx_arrays,
  columns=['Age'])
  
multiindex_df.index.names = ['PClass', 'Sex']
multiindex_df


d = {'PClass': [1, 1, 2, 2, 3, 3], 
     'Sex': ['male', 'female', 'male', 'female', 'male', 'female'], 
     'Age': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)


Solution

  • If all values are missing remove Age column and use DataFrame.join:

    df = df.drop('Age', axis=1).join(multiindex_df, on=['PClass','Sex'])
    print (df)
       PClass     Sex  Age
    0       1    male   34
    1       1  female   29
    2       2    male   24
    3       2  female   40
    4       3    male   18
    5       3  female   25
    

    If need replace only missing values use DataFrame.join and replace missing values in original column:

    df = df.join(multiindex_df, on=['PClass','Sex'], rsuffix='_')
    df['Age'] = df['Age'].fillna(df.pop('Age_'))
    print (df)
       PClass     Sex   Age
    0       1    male  34.0
    1       1  female  29.0
    2       2    male  24.0
    3       2  female  40.0
    4       3    male  18.0
    5       3  female  25.0
    

    If need replace missing values by median per groups use GroupBy.transform:

    df['Age'] = df['Age'].fillna(df.groupby(['PClass', 'Sex'])['Age'].transform('median'))