I have the following DataFrame train:
train = {'NAME_EDUCATION_TYPE': {5: 'Secondary / secondary special',
6: 'Higher education',
7: 'Higher education',
8: 'Secondary / secondary special',
9: 'Secondary / secondary special',
10: 'Higher education',
11: 'Secondary / secondary special',
12: 'Secondary / secondary special',
13: 'Secondary / secondary special',
14: 'Secondary / secondary special'},
'OCCUPATION_TYPE': {5: 'Laborers',
6: 'Accountants',
7: 'Managers',
8: nan,
9: 'Laborers',
10: 'Core staff',
11: nan,
12: 'Laborers',
13: 'Drivers',
14: 'Laborers'},
'AGE_GROUP': {5: '45-60',
6: '21-45',
7: '45-60',
8: '45-60',
9: '21-45',
10: '21-45',
11: '45-60',
12: '21-45',
13: '21-45',
14: '21-45'},
'DAYS_EMPLOYED': {5: -1588.0,
6: -3130.0,
7: -449.0,
8: nan,
9: -2019.0,
10: -679.0,
11: nan,
12: -2717.0,
13: -3028.0,
14: -203.0},
'DAYS_EMPLOYED_ANOM': {5: False,
6: False,
7: False,
8: True,
9: False,
10: False,
11: True,
12: False,
13: False,
14: False},
'DAYS_LAST_PHONE_CHANGE': {5: -2536.0,
6: -1562.0,
7: -1070.0,
8: 0.0,
9: -1673.0,
10: -844.0,
11: -2396.0,
12: -2370.0,
13: -4.0,
14: -188.0}}
I have a few NaN in the column DAYS_EMPLOYED. They are flagged as "True" in the column DAYS_EMPLOYED_ANOM. I want to impute these NaN using the median of DAYS_EMPLOYED by the following group of columns : NAME_EDUCATION_TYPE, OCCUPATION_TYPE and AGE_GROUP
I believe this can be done in a few lines in pandas but I could not figure it out. I have tried to apply the following code that I found in an example of mean imputation for a Series, but the NaN values do not get imputed.
fill_median = lambda g: g.fillna(g.median())
train.loc[train['DAYS_EMPLOYED_ANOM'] == True,'DAYS_EMPLOYED'] = train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'])['DAYS_EMPLOYED'].apply(fill_median)`
I also tried to apply the code from this post without success: How can I impute values to outlier cells based on groups?
You could do:
train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
dropna=False)
['DAYS_EMPLOYED']
.apply(lambda x: x.fillna(x.median()))
)
However, note that this will not work on your particular dataset as you need to have at least one non NaN value per group to be able to calculate the median.
You could use the population median instead:
train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
dropna=False)
['DAYS_EMPLOYED']
.apply(lambda x: x.fillna(train['DAYS_EMPLOYED'].median()))
)
Here is an hybrid approach to try calculating the group median and otherwise fall back to population one:
def median(s):
m = s.median()
if np.isnan(m):
m = train['DAYS_EMPLOYED'].median()
return m
train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
dropna=False
)
['DAYS_EMPLOYED'].apply(lambda x: x.fillna(median(s)))
)