I have a categorical column "WALLSMATERIAL_MODE" containing NaN values that I want to impute with the mode of each group, grouping by ['NAME_EDUCATION_TYPE', 'AGE_GROUP']:
    NAME_EDUCATION_TYPE            AGE_GROUP  WALLSMATERIAL_MODE
20  Secondary / secondary special  45-60      Stone, brick
21  Secondary / secondary special  21-45      NaN
22  Secondary / secondary special  21-45      Panel
23  Secondary / secondary special  60-70      Mixed
24  Secondary / secondary special  21-45      Panel
25  Secondary / secondary special  45-60      Stone, brick
26  Secondary / secondary special  45-60      Wooden
27  Secondary / secondary special  21-45      NaN
28  Higher education               21-45      NaN
29  Higher education               21-45      Panel
Code for reproducibility:
import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME_EDUCATION_TYPE': {20: 'Secondary / secondary special',
                                           21: 'Secondary / secondary special',
                                           22: 'Secondary / secondary special',
                                           23: 'Secondary / secondary special',
                                           24: 'Secondary / secondary special',
                                           25: 'Secondary / secondary special',
                                           26: 'Secondary / secondary special',
                                           27: 'Secondary / secondary special',
                                           28: 'Higher education',
                                           29: 'Higher education'},
                   'AGE_GROUP': {20: '45-60',
                                 21: '21-45',
                                 22: '21-45',
                                 23: '60-70',
                                 24: '21-45',
                                 25: '45-60',
                                 26: '45-60',
                                 27: '21-45',
                                 28: '21-45',
                                 29: '21-45'},
                   'WALLSMATERIAL_MODE': {20: 'Stone, brick',
                                          21: np.nan,
                                          22: 'Panel',
                                          23: 'Mixed',
                                          24: 'Panel',
                                          25: 'Stone, brick',
                                          26: 'Wooden',
                                          27: np.nan,
                                          28: np.nan,
                                          29: 'Panel'}})
I tried adapting the following function from this post, which works for median imputation and handles groups whose median is NaN:
IN:
def mode(s):
    if pd.isnull(s.mode()):
        return df['WALLSMATERIAL_MODE'].mode()
    return s.mode()
df['WALLSMATERIAL_MODE'] = df['WALLSMATERIAL_MODE'].groupby([df['NAME_EDUCATION_TYPE'], df['AGE_GROUP']], dropna=False).apply(lambda x: x.fillna(mode(x)))
OUT: The following error is raised when pd.isnull is called
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
I do not understand why: when I apply pd.isnull to each of the group modes, it does not raise this error. See the group modes below.
IN:
df['WALLSMATERIAL_MODE'].groupby([df['NAME_EDUCATION_TYPE'], df['AGE_GROUP']]).agg(pd.Series.mode).to_dict()
OUT:
{('Higher education', '60-70'): nan,
('Higher education', '45-60'): nan,
('Higher education', '21-45'): 'Panel',
('Higher education', '0-21'): nan,
('Secondary / secondary special', '60-70'): 'Mixed',
('Secondary / secondary special', '45-60'): 'Stone, brick',
('Secondary / secondary special', '21-45'): 'Panel',
('Secondary / secondary special', '0-21'): nan}
If anyone can tell me where the mistake is, or whether there is an effective way to impute this column by groups, I will be thankful!
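A likely cause of the error, as far as I can tell: Series.mode() always returns a Series (there can be several modes), so pd.isnull(s.mode()) is itself a boolean Series, and putting a Series in an `if` raises the ambiguity error. The output above suggests that .agg(pd.Series.mode) reduces a single mode to a scalar, which would explain why pd.isnull works fine on those values. A minimal check, using a small made-up Series `s` rather than the real data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["Panel", np.nan, "Panel"])  # made-up group, for illustration only

m = s.mode()                 # a Series, even when there is a single mode
print(type(m).__name__)
is_null = pd.isnull(m)       # element-wise check, so this is a Series too

try:
    if is_null:              # same pattern as `if pd.isnull(s.mode()):`
        pass
    raised = False
except ValueError:           # "The truth value of a Series is ambiguous..."
    raised = True

print(raised)
```

Reducing the mode to a scalar first (for example with .iloc[0], after checking that it is not empty) avoids the ambiguous truth test.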
The code below seems to do the trick using try/except. I would rather avoid try/except, but I could not figure out a cleaner way.
def mode_cats(s):
    try:
        if pd.isnull(s.mode().any()):  # check if the mode of the subgroup is NaN or contains NaN
            # (mode() may indeed return a list of several modes)
            m = app_train_dash['WALLSMATERIAL_MODE'].mode().iloc[0]  # returns the mode of the column
        else:
            m = s.mode().iloc[0]  # returns the mode of the subgroup
        return m
    except IndexError:  # mode returns an empty series if the subgroup consists of a single NaN value
        # this causes s.mode().iloc[0] to raise an index error
        return app_train_dash['WALLSMATERIAL_MODE'].mode().iloc[0]
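One possible way to drop the try/except is to test whether the group's mode is empty before indexing it. The sketch below is only an illustration under assumptions: the helper names group_mode_or_fallback and impute_mode_by_group are hypothetical, the small DataFrame is made up, ties are resolved by taking the first mode, and the column is assumed to have at least one non-NaN value overall.

```python
import numpy as np
import pandas as pd


def group_mode_or_fallback(s, fallback):
    """Return the first mode of `s`, or `fallback` if `s` has no non-NaN value."""
    modes = s.mode()  # NaN is dropped by default, so an all-NaN group gives an empty Series
    return fallback if modes.empty else modes.iloc[0]


def impute_mode_by_group(df, column, group_cols):
    """Fill NaN in `column` with its group's mode, else with the column-wide mode."""
    fallback = df[column].mode().iloc[0]  # assumes at least one non-NaN value in the column
    # transform broadcasts the scalar returned for each group back onto that group's rows
    group_modes = df.groupby(group_cols, dropna=False)[column].transform(
        lambda s: group_mode_or_fallback(s, fallback)
    )
    return df[column].fillna(group_modes)


# Made-up data: the last group contains only NaN and falls back to the column mode.
demo = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Secondary", "Secondary", "Secondary", "Higher", "Higher"],
    "AGE_GROUP": ["21-45", "21-45", "21-45", "21-45", "60-70"],
    "WALLSMATERIAL_MODE": [np.nan, "Panel", "Panel", "Wooden", np.nan],
})
demo["WALLSMATERIAL_MODE"] = impute_mode_by_group(
    demo, "WALLSMATERIAL_MODE", ["NAME_EDUCATION_TYPE", "AGE_GROUP"]
)
print(demo)
```

Checking modes.empty covers the all-NaN group directly, so no exception handling is needed for that case.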
As @Ben.T pointed out, I had to use .iloc[0] with .mode(). But then I get IndexError: single positional indexer is out-of-bounds when the .mode().iloc[0] has an empty array as input.
Traceback of the error: