Tags: pandas, pandas-groupby, nan, categorical-data, imputation

Mode imputation by groups in pandas (handling group modes that are NaN)


I have a categorical column "WALLSMATERIAL_MODE" containing NaN that I want to impute using the mode by the following groups ['NAME_EDUCATION_TYPE', 'AGE_GROUP']:

    NAME_EDUCATION_TYPE             AGE_GROUP   WALLSMATERIAL_MODE
20  Secondary / secondary special   45-60       Stone, brick
21  Secondary / secondary special   21-45       NaN
22  Secondary / secondary special   21-45       Panel
23  Secondary / secondary special   60-70       Mixed
24  Secondary / secondary special   21-45       Panel
25  Secondary / secondary special   45-60       Stone, brick
26  Secondary / secondary special   45-60       Wooden
27  Secondary / secondary special   21-45       NaN
28  Higher education                21-45       NaN
29  Higher education                21-45       Panel

Code for reproducibility:

import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME_EDUCATION_TYPE': {20: 'Secondary / secondary special',
  21: 'Secondary / secondary special',
  22: 'Secondary / secondary special',
  23: 'Secondary / secondary special',
  24: 'Secondary / secondary special',
  25: 'Secondary / secondary special',
  26: 'Secondary / secondary special',
  27: 'Secondary / secondary special',
  28: 'Higher education',
  29: 'Higher education'},
 'AGE_GROUP': {20: '45-60',
  21: '21-45',
  22: '21-45',
  23: '60-70',
  24: '21-45',
  25: '45-60',
  26: '45-60',
  27: '21-45',
  28: '21-45',
  29: '21-45'},
 'WALLSMATERIAL_MODE': {20: 'Stone, brick',
  21: np.nan,
  22: 'Panel',
  23: 'Mixed',
  24: 'Panel',
  25: 'Stone, brick',
  26: 'Wooden',
  27: np.nan,
  28: np.nan,
  29: 'Panel'}})

I tried adapting the following function from this post, which works for median imputation and handles group medians that are NaN:

IN:

def mode(s):
    if pd.isnull(s.mode()):
        return df['WALLSMATERIAL_MODE'].mode()
    return s.mode()
        
df['WALLSMATERIAL_MODE'] = df['WALLSMATERIAL_MODE'].groupby([df['NAME_EDUCATION_TYPE'], df['AGE_GROUP']], dropna=False).apply(lambda x: x.fillna(mode(x)))

OUT: The following error is raised when pd.isnull is called:

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I do not understand this: when I apply pd.isnull to each of the group modes, it does not raise the error. See the group modes below.

IN:

df['WALLSMATERIAL_MODE'].groupby([df['NAME_EDUCATION_TYPE'], df['AGE_GROUP']]).agg(pd.Series.mode).to_dict()

OUT:

{('Higher education', '60-70'): nan,
 ('Higher education', '45-60'): nan,
 ('Higher education', '21-45'): 'Panel',
 ('Higher education', '0-21'): nan,
 ('Secondary / secondary special', '60-70'): 'Mixed',
 ('Secondary / secondary special', '45-60'): 'Stone, brick',
 ('Secondary / secondary special', '21-45'): 'Panel',
 ('Secondary / secondary special', '0-21'): nan}

If anyone can tell me where the mistake is, or whether there is an effective way to impute this column by groups, I would be thankful!


Solution

  • The code below seems to do the trick using try/except. I would rather have avoided try/except, but I could not figure out a cleaner way.

    def mode_cats(s):
        # app_train_dash is the DataFrame holding the column (df in the question)
        try:
            if pd.isnull(s.mode().any()):  # check if the mode of the subgroup is NaN or contains NaN
                                           # (mode() may return several modes)
                m = app_train_dash['WALLSMATERIAL_MODE'].mode().iloc[0]  # mode of the whole column
            else:
                m = s.mode().iloc[0]  # first mode of the subgroup
            return m
        except IndexError:  # mode() returns an empty Series if the subgroup consists of a single NaN value,
                            # which causes s.mode().iloc[0] to raise an IndexError
            return app_train_dash['WALLSMATERIAL_MODE'].mode().iloc[0]
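
One way such a helper might be applied per group is sketched below. It is only an illustration under stated assumptions: the small frame is made up (it is not the question's data), and `mode_or_fallback` with its `fallback` parameter is a hypothetical, simplified variant of the helper that receives the column mode as an argument instead of reading it from a global DataFrame.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (not the question's data). The last row forms a
# group made up of a single NaN, the case that triggers the IndexError.
df = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Secondary", "Secondary", "Secondary",
                            "Higher", "Higher", "Higher"],
    "AGE_GROUP": ["21-45", "21-45", "21-45", "21-45", "21-45", "60-70"],
    "WALLSMATERIAL_MODE": ["Panel", np.nan, "Panel",
                           "Stone, brick", np.nan, np.nan],
})


def mode_or_fallback(s, fallback):
    """Return the first mode of s, or fallback when s has no non-NaN values."""
    try:
        return s.mode().iloc[0]
    except IndexError:  # s.mode() is empty for an all-NaN group
        return fallback


# Mode of the whole column, used when a group has no mode of its own.
column_mode = df["WALLSMATERIAL_MODE"].mode().iloc[0]

# transform broadcasts each group's scalar result back to the group's rows.
group_modes = (
    df.groupby(["NAME_EDUCATION_TYPE", "AGE_GROUP"])["WALLSMATERIAL_MODE"]
      .transform(lambda s: mode_or_fallback(s, column_mode))
)

# Only the missing entries are replaced; existing values are kept.
df["WALLSMATERIAL_MODE"] = df["WALLSMATERIAL_MODE"].fillna(group_modes)
print(df)
```

Computing the group modes with transform and then calling fillna once on the column keeps the original index, so the result can be assigned straight back to the column.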
    

    As @Ben.T pointed out, I had to use .iloc[0] with .mode(). But then I got "IndexError: single positional indexer is out-of-bounds" whenever .mode() returned an empty Series. How the error arises:

    1. mode() is called on a subgroup of one row whose value is NaN. Since mode() ignores NaN by default, it returns an empty Series for this subgroup.
    2. pd.isnull is called on that empty Series and returns an empty Series.
    3. Calling .iloc[0] on an empty Series raises the IndexError.
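
The three steps can be reproduced in isolation, which also suggests a possible way to avoid try/except: test whether the mode result is empty before indexing it. This is a minimal sketch, not a definitive fix; `group_mode_or_fallback` and its `fallback` parameter are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Step 1: a subgroup consisting of a single NaN value.
s = pd.Series([np.nan], dtype="object")
modes = s.mode()  # NaN is dropped by default, so nothing is left
print(modes.empty)

# Step 2: pd.isnull on an empty Series is itself an empty Series.
print(pd.isnull(modes).empty)

# Step 3: positional indexing into the empty result fails.
try:
    modes.iloc[0]
except IndexError as exc:
    print(type(exc).__name__, exc)


def group_mode_or_fallback(s, fallback):
    """Return the first mode of s, or fallback if s has no non-NaN values."""
    modes = s.mode()
    return fallback if modes.empty else modes.iloc[0]


print(group_mode_or_fallback(s, "Panel"))
print(group_mode_or_fallback(pd.Series(["Mixed", "Mixed", "Wooden"]), "Panel"))
```

Checking `.empty` covers the all-NaN group explicitly, so no exception handling is needed for that case.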