Search code examples
pythonpandasnumpymissing-data

Missing Values Imputation with the Mode of another column


In continuation to my previous post, I have a dataframe with 'id', 'x1' columns. I would like replace the missing value in 'x1' with the most frequent value from 'x1' for the respective 'id'.

For instanace, missing value is present in 'x1', against 'id' = 1. The missing value replacement should be 'e', cause 'e' is the frequent/mode (3 times) value for 'id' = 1.

df1 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 1, 1, 1, 4, 5],
                   'x1': ['a', np.nan, 'b', 'c', 'e', 'e', 'e', 'f', 'g', 'h']})

enter image description here

Here's the logic: credits: @renzo21


dict_id_to_x1_mode = df1.groupby('id')['x1'].agg(lambda x: pd.Series.mode(x)[0]).to_dict()
df1.loc[df['x1'].isna(), 'x1'] = df1.loc[df1['x1'].isna(), 'id'].map(dict_id_to_x1_mode)

It works fine on df1 but, I get ValueError: 0 is not in range for the df2 dataframe:

df2 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 1, 1, 1, 4, 5, 5, 6,np.nan],                    
                   'x1': ['a', np.nan, 'b', 'c', 'e', 'e', 'e', 'f', 'g', 'h', 'j', np.nan, 'k' ]}) 

enter image description here

It is possbily because of the missing values in x1 and I tried to add dropna=True in mode function but it gives the same error. Any workaround to fix the error would be helpful.

dict_id_to_x1_mode = df2.groupby('id')['x1'].agg(lambda x: pd.Series.mode(x, dropna=True)[0]).to_dict() 

Solution

  • If need set NaN for groups with all missing values use next with iter trick:

    dict_id_to_x1_mode = df2.groupby('id')['x1'].agg(lambda x: next(iter(x.mode()), np.nan)).to_dict()
    print (dict_id_to_x1_mode)
    {1.0: 'e', 2.0: 'b', 3.0: 'c', 4.0: 'g', 5.0: 'h', 6.0: nan}