python-3.x pandas pandas-groupby nan imputation

Fill missing values by group using most frequent value

I am trying to impute missing values using the most frequent value by a group using the pandas module in Python. After checking some posts on Stack Overflow, I have managed to achieve that:

import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "A", "A", "B", "B", "B"],
                   "value": [1, 1, 1, np.nan, 2, np.nan, np.nan]})
df.groupby("group").transform(lambda x: x.fillna(x.mode().iloc[0]))

Running this code will fill with 1 the missing entry for group "A" and with 2 both missing entries for group "B". However, let's assume that one of the groups contains only missing data (in this case group "B"):

df1 = pd.DataFrame({"group": ["A", "A", "A", "A", "B", "B", "B"],
                   "value": [1, 1, 1, np.nan, np.nan, np.nan, np.nan]})
df1.groupby("group").transform(lambda x: x.fillna(x.mode().iloc[0]))

Running the code above will prompt an IndexError: single positional indexer is out-of-bounds. I would expect the normal behaviour is to keep np.nan because if you run the method mode for just, let's say, group "B" from df1:

df1[df1.group == "B"].mode()

I will get that NaN is the most common value. How can I avoid this issue?

Solution

Running the code above will prompt an IndexError: single positional indexer is out-of-bounds

This is because transform gets to be passed each column as a series and at some point it will see the value column on its own; and if you do:

df1[df1.group == "B"].value.mode()

you get

Series([], dtype: float64)

hence the index-out-of-bounds like error as it is empty and iloc[0] doesn't exist.

OTOH, when you do:

df1[df1.group == "B"].mode()

mode is calculated on a dataframe not a series and pandas decides to give a NaN on the full-NaN column i.e. value column here.

So one remedy is to use apply instead of transform to pass a dataframe instead of individual series to your lambda:

df1.groupby("group").apply(lambda x: x.fillna(x.mode().iloc[0])).reset_index(drop=True)

to get

  group  value
0     A    1.0
1     A    1.0
2     A    1.0
3     A    1.0
4     B    NaN
5     B    NaN
6     B    NaN