Search code examples
python-3.xpandaspandas-groupbynanimputation

Fill missing values by group using most frequent value


I am trying to impute missing values using the most frequent value by a group using the pandas module in Python. After checking some posts on Stack Overflow, I have managed to achieve that:

import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "A", "A", "B", "B", "B"],
                   "value": [1, 1, 1, np.nan, 2, np.nan, np.nan]})
df.groupby("group").transform(lambda x: x.fillna(x.mode().iloc[0]))

Running this code will fill with 1 the missing entry for group "A" and with 2 both missing entries for group "B". However, let's assume that one of the groups contains only missing data (in this case group "B"):

df1 = pd.DataFrame({"group": ["A", "A", "A", "A", "B", "B", "B"],
                   "value": [1, 1, 1, np.nan, np.nan, np.nan, np.nan]})
df1.groupby("group").transform(lambda x: x.fillna(x.mode().iloc[0]))

Running the code above will prompt an IndexError: single positional indexer is out-of-bounds. I would expect the normal behaviour is to keep np.nan because if you run the method mode for just, let's say, group "B" from df1:

df1[df1.group == "B"].mode()

I will get that NaN is the most common value. How can I avoid this issue?


Solution

  • Running the code above will prompt an IndexError: single positional indexer is out-of-bounds

    This is because transform gets to be passed each column as a series and at some point it will see the value column on its own; and if you do:

    df1[df1.group == "B"].value.mode()
    

    you get

    Series([], dtype: float64)
    

    hence the index-out-of-bounds like error as it is empty and iloc[0] doesn't exist.

    OTOH, when you do:

    df1[df1.group == "B"].mode()
    

    mode is calculated on a dataframe not a series and pandas decides to give a NaN on the full-NaN column i.e. value column here.

    So one remedy is to use apply instead of transform to pass a dataframe instead of individual series to your lambda:

    df1.groupby("group").apply(lambda x: x.fillna(x.mode().iloc[0])).reset_index(drop=True)
    

    to get

      group  value
    0     A    1.0
    1     A    1.0
    2     A    1.0
    3     A    1.0
    4     B    NaN
    5     B    NaN
    6     B    NaN