Search code examples
pythongroup-byfillna

Impute missing values using group-level summary statistics from a different dataframe


I want to impute missing values using the grouped summary statistics based on a different dataframe. For instance, I would like to impute missing values in numvar_original in df1 to be like numvar_ideal where the missing values are based on group-level means from df2:

df1
catvar numvar_original numvar_ideal 
1      10              10
1      NaN             5.5
2      30              30
2      NaN             6.5


df2
catvar numvar_original
1      5
1      6
2      6
2      7

# I tried the following:
df1['numvar'].fillna(df2.groupby('catvar')['numvar'].transform('mean'), inplace=True) 
# The missing values weren't replaced 

df1['numvar'].fillna(df1.groupby('catvar')['numvar'].transform('mean'), inplace=True)  
# I checked that this works in filling up the missing values but I have to use group-level mean values from df1 

Is there a way to do this without resorting to a combination of apply or map with a dictionary (since I've read that apply/map can be slower with larger datasets)?


Solution

  • Try:

    means = df2.groupby("catvar")["numvar_original"].mean().to_dict()
    
    df1 = df1.groupby("catvar", group_keys=False).apply(
        lambda x: x.fillna(means[x["catvar"].iloc[0]])
    )
    print(df1)
    

    Prints:

       catvar  numvar_original  numvar_ideal
    0       1             10.0          10.0
    1       1              5.5           5.5
    2       2             30.0          30.0
    3       2              6.5           6.5
    

    OR: Using .merge:

    means = df2.groupby("catvar")["numvar_original"].mean()
    
    df1["numvar_original"] = df1["numvar_original"].fillna(
        df1.merge(means, on="catvar")["numvar_original_y"]
    )
    print(df1)
    

    Prints:

       catvar  numvar_original  numvar_ideal
    0       1             10.0          10.0
    1       1              5.5           5.5
    2       2             30.0          30.0
    3       2              6.5           6.5