Search code examples
pythonpandasgroup-bymissing-dataimputation

Pandas per group imputation of missing values


How can I achieve such a per-country imputation for each indicator in pandas?

I want to impute the missing values per group

  • no-A-state should get np.min per indicatorKPI
  • no-ISO-state should get the np.mean per indicatorKPI
  • for states with missing values, I want to impute with the per indicatorKPI mean. Here, this would mean to impute the missing values for Serbia

    mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state','germany','serbia', 'austria', 'germany','serbia', 'austria',], 'indicatorKPI':[np.nan,np.nan,'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'], 'value':[np.nan,np.nan,0.9,np.nan,0.7, 0.2, 0.3, 0.6]}) enter image description here

edit

The desired output should be similar to

mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state', 'no-A-state','no-ISO-state',
                                'germany','serbia','serbia', 'austria', 
                                'germany','serbia', 'austria',],
                   'indicatorKPI':['SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                                   'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
                     'value':['MIN of all for this indicator', 'MEAN of all for this indicator','MIN of all for this indicator','MEAN of all for this indicator', 0.9,'MEAN of all for SP.DYN.LE00.IN indicator',0.7, 'MEAN of all for NY.GDP.MKTP.CD indicator',0.2, 0.3, 0.6]
                   })

enter image description here


Solution

  • Based on your new example df the following works for me:

    In [185]:
    mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min()
    mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean()
    mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
    mydf
    
    Out[185]:
             Country    indicatorKPI     value
    0     no-A-state  SP.DYN.LE00.IN  0.200000
    1   no-ISO-state  NY.GDP.MKTP.CD  0.442857
    2     no-A-state  SP.DYN.LE00.IN  0.200000
    3   no-ISO-state  SP.DYN.LE00.IN  0.442857
    4        germany  NY.GDP.MKTP.CD  0.900000
    5         serbia  SP.DYN.LE00.IN  0.328571
    6         serbia  NY.GDP.MKTP.CD  0.700000
    7        austria  NY.GDP.MKTP.CD  0.585714
    8        germany  SP.DYN.LE00.IN  0.200000
    9         serbia  NY.GDP.MKTP.CD  0.300000
    10       austria  SP.DYN.LE00.IN  0.600000
    

    Basically what this does is to fill the missing values for each condition, so we set the min for the 'no-A-state' countries, then mean for 'no-ISO-state' countries. We then groupby on 'indicatorKPI' and calc the mean for each group and assign again to the null value rows, the respective countries' mean using map which performs a lookup

    Here are the steps broken down:

    In [187]:
    mydf.groupby('indicatorKPI')['value'].mean()
    
    Out[187]:
    
    indicatorKPI
    NY.GDP.MKTP.CD    0.633333
    SP.DYN.LE00.IN    0.400000
    Name: value, dtype: float64
    
    In [188]:
    mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
    
    Out[188]:
    0     0.400000
    1     0.633333
    2     0.400000
    3     0.400000
    4     0.633333
    5     0.400000
    6     0.633333
    7     0.633333
    8     0.400000
    9     0.633333
    10    0.400000
    Name: indicatorKPI, dtype: float64