python pandas dataframe pandas-groupby fillna

How to fill missing values in a dataframe based on group value counts?


I have a pandas DataFrame with two columns: year (int) and condition (string). The condition column contains a NaN value, and I want to replace it using information from a groupby operation.

import pandas as pd 
import numpy as np

year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ['good', 'good', 'excellent', 'good', 'excellent', 'excellent', np.nan, 'good', 'excellent', 'good']

X = pd.DataFrame({'year': year, 'condition': cond})
stat = X.groupby('year')['condition'].value_counts()

It gives:

print(X)
   year  condition
0  2015       good
1  2016       good
2  2017  excellent
3  2016       good
4  2016  excellent
5  2017  excellent
6  2015        NaN
7  2016       good
8  2015  excellent
9  2015       good

print(stat)
year  condition
2015  good         2
      excellent    1
2016  good         3
      excellent    1
2017  excellent    2

The NaN value is in the row at index 6, where year = 2015. From stat I can see that the most frequent condition for 2015 is 'good', so I want to replace this NaN with 'good'.

I have tried combining fillna with the .transform method, but it does not work.

I would be grateful for any help.


Solution

  • I did a little extra transformation to turn stat into a dictionary mapping each year to its most frequent condition (credit to this answer):

    In[0]:
    fill_dict = stat.unstack().idxmax(axis=1).to_dict()
    fill_dict
    
    Out[0]:
    {2015: 'good', 2016: 'good', 2017: 'excellent'}
    

    Then use fillna with map based on this dictionary (credit to this answer):

    In[0]:
    X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
    X
    
    Out[0]:
       year  condition
    0  2015       good
    1  2016       good
    2  2017  excellent
    3  2016       good
    4  2016  excellent
    5  2017  excellent
    6  2015       good
    7  2016       good
    8  2015  excellent
    9  2015       good
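
Since the question mentions trying fillna together with .transform, here is a minimal sketch of how that route might look. It is an alternative to the dictionary approach above, not a replacement for it. The helper name fill_with_group_mode is made up for illustration, and the sketch assumes that on a tie the first mode in sorted order is an acceptable choice.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", "excellent", "excellent",
        np.nan, "good", "excellent", "good"]
X = pd.DataFrame({"year": year, "condition": cond})


def fill_with_group_mode(group):
    """Hypothetical helper: fill NaN in one group with that group's mode."""
    modes = group.mode()  # mode() drops NaN by default and returns sorted values
    if modes.empty:       # every value in the group is missing: nothing to fill with
        return group
    return group.fillna(modes.iloc[0])  # on a tie, take the first mode in sorted order


# transform applies the helper per year and returns a Series aligned with X's index.
X["condition"] = X.groupby("year")["condition"].transform(fill_with_group_mode)
print(X)
```

With the sample data, the row at index 6 should end up as 'good', matching the result above. A year whose conditions are all missing is left unchanged by this sketch.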