python, pandas, pandas-groupby, apply, fillna

Fill NaN values in test data with mean values from train data in pandas


I am trying to fill NaN values in the test data with the mean values from the train data, for multiple columns, grouped by a column. Here is part of the test data:


    date_block_num  shop_id     item_id     item_category_id    target  item_price  avg_item_price  sum_item_cnt_day    avg_item_cnt_day    shop_avg_item_price     ...     avg_item_cnt_day_lag_12     shop_avg_item_price_lag_12  shop_sum_item_cnt_day_lag_12    shop_avg_item_cnt_day_lag_12    category_avg_item_price_lag_12  category_sum_item_cnt_day_lag_12    category_avg_item_cnt_day_lag_12    shop_avg_item_price_per_category_lag_12     shop_sum_item_cnt_per_category_lag_12   shop_avg_item_cnt_per_category_lag_12
0   26.5    5   5037    19.0    0.928571    1788.897788     1934.764286     90.714286   1.937141    868.822366  ...     0.383736    619.341077  181.571429  0.029328    716.813821  779.214286  0.084066    716.052585  10.285714   0.056515
1   NaN     5   5320    NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   30.0    5   5233    19.0    1.428571    854.861715  842.778086  72.428571   1.685456    914.767445  ...     0.000000    597.460870  0.000000    0.000000    591.507516  0.000000    0.000000    591.790514  0.000000    0.000000
3   32.0    5   5232    23.0    0.333333    728.018465  790.297277  47.000000   1.100087    965.966832  ...     0.000000    597.460870  0.000000    0.000000    591.507516  0.000000    0.000000    591.790514  0.000000    0.000000
4   NaN     5   5268    NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

5 rows × 102 columns

So I want to replace the NaNs with the mean value of each column, grouped by item_id.

First, I know that I can get the mean of the columns of the train data grouped by item_id as follows:

mt = train.groupby('item_id').apply(lambda x: np.mean(x))

Then I tried to use fillna for each column in the test set as follows:

for col in test.columns:
    test[col] = test.groupby('item_id')[col].apply(lambda x: x.fillna...)

I cannot figure out how to fill with the mean values from the train data rather than the test data. How can I do it? And is this the best way, or is there a better way? Thanks.


Solution

  • If your train and test dataframes have the same shape (# rows, # cols)

    We can use pandas.DataFrame.combine_first for this, but not in combination with the aggregation that happens in groupby, since combine_first replaces NaN with the value at the same index and column label in the other dataframe, and a groupby aggregation collapses the frame to one row per item_id, so its index no longer lines up with the test rows.

    That's why we need to use pandas.DataFrame.transform, since it keeps the shape of our dataframe the same:

    import numpy as np
    import pandas as pd

    # make two example dataframes
    train = pd.DataFrame({'item_id': [5037, 5320, 5037, 5320],
                          'num1': [10, 8, 9, 5],
                          'num2': [3, 5, 1, 9]})

    test = pd.DataFrame({'item_id': [5037, 5320, 5037, 5320],
                         'num1': [6, np.nan, 3, 7],
                         'num2': [np.nan, 4, np.nan, 9]})
    
    print(train, '\n')
    print(test)
    
       item_id  num1  num2
    0     5037    10     3
    1     5320     8     5
    2     5037     9     1
    3     5320     5     9 
    
       item_id  num1  num2
    0     5037   6.0   NaN
    1     5320   NaN   4.0
    2     5037   3.0   NaN
    3     5320   7.0   9.0
    

    We apply groupby.transform and combine_first

    train_means = train.groupby('item_id').transform('mean')
    test.combine_first(train_means)
    
       item_id  num1  num2
    0     5037   6.0   2.0
    1     5320   6.5   4.0
    2     5037   3.0   2.0
    3     5320   7.0   9.0
    
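    As a side note, the same fill can also be written with pandas.DataFrame.fillna, which, like combine_first, aligns on the row index and column labels. This is only a sketch of an equivalent alternative; it relies on the same assumption as above, namely that test and train share the same index and carry the same item_id in each row:

    # alternative to combine_first: fillna with a DataFrame fills each NaN
    # from the value at the matching index/column label in train_means
    train_means = train.groupby('item_id').transform('mean')
    test.fillna(train_means)

    This gives the same filled frame as the combine_first call above.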

    If your train and test dataframes DO NOT have the same shape (# rows, # cols), it gets a bit more complex.

    We can do the following:

    1. We get the mean of the values for each item_id with pandas.groupby.mean.
    2. After that we use pandas.DataFrame.merge to bring the per-item_id means from the train dataframe into our test dataframe.
    3. Then we make a dict of the corresponding column names and conditionally fill each NaN with the value of the matching column that came from the train dataframe, using np.where.

    train_grp = train.groupby('item_id').mean().reset_index()
    
    print(train_grp)
       item_id  num1  num2
    0     5037   9.5   2.0
    1     5320   6.5   7.0
    

    Apply the merge. Note that merge defaults to an inner join; if the test set can contain item_ids that are missing from train, pass how='left' so those rows are kept (their NaNs will simply stay NaN).

    test_merged = test.merge(train_grp, on='item_id', suffixes=['_test', '_train'])
    
    print(test_merged)
       item_id  num1_test  num2_test  num1_train  num2_train
    0     5037        6.0        NaN         9.5         2.0
    1     5037        3.0        NaN         9.5         2.0
    2     5320        NaN        4.0         6.5         7.0
    3     5320        7.0        9.0         6.5         7.0
    

    Create dictionary of corresponding columns

    test_cols = [col for col in test_merged.columns if 'test' in col]
    train_cols = [col for col in test_merged.columns if 'train' in col]
    dict_cols = dict(zip(test_cols, train_cols))
    
    print(dict_cols)
    {'num1_test': 'num1_train', 'num2_test': 'num2_train'}
    

    Conditionally replace the NaNs

    # use distinct loop-variable names so the test/train dataframes are not shadowed
    for test_col, train_col in dict_cols.items():
        test_merged[test_col] = np.where(test_merged[test_col].isnull(),
                                         test_merged[train_col],
                                         test_merged[test_col])
    
    # Clean up dataframe
    test_merged.drop(train_cols, axis=1, inplace=True)
    test_merged.columns = test_merged.columns.str.replace('_test', '')
    
    print(test_merged)
       item_id  num1  num2
    0     5037   6.0   2.0
    1     5037   3.0   2.0
    2     5320   6.5   4.0
    3     5320   7.0   9.0
    
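    As an aside, the np.where loop is not strictly needed: starting from test_merged right after the merge (before the clean-up), the *_train columns can be renamed to their *_test counterparts and the NaNs filled with a single column-aligned fillna. A minimal sketch under the same suffix-naming convention:

    # rename the *_train columns to the matching *_test names so that
    # fillna can align them column by column, then fill all NaNs at once
    train_part = test_merged[train_cols].rename(columns=dict(zip(train_cols, test_cols)))
    test_merged[test_cols] = test_merged[test_cols].fillna(train_part)

    The clean-up (dropping the *_train columns and stripping the '_test' suffix) then stays the same.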

    Explanation
    np.where works as follows: np.where(condition, value_if_true, value_if_false)
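
    Finally, for a wide frame like the one in the question (many numeric columns, all filled per item_id), the merge/lookup idea can be wrapped into a small helper. This is only a sketch: the function name fill_from_train_means is made up here, and it assumes the columns to fill are numeric and present in both frames.

    import pandas as pd

    def fill_from_train_means(train, test, key='item_id'):
        """Fill NaNs in test with the per-key means computed on train (sketch)."""
        # per-group means from the train data, indexed by the grouping key
        train_means = train.groupby(key).mean(numeric_only=True)

        filled = test.copy()
        for col in train_means.columns.intersection(test.columns):
            # look up each row's group mean via the key, then use it
            # only where the test value is missing
            group_mean = filled[key].map(train_means[col])
            filled[col] = filled[col].fillna(group_mean)
        return filled

    With the toy frames above, fill_from_train_means(train, test) reproduces the filled result shown earlier; test rows whose item_id never appears in train simply keep their NaNs.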