I am trying to fill NaN values in test data with the mean value from the train data, on multiple columns, grouped by a column. Here is part of the test data:
date_block_num shop_id item_id item_category_id target item_price avg_item_price sum_item_cnt_day avg_item_cnt_day shop_avg_item_price ... avg_item_cnt_day_lag_12 shop_avg_item_price_lag_12 shop_sum_item_cnt_day_lag_12 shop_avg_item_cnt_day_lag_12 category_avg_item_price_lag_12 category_sum_item_cnt_day_lag_12 category_avg_item_cnt_day_lag_12 shop_avg_item_price_per_category_lag_12 shop_sum_item_cnt_per_category_lag_12 shop_avg_item_cnt_per_category_lag_12
0 26.5 5 5037 19.0 0.928571 1788.897788 1934.764286 90.714286 1.937141 868.822366 ... 0.383736 619.341077 181.571429 0.029328 716.813821 779.214286 0.084066 716.052585 10.285714 0.056515
1 NaN 5 5320 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 30.0 5 5233 19.0 1.428571 854.861715 842.778086 72.428571 1.685456 914.767445 ... 0.000000 597.460870 0.000000 0.000000 591.507516 0.000000 0.000000 591.790514 0.000000 0.000000
3 32.0 5 5232 23.0 0.333333 728.018465 790.297277 47.000000 1.100087 965.966832 ... 0.000000 597.460870 0.000000 0.000000 591.507516 0.000000 0.000000 591.790514 0.000000 0.000000
4 NaN 5 5268 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 rows × 102 columns
So I want to replace the NaNs with the mean value of each column, grouped by item_id.
First, I know that I can get the mean of the columns of the train data grouped by item_id as follows:
mt = train.groupby('item_id').apply(lambda x: np.mean(x))
Then I try to use fillna for each column in the test set as follows:
for col in test.columns:
    test[col] = test.groupby('item_id')[col].apply(lambda x: x.fillna...)
I cannot work out how to replace the NaNs with the mean values from the train set rather than the test set. How can I do this? And is this the best way, or is there a better one? Thanks.
If your train and test dataframes have the same shape (# rows, # cols), we can use pandas.DataFrame.combine_first for this, but not in combination with the aggregation that happens in groupby, since combine_first replaces NaN with the values at the same position in the other dataframe. That's why we need to use pandas.DataFrame.transform here, since it keeps the shape of our dataframe the same:
import numpy as np
import pandas as pd

# make two example dataframes
train = pd.DataFrame({'item_id': [5037, 5320, 5037, 5320],
                      'num1': [10, 8, 9, 5],
                      'num2': [3, 5, 1, 9]})
test = pd.DataFrame({'item_id': [5037, 5320, 5037, 5320],
                     'num1': [6, np.NaN, 3, 7],
                     'num2': [np.NaN, 4, np.NaN, 9]})
print(train, '\n')
print(test)
item_id num1 num2
0 5037 10 3
1 5320 8 5
2 5037 9 1
3 5320 5 9
item_id num1 num2
0 5037 6.0 NaN
1 5320 NaN 4.0
2 5037 3.0 NaN
3 5320 7.0 9.0
We apply groupby.transform and combine_first:
train_means = train.groupby('item_id').transform('mean')
test.combine_first(train_means)
item_id num1 num2
0 5037 6.0 2.0
1 5320 6.5 4.0
2 5037 3.0 2.0
3 5320 7.0 9.0
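(As a side note, only a sketch of an equivalent idea: DataFrame.fillna also accepts a DataFrame and aligns on index and columns, so filling with the transformed train means should give the same result here, assuming train and test share the same index as in this example.)
# Sketch of an equivalent fill, assuming train and test share the same index:
# fillna aligns the given DataFrame on index and columns and fills the NaNs.
train_means = train.groupby('item_id').transform('mean')
test_filled = test.fillna(train_means)
print(test_filled)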
If your train and test dataframes DO NOT have the same shape (# rows, # cols), it gets a bit more complex. We can do the following:
1. Group the train data by item_id and aggregate the values with pandas.groupby.mean.
2. Use pandas.DataFrame.merge to bring the mean of each corresponding item_id from the train dataframe onto our test dataframe.
3. Conditionally replace the NaN with the values of the same column which came from the train dataset that groupby was applied on; we use np.where for this.
Group and aggregate the train data:
train_grp = train.groupby('item_id').mean().reset_index()
print(train_grp)
item_id num1 num2
0 5037 9.5 2.0
1 5320 6.5 7.0
Apply merge:
test_merged = test.merge(train_grp, on='item_id', suffixes=['_test', '_train'])
print(test_merged)
item_id num1_test num2_test num1_train num2_train
0 5037 6.0 NaN 9.5 2.0
1 5037 3.0 NaN 9.5 2.0
2 5320 NaN 4.0 6.5 7.0
3 5320 7.0 9.0 6.5 7.0
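(A note on the join: merge defaults to an inner join, so any test row whose item_id does not appear in train would be dropped. If that can happen in your data, how='left' keeps those rows and simply leaves their _train columns as NaN; a sketch:)
# Sketch: keep every test row, even if its item_id has no mean in train_grp
test_merged = test.merge(train_grp, on='item_id', how='left',
                         suffixes=['_test', '_train'])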
Create a dictionary of corresponding columns:
test_cols = [col for col in test_merged.columns if 'test' in col]
train_cols = [col for col in test_merged.columns if 'train' in col]
dict_cols = dict(zip(test_cols, train_cols))
print(dict_cols)
{'num1_test': 'num1_train', 'num2_test': 'num2_train'}
Conditionally replace the NaN values:
for test_col, train_col in dict_cols.items():
    test_merged[test_col] = np.where(test_merged[test_col].isnull(),
                                     test_merged[train_col],
                                     test_merged[test_col])
# Clean up dataframe
test_merged.drop(train_cols, axis=1, inplace=True)
test_merged.columns = test_merged.columns.str.replace('_test', '')
print(test_merged)
item_id num1 num2
0 5037 6.0 2.0
1 5037 3.0 2.0
2 5320 6.5 4.0
3 5320 7.0 9.0
Explanation
np.where works as follows: np.where(condition, value if true, value if false).
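Finally, closer to the per-column loop you started with, here is a sketch of an alternative (assuming you know which columns to fill, here num1 and num2): index the train means by item_id, map them onto test via item_id, and let fillna do the replacement. This avoids the merge, the suffixes and the column cleanup.
# Alternative sketch: fill each column by mapping the train means via item_id
train_means = train.groupby('item_id').mean()    # DataFrame indexed by item_id
filled = test.copy()
for col in ['num1', 'num2']:                     # the columns you want to fill
    filled[col] = filled[col].fillna(filled['item_id'].map(train_means[col]))
print(filled)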