pandas, numpy, scikit-learn, neural-network

How to handle categorical features in a neural network?


I have a dataset with store locations and item names, and I want to predict the sales of a particular product.

I wanted to use binary encoding or pandas get_dummies(), but there are 5000 item names and encoding them causes a memory error. Is there an alternative or a better way to handle this? Thanks all!

print(train.shape)
print(train.dtypes)
print(train.head())

(125497040, 6)
id               int64
date            object
store_nbr        int64
item_nbr         int64
unit_sales     float64
onpromotion     object
dtype: object
   id        date  store_nbr  item_nbr  unit_sales onpromotion
0   0  2013-01-01         25    103665         7.0         NaN
1   1  2013-01-01         25    105574         1.0         NaN
2   2  2013-01-01         25    105575         2.0         NaN
3   3  2013-01-01         25    108079         1.0         NaN
4   4  2013-01-01         25    108701         1.0         NaN

Solution

  • Years later, I can answer my own question with an updated approach. Besides one-hot encoding or truncating the set of categories, we can use an embedding: each category is represented by a learnable vector. This vector can be fed to the neural network or to any other machine learning algorithm, including a decision tree.
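To make the embedding idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not a production setup: the six rows of toy data, the embedding size of 8, the learning rate, and the linear output weights `w` are all made up for the example. Each distinct item_nbr is mapped to an integer index, and the index selects a row of a small table, so the wide one-hot matrix is never built. In practice you would more likely use an embedding layer from a deep learning framework, such as torch.nn.Embedding or the Keras Embedding layer.

```python
import numpy as np

# Toy stand-in for the real columns (values are illustrative only).
item_nbr = np.array([103665, 105574, 105575, 108079, 108701, 103665])
unit_sales = np.array([7.0, 1.0, 2.0, 1.0, 1.0, 7.0], dtype=np.float32)

# Map each distinct item_nbr to a dense index 0..n_items-1.
unique_items, item_idx = np.unique(item_nbr, return_inverse=True)
n_items = len(unique_items)

# Embedding table: one learnable row per item (embedding_dim is an assumed size).
embedding_dim = 8
rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(n_items, embedding_dim)).astype(np.float32)

# Lookup is plain row indexing. It gives the same result as
# one_hot @ embedding, without materialising the one-hot matrix.
item_vectors = embedding[item_idx]
one_hot = np.eye(n_items, dtype=np.float32)[item_idx]
assert np.allclose(one_hot @ embedding, item_vectors)

# "Learnable": train the table together with a linear output layer
# by gradient descent on the mean squared error.
w = rng.normal(scale=0.1, size=embedding_dim).astype(np.float32)

def mse():
    return float(np.mean((embedding[item_idx] @ w - unit_sales) ** 2))

loss_before = mse()
lr = 0.02  # assumed learning rate
for _ in range(500):
    vecs = embedding[item_idx]
    err = vecs @ w - unit_sales
    grad_w = 2 * vecs.T @ err / len(err)
    grad_vecs = 2 * np.outer(err, w) / len(err)
    # Rows that share an item accumulate into the same embedding row.
    grad_emb = np.zeros_like(embedding)
    np.add.at(grad_emb, item_idx, grad_vecs)
    w -= lr * grad_w
    embedding -= lr * grad_emb
loss_after = mse()

print(item_vectors.shape, loss_before, loss_after)
```

With 5000 items and 8 dimensions, the table holds only 40,000 floats, whereas a one-hot encoding adds 5000 columns to every one of the 125,497,040 rows.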