python · salesforce · one-hot-encoding · forecast

One hot encoding for sales prediction and kernel died


I am trying to forecast sales on a daily basis and have a dataset of 858935x15 features. Before feeding the features to the model, I want to one-hot encode the categorical features. The product-related features color, product_id, brand_id, category_id and subcategory_id are provided in int64 format and have the unique value counts below:

Count of unique values:

  1. productid -> 19359
  2. color -> 2243
  3. categoryid -> 101
  4. brandid -> 868
  5. subcategoryid -> 103

If I one-hot encode these features, the kernel dies and the dataset becomes 17.5GB O.O

I guess the problem is the unique value count of product_id. Do I strictly need to encode the categorical features, or can I leave them as they are, especially product_id?


Solution

  • I would divide your question into two parts.

    1. How to prevent the kernel from dying.

    The kernel dies when the memory limit is exceeded, and that is bound to happen here since you produce almost 20,000 new columns. In my experience, one-hot encoding is too crude a technique for color and brandid as well: adding thousands of new features blows up your data significantly. I suggest using algorithms that handle categorical features well, namely tree-based algorithms such as Random Forest (see the sklearn implementation) or gradient boosting methods (see the LightGBM library).

    2. Other ways to deal with categorical features.

    If you insist on using linear or neural-network-based models, there is a useful technique called mean encoding. You group the observations by each unique value of the feature and substitute the feature values with the mean of the target variable for each group. This technique loses some of the information contained in the feature, but it does not require blowing up your dataset. This article explains the benefits of the approach.
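    A minimal sketch of mean encoding with pandas (using a tiny made-up product_id/sales frame, not your data):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "product_id": [1, 1, 2, 2, 2, 3],
        "sales":      [10, 14, 3, 5, 4, 20],
    })

    # Mean of the target within each category...
    means = df.groupby("product_id")["sales"].mean()
    # ...replaces the category: still a single numeric column.
    df["product_id_enc"] = df["product_id"].map(means)
    # product_id 1 -> 12.0, 2 -> 4.0, 3 -> 20.0
    ```

    In practice the means should be computed on the training folds only (or with smoothing/cross-fitting), otherwise the encoding leaks the target into the features.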

    Hope it helps.