I am predicting the unit price for the different stock codes present in a dataset. There are approximately 3000 different stock codes, which are already label encoded as 1-3000.
I have a question here. The StockCode and Country fields are categorical features, and they are encoded as serial numeric values, just like simple label encoding. However, they are nominal features, not ordinal. Should we encode these features with some technique like mean encoding or frequency encoding, and would that help in any way? Otherwise, won't the model interpret these label-encoded values as having some kind of ordered relationship?
Is it a fact that for a tree-based model, label encoding is enough for (nominal) categorical features?
There are models that can handle categorical features without problems, like decision trees, random forests, etc.
If you are using other models, like neural networks or SVMs, this would be a problem: these models work with a Euclidean representation of your input features.
Imagine, for example, a Euclidean representation of input points with two features, pressure and age: every point occupies a position in that two-dimensional space, and distances between points are meaningful.
If you have a categorical feature like country, you could have a numerical encoding like this:
{"England": 0, "France": 1, "Spain": 2, "Italy": 3}
You are forcing some kind of order on your categorical values. For example, in this encoding France is between England and Spain, which means that Spain is somehow "greater" than England and France, and that England is "smaller" than France and Spain, which of course does not make sense in your Euclidean space.
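To make the problem concrete, here is a minimal sketch (using the encoding dictionary above) of how a distance-based model would see these labels:

```python
# The label encoding from the example above
encoding = {"England": 0, "France": 1, "Spain": 2, "Italy": 3}

def distance(a, b):
    """Euclidean distance between two label-encoded countries (1-D case)."""
    return abs(encoding[a] - encoding[b])

print(distance("England", "France"))  # 1
print(distance("England", "Italy"))   # 3 -- Italy looks "3x farther" away,
                                      # an artifact of the arbitrary encoding
```

A distance-based model would conclude that England is more similar to France than to Italy, purely because of the numbers we happened to assign.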
One solution to this problem is one-hot encoding, which means creating a binary feature for every label in your categorical feature.
For our example, you could make this encoding:
Country_England  Country_France  Country_Spain  Country_Italy
       0                0               0               1
       0                1               0               0
       1                0               0               0
       0                0               1               0
       0                1               0               0
This lets your model handle your categorical feature in a more meaningful way, since no artificial order is imposed between the labels.
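In practice, such a table can be produced with pandas, for instance (a sketch with made-up data; note that pandas sorts the dummy columns alphabetically, so the column order differs slightly from the table above):

```python
import pandas as pd

# Hypothetical sample of the Country column (the rows of the table above)
df = pd.DataFrame({"Country": ["Italy", "France", "England", "Spain", "France"]})

# One binary column per unique label
one_hot = pd.get_dummies(df, columns=["Country"], dtype=int)
print(one_hot)
#    Country_England  Country_France  Country_Italy  Country_Spain
# 0                0               0              1              0
# 1                0               1              0              0
# 2                1               0              0              0
# 3                0               0              0              1
# 4                0               1              0              0
```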
Unfortunately, this approach has a significant drawback: it makes your feature space explode. If your categorical feature has 100 unique values, this means 100 more features; in your case, the ~3000 stock codes alone would become ~3000 binary columns. This leads to increased model complexity and to the infamous curse of dimensionality.
In my opinion, if you have a lot of categorical features, the best approach would be to use a model capable of handling such input, like a random forest or decision tree, as in the sketch below.
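As a minimal sketch (with synthetic data standing in for your dataset, so the score itself is meaningless), a random forest can be trained directly on the label-encoded columns:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset described in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "StockCode": rng.integers(1, 3001, size=1000),   # ~3000 label-encoded codes
    "Country": rng.integers(0, 4, size=1000),        # label-encoded countries
    "UnitPrice": rng.uniform(0.5, 50.0, size=1000),  # target
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["StockCode", "Country"]], df["UnitPrice"], random_state=0
)

# Tree splits are threshold-based, so the model is far less sensitive
# to the arbitrary order of the labels than a distance-based model
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```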
Or, if you want to use such features in other models, consider one-hot encoding + feature selection, to reduce the space complexity and improve performance.
If you want to use one-hot encoding in Python, there are a lot of libraries, but I suggest OneHotEncoder from sklearn.
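For example, here is a minimal sketch combining OneHotEncoder with feature selection, as suggested above (the data is made up, and the sparse_output argument requires scikit-learn >= 1.2):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data
X = pd.DataFrame({"Country": ["Italy", "France", "England", "Spain", "France"]})
y = [3.2, 1.5, 2.8, 4.1, 1.7]  # made-up unit prices

pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
    SelectKBest(f_regression, k=2),  # keep only the 2 most informative dummies
    LinearRegression(),
)
pipeline.fit(X, y)
print(pipeline.predict(X))
```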