In my situation, I would like to encode five columns in my dataset, but the issue is that these columns have many unique values.
If I encode them with a label encoder, I introduce an artificial ordering that isn't meaningful, whereas if I use one-hot encoding or pd.get_dummies, I end up with a lot of features that add too much sparsity to the data.
I am currently dealing with a supervised learning problem, and these are the unique value counts per column:
Job_Role : Unique categorical values = 29
Country : Unique categorical values = 12
State : Unique categorical values = 14
Segment : Unique categorical values = 12
Unit : Unique categorical values = 10
I have already looked into multiple references but am not sure about the best approach. What should I do in this situation to get the fewest features with the maximum positive impact on my model?
As far as I know, OneHotEncoder is usually used in these cases, but as you said, there are many unique values in your data. I looked for a solution to this for a project before, and I came across the following approaches:
OneHotEncoder + PCA: I don't think this combination is quite right, because PCA is designed for continuous variables.
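For illustration, here is a minimal sketch of what that combination would look like with scikit-learn (assuming scikit-learn ≥ 1.2, where the dense-output flag is `sparse_output`; the sample data is made up, but the column names are the ones from your question):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real data; the column names come from the question.
X = pd.DataFrame({
    "Job_Role": ["Engineer", "Analyst", "Manager", "Engineer"],
    "Country":  ["US", "DE", "US", "IN"],
    "State":    ["CA", "BE", "NY", "KA"],
    "Segment":  ["SMB", "Enterprise", "SMB", "Mid"],
    "Unit":     ["A", "B", "A", "C"],
})

# One-hot encode, then project the dummy columns down to a few components.
# Caveat from above: running PCA on 0/1 dummy columns is statistically questionable.
ohe_pca = make_pipeline(
    OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
    PCA(n_components=3),  # arbitrary choice for illustration
)
reduced = ohe_pca.fit_transform(X)
print(reduced.shape)  # (4, 3)
```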
Entity Embeddings: I don't know this approach very well, but you can check it via the link in the title.
BinaryEncoder: I think this is useful when you have a large number of categories, since one-hot encoding would increase the dimensionality, which in turn increases model complexity. Binary encoding is a good choice for encoding categorical variables with fewer dimensions.
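As a rough sketch (assuming the category_encoders package is installed, and using made-up data with your column names), BinaryEncoder would compress, for example, the 29 Job_Role categories into ceil(log2(29)) = 5 binary columns instead of 29 dummies:

```python
import pandas as pd
import category_encoders as ce

# Toy stand-in for the real data; column names are from the question.
df = pd.DataFrame({
    "Job_Role": ["Engineer", "Analyst", "Manager", "Engineer"],
    "Country":  ["US", "DE", "US", "IN"],
})

encoder = ce.BinaryEncoder(cols=["Job_Role", "Country"])
encoded = encoder.fit_transform(df)
print(encoded.head())
# Each original column becomes roughly ceil(log2(n_categories)) 0/1 columns,
# e.g. 29 Job_Role categories fit in 5 columns instead of 29 dummies.
```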
There are some other solutions in the category_encoders library.
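For example, since you are dealing with a supervised problem, one of the library's other encoders, TargetEncoder, replaces each category with a smoothed mean of the target, giving just one numeric column per feature. A hedged sketch with made-up data:

```python
import pandas as pd
import category_encoders as ce

# Made-up example; in a real setup, fit the encoder on the training
# split only, to avoid leaking target information into the features.
X = pd.DataFrame({"Job_Role": ["Engineer", "Analyst", "Manager", "Engineer"]})
y = pd.Series([1, 0, 1, 1])

te = ce.TargetEncoder(cols=["Job_Role"])
X_encoded = te.fit_transform(X, y)  # one numeric column per categorical feature
print(X_encoded)
```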