Search code examples
pythonencodingrandom-forestone-hot-encoding

When to hot encode ? Employee attrition level example


My question is the following :

I have the following categorical variables in a dataset to predict employee's attrition. categorical variables

I have currently done one hot encoding of : Job level, Job Role, Marital Status, Over18, Overtime, and keeping the same label encoding for the ordinal columns (PerformanceRating,Relationship Satisfaction and JobSatisfaction).

I will then, after splitting into train and test set, use a Random Forrest Classifier to predict the Attrition (Yes/No).

Am I doing encoding the correct way (one hot for categorical and no encoding for the ordinal columns)?

Thank you so much for helping me with this doubt !


Solution

  • First of all, you should doubt everything and take a performance test.

    In general, decision tree models either don't need one-hot encoding or even perform worse after it. There's exceptions, for sure, but in modern ML world tree models if often respected for ability to work with high-dimension categorical data without much of pre-processing.