
Why do tree-based models not need one-hot encoding for nominal data?


We usually one-hot encode nominal data so that distances or weights computed over the features are meaningful. However, I have often heard that tree-based models like random forest or boosting models do not need one-hot encoding. I have searched the Internet but still have no idea why. Can anyone tell me why, or point me to some materials that explain it?


Solution

  • but I have often heard that tree-based models like random forest or boosting models do not need one-hot encoding

    This is not necessarily true. Some implementations apply different logic to numerical and categorical variables (for example, LightGBM and CatBoost can treat categorical features natively, while scikit-learn's tree models expect numeric input), so it is best to encode categorical variables appropriately for the library you are using.

    That said, plain numerical (label) encoding is often fine for decision tree models, because they are only looking for places to split the data; they are not multiplying inputs by weights. Contrast this with a neural network or linear model, which would interpret red=1, blue=2 as meaning that blue is "twice red", which is obviously not what you want. A tree, by stacking threshold splits, can still isolate any single category even from an arbitrary integer encoding.
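
    To make the "trees only look for split points" idea concrete, here is a minimal sketch in plain Python of how a decision stump chooses a threshold by minimizing Gini impurity. The data is hypothetical (red=0, blue=1, green=2, with the label depending only on "green"); the point is that the stump separates the classes with a threshold, without ever treating the codes as magnitudes.

    ```python
    # Sketch: a single decision stump on integer-encoded colors.
    # Hypothetical encoding: red=0, blue=1, green=2.

    def gini(labels):
        """Gini impurity of a list of 0/1 labels."""
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        return 2 * p * (1 - p)

    def best_threshold_split(xs, ys):
        """Find the threshold t minimizing the weighted Gini
        impurity of the partition x <= t vs x > t."""
        best_t, best_score = None, float("inf")
        for t in sorted(set(xs))[:-1]:  # candidate split points
            left = [y for x, y in zip(xs, ys) if x <= t]
            right = [y for x, y in zip(xs, ys) if x > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
            if score < best_score:
                best_t, best_score = t, score
        return best_t, best_score

    # Labels: 1 only for green (code 2).
    xs = [0, 0, 1, 1, 2, 2]
    ys = [0, 0, 0, 0, 1, 1]
    t, score = best_threshold_split(xs, ys)
    print(t, score)  # -> 1 0.0: splitting at x <= 1 perfectly separates green
    ```

    The split "x <= 1" just carves off {red, blue} from {green}; the stump never cares that green's code is twice blue's. A linear model fitting a single coefficient on the same column would be forced to treat the codes as an ordered scale.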