Tags: machine-learning, decision-tree, one-hot-encoding, label-encoding

Encoding categorical columns - Label encoding vs one hot encoding for Decision trees


Given the way decision trees and random forests work, using splitting logic, I was under the impression that label encoding would not be a problem for these models, since we are going to split on the column anyway. For example, if we have gender as 'male', 'female' and 'other', label encoding turns it into 0, 1, 2, which is interpreted as 0 < 1 < 2. But since we are going to split on the column, I thought it didn't matter whether we split on 'male' or '0'. However, when I tried both label and one-hot encoding on the dataset, one-hot encoding gave better accuracy and precision. Can you kindly share your thoughts?
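For context, here is a minimal sketch of the kind of comparison described, on a hypothetical toy dataset (the column names, values, and splits below are assumptions, not taken from the original dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data: a single categorical feature and a binary target.
df = pd.DataFrame({
    "gender": ["male", "female", "other", "female", "male", "other"] * 20,
    "target": [0, 1, 0, 1, 0, 0] * 20,
})

# Label encoding: one integer column (alphabetical: female=0, male=1, other=2).
X_label = df["gender"].astype("category").cat.codes.to_frame()
# One-hot encoding: one binary column per category.
X_onehot = pd.get_dummies(df["gender"], prefix="gender")

results = {}
for name, X in [("label", X_label), ("one-hot", X_onehot)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, df["target"], test_size=0.3, random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, clf.predict(X_te))
print(results)
```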

The ACCURACY SCORE of various models on train and test are:

The accuracy score of simple decision tree on label encoded data :    TRAIN: 86.46%     TEST: 79.42%
The accuracy score of tuned decision tree on label encoded data :     TRAIN: 81.74%     TEST: 81.33%
The accuracy score of random forest ensembler on label encoded data:  TRAIN: 82.26%     TEST: 81.63%
The accuracy score of simple decision tree on one hot encoded data :  TRAIN: 86.46%     TEST: 79.74%
The accuracy score of tuned decision tree on one hot encoded data :   TRAIN: 82.04%     TEST: 81.46%
The accuracy score of random forest ensembler on one hot encoded data:TRAIN: 82.41%     TEST: 81.66%

The PRECISION SCORE of various models on train and test are:

The precision score of simple decision tree on label encoded data :             TRAIN: 78.26%   TEST: 57.92%
The precision score of tuned decision tree on label encoded data :              TRAIN: 66.54%   TEST: 64.6%
The precision score of random forest ensembler on label encoded data:           TRAIN: 70.1%    TEST: 67.44%
The precision score of simple decision tree on one hot encoded data :           TRAIN: 78.26%   TEST: 58.84%
The precision score of tuned decision tree on one hot encoded data :            TRAIN: 68.06%   TEST: 65.81%
The precision score of random forest ensembler on one hot encoded data:         TRAIN: 70.34%   TEST: 67.32%





Solution

  • You can see it as a regularization effect: your model is simpler and therefore more generalizable, so you get better performance.

    Taking your example of the sex feature: [male, female, other] with label encoding becomes [0, 1, 2].

    Now suppose there is a particular configuration of the other features which works only for females: the tree needs two branches to select females, one which selects sex greater than zero, and another which selects sex lower than 2.

    Instead, with one-hot encoding, you only need one branch to do the selection, say sex_female greater than zero.
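This can be checked on a tiny synthetic example. Assuming a label mapping where the relevant category lands in the middle (male=0, female=1, other=2, an assumption for illustration), the label-encoded tree needs two splits to isolate females, while the one-hot tree needs only one:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data where the target is 1 exactly for 'female'.
# Assumed mapping: male=0, female=1, other=2 (female is the middle value).
X_label = np.array([[0], [1], [2], [0], [1], [2]])
y = np.array([0, 1, 0, 0, 1, 0])

tree_label = DecisionTreeClassifier(random_state=0).fit(X_label, y)
print(tree_label.get_depth())   # 2 splits needed: sex > 0.5 AND sex < 1.5

# One-hot columns: sex_male, sex_female, sex_other.
X_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 2)
tree_onehot = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)
print(tree_onehot.get_depth())  # 1 split on sex_female is enough
```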