machine-learning, data-science, decision-tree, h2o, categorical-data

Is there a way to use decision trees with categorical variables without one-hot encoding?


I have a dataset with 200+ categorical variables (non-ordinal) and just a few continuous variables. I have tried one-hot encoding, but that greatly increases the number of dimensions and results in a poor score.
It seems that the regular scikit-learn tree can only be used with categorical variables that have been one-hot encoded (for non-ordinal variables), and I was wondering if there's a way to build a tree without one-hot encoding. I did some research and found a library called h2o that might be useful, but I'm trying to find a way to run it on my local machine.


Solution

  • You can install the h2o-3 package for Python, for example from h2o.ai/downloads or from PyPI.

    The h2o package handles categorical values automatically and efficiently. It is recommended not to one-hot encode them first.

    You can find lots of documentation at docs.h2o.ai.
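
As a rough illustration of the steps above, here is a minimal sketch of training a tree-based model on categorical columns without one-hot encoding. It assumes the package was installed locally with `pip install h2o` and that a Java runtime is available, since `h2o.init()` starts (or connects to) an H2O instance on the local machine. The toy data, the column names (`color`, `city`, `size`, `label`), and the helper functions are hypothetical, and a random forest stands in for a single decision tree; adapt them to your own dataset.

```python
import random


def make_toy_columns(n_rows=200, seed=0):
    """Build hypothetical toy data: two non-ordinal categorical columns,
    one numeric column, and a binary label."""
    rng = random.Random(seed)
    colors = [rng.choice(["red", "green", "blue"]) for _ in range(n_rows)]
    cities = [rng.choice(["paris", "tokyo", "lima", "oslo"]) for _ in range(n_rows)]
    sizes = [rng.uniform(0.0, 10.0) for _ in range(n_rows)]
    # The label depends on a category and on the numeric column.
    labels = ["yes" if (c == "red" or s > 7.0) else "no"
              for c, s in zip(colors, sizes)]
    return {"color": colors, "city": cities, "size": sizes, "label": labels}


def train_without_one_hot(columns, categorical, target):
    """Train an H2O random forest directly on categorical (factor) columns."""
    # Imported here so the data helper above works without h2o installed.
    import h2o
    from h2o.estimators import H2ORandomForestEstimator

    h2o.init()  # starts or connects to a local H2O instance (requires Java)

    frame = h2o.H2OFrame(columns)
    # Mark the categorical predictors and the target as factors (enums),
    # so no manual one-hot encoding is needed.
    for name in categorical + [target]:
        frame[name] = frame[name].asfactor()

    train, valid = frame.split_frame(ratios=[0.8], seed=1)
    predictors = [c for c in frame.columns if c != target]

    # "enum" asks H2O to split on the categorical levels themselves
    # instead of expanding them into one-hot columns.
    model = H2ORandomForestEstimator(ntrees=50, seed=1,
                                     categorical_encoding="enum")
    model.train(x=predictors, y=target,
                training_frame=train, validation_frame=valid)
    return model


if __name__ == "__main__":
    data = make_toy_columns()
    model = train_without_one_hot(data, categorical=["color", "city"],
                                  target="label")
    print("validation AUC:", model.auc(valid=True))
```

For a real dataset you would typically load the file with `h2o.import_file(...)` instead of building the frame from a Python dict, then convert the categorical columns with `asfactor()` in the same way.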