randomForest in R and factor variables


I have a dataset with some continuous variables, some ordinal variables and some categorical qualitative variables.

I would like to use a random forest classifier (I have a categorical outcome), but I am not sure how to treat the ordinal and categorical features, which are both coded as factors at the moment. I would like the ordinal variables to be treated as numeric and the qualitative ones to have each level as a separate dummy. How does R's randomForest normally handle factor features? Should I transform the qualitative variables into dummies and the ordinal ones into integer or numeric?


Solution

  • Factors in a model formula are encoded by introducing dummy variables (a "one-hot"-style coding): a factor with k levels is encoded in k-1 dummy variables. How these represent the levels depends on the "contrasts" setting. You can inspect the current coding with contrasts(), e.g.

    > contrasts(iris$Species)
               versicolor virginica
    setosa              0         0
    versicolor          1         0
    virginica           0         1
    

    Encoding an ordinal variable as a factor thus adds degrees of freedom, which may or may not be what you want. If you want to keep the information about the ordering of the levels, I would just encode the ordinal variable as an integer.
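
As a rough illustration of the two encodings discussed above, here is a minimal base-R sketch. The data frame and its column names (size, colour) are made up for the example and are not from the question. It assumes the ordinal factor's levels are listed in their natural order, so that as.integer() returns meaningful ranks.

```r
# Hypothetical toy data: 'size' is ordinal, 'colour' is an unordered category.
df <- data.frame(
  size   = factor(c("small", "large", "medium", "small", "large", "medium"),
                  levels = c("small", "medium", "large")),
  colour = factor(c("red", "blue", "green", "blue", "red", "green"))
)

# Ordinal variable: keep the ordering by mapping the levels to 1, 2, 3.
# as.integer() on a factor returns the level index, so the 'levels'
# argument above must list the categories from lowest to highest.
df$size_int <- as.integer(df$size)

# Qualitative variable: one 0/1 column per level (no intercept column).
colour_dummies <- model.matrix(~ colour - 1, data = df)

# Columns used by each coding of 'size', not counting the intercept.
df_factor  <- ncol(model.matrix(~ size, data = df)) - 1      # k - 1 dummies
df_integer <- ncol(model.matrix(~ size_int, data = df)) - 1  # a single column
```

With three levels, the factor coding of size uses two columns while the integer coding uses one, which is the extra freedom mentioned above. The integer column and the dummy columns could then be combined with the continuous variables (for example with cbind()) before fitting a classifier.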