Using ordinal variables in rpart and caret without converting to dummy categorical variables


I am trying to build an ordinal regression tree in R using rpart. Most of the predictors are ordinal data, stored as factors in R.

When I create the tree using rpart, I get something like this:

[Image: tree produced by rpart]

where the split values are factor levels (e.g. A170 has labels ranging from -5 to 10).

However, when I train the same model through caret with method rpart and extract the final model, the tree no longer treats the predictors as ordinal. A sample output tree:

[Image: tree from the caret final model]

As shown above, the ordinal variable A170 appears to have been converted into multiple dummy variables; for example, A17010 in the second tree is the dummy for A170 equal to 10.

So, is it possible to retain ordinal variables instead of converting factor variables into multiple binary indicator variables when fitting trees with the caret package?


Solution

  • Let's start with a reproducible example:

    set.seed(144)
    dat <- data.frame(x=factor(sample(1:6, 10000, replace=TRUE)))
    # y is 1 with probability 0.1 for levels 1-2, 0.4 for levels 3-4, and 0.7 for levels 5-6
    dat$y <- ifelse(dat$x %in% 1:2, runif(10000) < 0.1,
                    ifelse(dat$x %in% 3:4, runif(10000) < 0.4, runif(10000) < 0.7)) * 1
    

    As you note, training with the rpart function groups the factor levels together:

    library(rpart)
    rpart(y~x, data=dat)
    

    [Image: output of the rpart call]

    I was able to reproduce caret splitting the factor into its individual levels by using the formula interface to the train function, which expands factors into dummy variables before the model is fitted:

    library(caret)
    train(y~x, data=dat, method="rpart")$finalModel
    

    [Image: output of train with the formula interface]

    To avoid splitting factors by level, pass the predictors to the train function as a data frame through its x/y interface instead of using the formula interface:

    train(x=data.frame(dat$x), y=dat$y, method="rpart")$finalModel
    

    [Image: output of train with the x/y interface]
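
The difference between the two interfaces can be checked without fitting any model. The sketch below uses only base R and a small made-up data frame (`toy`, not the data from the question), on the assumption that the formula interface builds its predictors the way `model.matrix` does. It shows a six-level factor becoming five indicator columns, named by pasting the variable name and the level together, which matches names like A17010 in the question.

```r
# Small illustrative data frame (hypothetical, not the data from the question)
toy <- data.frame(
  x = factor(c(1, 2, 3, 4, 5, 6, 1, 2)),
  y = c(0, 0, 1, 0, 1, 1, 0, 1)
)

# A formula-built design matrix turns the 6-level factor into 5 indicator
# columns; the first level is the baseline and gets no column.
design <- model.matrix(y ~ x, data = toy)
dummy_cols <- setdiff(colnames(design), "(Intercept)")
print(dummy_cols)

# A data frame passed directly keeps the factor as one column, so a tree
# fitted on it can still group levels within a single split.
raw_cols <- names(toy["x"])
print(raw_cols)
print(is.factor(toy$x))
```

Running the same check on your own predictors before calling train is a quick way to see which columns the model will actually receive.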