I understand from reading various answers 1,2,3, that the train function from caret will create dummy variables to deal with factors that have multiple levels.
Here is an example using mtcars (the model is useless other than to illustrate the point):
library(caret)
library(rattle)
df <- mtcars
df$cyl <- factor(df$cyl)
df$mpg_bound <- ifelse(df$mpg > 20, "good", "bad")
tc <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
mod <- as.formula(mpg_bound ~ cyl)
set.seed(666)
m1 <- train(mod, data = df,
            method = "rpart",
            preProcess = c("center", "scale"),
            trControl = tc)
fancyRpartPlot(m1$finalModel)
m1$finalModel
n= 32
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 32 14 bad (0.5625000 0.4375000)
2) cyl8>=0.124004 14 0 bad (1.0000000 0.0000000) *
3) cyl8< 0.124004 18 4 good (0.2222222 0.7777778) *
I don't understand this part: cyl8>=0.124004. I get that cyl8 is the dummy variable for the factor, but what does it mean that cyl8>=0.124004?
I'd like to extend the existing answer, because I don't think the conclusion reached in the comments is true. As you say, when using the formula interface, caret's train function will transform factor variables into dummy variables that only take the values 0 or 1, e.g. cyl8 == 1 means 'the car has 8 cylinders'. Each dummy variable makes a statement about a characteristic that is either true or false for the observation.
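To make the dummy coding concrete, here is a minimal sketch that uses base R's model.matrix as a stand-in for the expansion the formula interface performs. That train() does exactly this internally is an assumption here; the point is only that cyl8 ends up as a 0/1 column.

```r
df <- mtcars
df$cyl <- factor(df$cyl)

# Expand the factor into treatment-coded dummy columns
# (base R stand-in for what the formula interface does)
mm <- model.matrix(~ cyl, data = df)

colnames(mm)                               # "(Intercept)" "cyl6" "cyl8"
table(cyl = df$cyl, cyl8 = mm[, "cyl8"])   # cyl8 is 1 only for 8-cylinder cars
```

The first level (cyl4) has no column of its own: it is the reference level, identified by cyl6 == 0 and cyl8 == 0.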
rpart will nevertheless report a numeric value as the split criterion, so cyl8 >= 0.5, cyl8 >= 0.2 and cyl8 == 1 all mean the same thing: "This car has exactly 8 cylinders". For a binary dummy, rpart places the split in the middle between the two observed values 0 and 1, which is why it reports cyl8 >= 0.5 to indicate that the dummy is true. The interpretation of cyl8 >= 0.5 is then "Does the car have 8 cylinders?" (and not "Does the car have more than 8 cylinders?").
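A quick way to convince yourself that the threshold value is arbitrary: build the dummy by hand (the variable names below are just for illustration) and compare the partitions that different thresholds produce.

```r
# Hand-built 0/1 dummy for "the car has 8 cylinders"
cyl8 <- as.numeric(mtcars$cyl == 8)

# Three ways of writing the same split
split_05 <- cyl8 >= 0.5
split_02 <- cyl8 >= 0.2
split_eq <- cyl8 == 1

# All three select exactly the same cars
identical(split_05, split_02) && identical(split_05, split_eq)

# Number of cars sent to the "cyl8 is true" branch
sum(split_05)
```

Any threshold strictly between 0 and 1 gives the same 14 cars, which matches the n of node 2 in the tree.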
df <- mtcars
df$cyl <- factor(df$cyl)
df$mpg_bound <- ifelse(df$mpg > 20, "good", "bad")
library(caret)
tc <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
mod <- as.formula(mpg_bound ~ cyl)  # same formula as in the question
set.seed(166)
m1 <- train(mod, data = df,
            method = "rpart",
            #preProcess = c("center", "scale"),
            trControl = tc,
            metric = "ROC")
m1$finalModel
#1) root 32 14 bad (0.5625000 0.4375000)
#2) cyl8>=0.5 14 0 bad (1.0000000 0.0000000) *
#3) cyl8< 0.5 18 4 good (0.2222222 0.7777778) *
The confusing value in your example arises because caret apparently applies the pre-processing to the expanded dataset, in which the dummies are ordinary numeric variables. The interpretation stays the same, but the (arbitrary) split value is centered and scaled along with the dummy. You can reproduce the value by hand:
# Transform to dummies
mm <- model.matrix(mpg_bound ~ .-1, data = df)
# Do pre-processing
pp <- preProcess(mm, method = c("center", "scale"))
mm.pp <- as.matrix(predict(pp, mm))
# The split sits midway between the two scaled values of the dummy
(max(mm.pp[, "cyl8"]) + min(mm.pp[, "cyl8"])) / 2
# [1] 0.124004
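The same number can be obtained without caret, which also shows how to map a scaled split back to the original 0/1 scale. This is a sketch assuming that "center" and "scale" subtract the column mean and divide by the sample standard deviation; the variable names are illustrative.

```r
# Hand-built 0/1 dummy for "the car has 8 cylinders"
cyl8 <- as.numeric(mtcars$cyl == 8)

mu <- mean(cyl8)  # centering constant
s  <- sd(cyl8)    # scaling constant (sample standard deviation)

# The natural split 0.5, expressed on the centered and scaled axis
scaled_split <- (0.5 - mu) / s
round(scaled_split, 6)   # 0.124004

# Back-transform a scaled split to the original dummy scale
scaled_split * s + mu
```

So cyl8 >= 0.124004 on the pre-processed data is the same statement as cyl8 >= 0.5 on the raw dummy.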