Interpreting dummy variables created in caret train


I understand from reading various answers 1,2,3, that the train function from caret will create dummy variables to deal with factors that have multiple levels.

Here is an example using mtcars (the model is useless other than to illustrate the point):

library(caret)
library(rattle)

df <- mtcars

df$cyl <- factor(df$cyl)
df$mpg_bound <- ifelse(df$mpg > 20, "good", "bad")

tc <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)

mod <- as.formula(mpg_bound ~ cyl)

set.seed(666)

m1 <- train(mod, data = df, 
            method = "rpart", 
            preProcess = c("center", "scale"),
            trControl = tc)

fancyRpartPlot(m1$finalModel)

m1$finalModel

n= 32 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 32 14 bad (0.5625000 0.4375000)  
  2) cyl8>=0.124004 14  0 bad (1.0000000 0.0000000) *
  3) cyl8< 0.124004 18  4 good (0.2222222 0.7777778) *

I don't understand this part: cyl8>=0.124004. I get that cyl8 is the dummy variable for one level of the factor, but what does it mean that cyl8>=0.124004?


Solution

  • I'd like to extend the existing answer, because I don't think the conclusion reached in the comments is correct. As you say, when you use the formula interface, caret's train function converts factor variables into dummy variables that take only the values 0 or 1; for example, cyl8 == 1 means "the car has 8 cylinders". Each dummy variable states a characteristic that is either true or false for an observation.
    rpart nevertheless reports a numeric value as the split criterion, so cyl8 >= 0.5, cyl8 >= 0.2 and cyl8 == 1 all mean the same thing: "this car has exactly 8 cylinders". For a 0/1 dummy, rpart reports the split as cyl8 >= 0.5, the midpoint between the two values, to indicate that the dummy is true. The interpretation of cyl8 >= 0.5 is therefore "Does the car have 8 cylinders?" (and not "Does the car have more than 8 cylinders?").

    df <- mtcars
    
    df$cyl <- factor(df$cyl)
    df$mpg_bound <- ifelse(df$mpg > 20, "good", "bad")
    
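    library(caret)
    # Same formula as in the question
    mod <- as.formula(mpg_bound ~ cyl)
    tc <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
    set.seed(166)
    # Same model as in the question, but without pre-processing
    m1 <- train(mod, data = df,
                method = "rpart",
                #preProcess = c("center", "scale"),
                trControl = tc,
                metric = "ROC")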
    
    m1$finalModel
       #1) root 32 14 bad (0.5625000 0.4375000)  
         #2) cyl8>=0.5 14  0 bad (1.0000000 0.0000000) *
         #3) cyl8< 0.5 18  4 good (0.2222222 0.7777778) *
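
To make the equivalence of the thresholds concrete, here is a minimal base-R sketch. It builds the dummies with model.matrix directly instead of going through caret, on the assumption that this mirrors what the formula interface does; the names dummies, cyl8 and same_as_* are only illustrative. It checks that any cut point strictly between 0 and 1 selects exactly the 8-cylinder cars.

```r
df <- mtcars
df$cyl <- factor(df$cyl)

# Expand the factor into 0/1 dummy columns (cyl4 becomes the reference level)
dummies <- model.matrix(~ cyl, data = df)
cyl8 <- dummies[, "cyl8"]

# The dummy only ever takes the values 0 and 1
sort(unique(cyl8))

# Any cut point strictly between 0 and 1 picks out the same cars
same_as_02  <- all((cyl8 >= 0.5) == (cyl8 >= 0.2))
same_as_eq1 <- all((cyl8 >= 0.5) == (cyl8 == 1))
same_as_cyl <- all((cyl8 >= 0.5) == (df$cyl == "8"))
c(same_as_02, same_as_eq1, same_as_cyl)
```

If all three comparisons come back TRUE, the exact number rpart prints carries no information beyond "the dummy is 1".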
    

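    The confusing value in your example arises because caret apparently applies the pre-processing to the expanded dataset, in which the dummies are ordinary numeric columns. Centering and scaling moves the two dummy values away from 0 and 1, so the midpoint between them is no longer 0.5. The interpretation stays the same; only the (arbitrary) split value is transformed.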

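    # Transform to dummies ("- 1" drops the intercept, so every cyl level gets a column)
    mm <- model.matrix(mpg_bound ~ . - 1, data = df)
    
    # Apply the same pre-processing as in the question
    pp <- preProcess(mm, method = c("center", "scale"))
    mm.pp <- as.matrix(predict(pp, mm))
    
    # The split lies midway between the two values of the scaled dummy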
    (max(mm.pp[,"cyl8"]) + min(mm.pp[,"cyl8"]) ) / 2
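
The same number can be obtained without caret, which may make the arithmetic easier to follow. This is only a sketch in base R: it assumes that "center" and "scale" subtract the column mean and divide by the sample standard deviation, and the variable names are illustrative. Under that assumption it should reproduce the 0.124004 from the question, both as the midpoint of the scaled dummy and as the original midpoint 0.5 pushed through the same transformation.

```r
df <- mtcars
cyl8 <- as.numeric(df$cyl == 8)   # 0/1 dummy for "has 8 cylinders"

# Center and scale by hand (assumed equivalent to the pre-processing above):
# subtract the column mean, divide by the sample standard deviation
cyl8_scaled <- (cyl8 - mean(cyl8)) / sd(cyl8)

# Midpoint between the two values of the scaled dummy
split_scaled <- (min(cyl8_scaled) + max(cyl8_scaled)) / 2

# Equivalent view: transform the untransformed midpoint 0.5 directly
split_from_half <- (0.5 - mean(cyl8)) / sd(cyl8)

round(c(split_scaled, split_from_half), 6)
```

Both values agree, so cyl8 >= 0.124004 on the scaled data is the same rule as cyl8 >= 0.5 on the raw dummy.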