Tags: r, r-factor, r-caret

Does extractPrediction() support factors?


I am trying to use a random forest model as one of several models that I am testing, including neural networks (nnet and neuralnet), all via the convenient caret package. Since random forests support factors, I thought I would keep the factor predictors as-is for this model rather than convert them to numeric contrasts with dummyVars(). This works fine in the training step (train()):

library(caret)

#Set dependent
seed = 123
y = "Sepal.Length"

#Partition (iris) data into train and test sets
set.seed(seed)
train.idx = createDataPartition(y = iris[,y], p = .8, list = FALSE)
train.set = iris[train.idx,]
test.set = iris[-train.idx,]

train.set = data.frame(train.set)
test.set = data.frame(test.set)

#Select features
features = c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")
mod.features = paste(features, collapse = " + ")

#Create formula
mod.formula = as.formula(paste(y, mod.features, sep = " ~ "))

#Train model
mod <- train(mod.formula, data = train.set,
             method = "rf")

but when I try to use extractPrediction(), it fails:

#Test model with extractPrediction()
testPred = extractPrediction(models = list(mod),
                             testX = test.set[,features],
                             testY = test.set[,y])

Error in predict.randomForest(modelFit, newdata) : variables in the training data missing in newdata

Now, as far as I can see, this is because during the call to train(), one-hot encodings / contrasts are created for the factors, and so some new variable names are created. The base predict() method seems to work fine even with factors:
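You can see this expansion directly with base R's model.matrix(), which performs the same contrast encoding the formula interface applies before the model ever sees the data:

```r
# model.matrix() expands the Species factor into numeric contrast
# columns; their names (e.g. "Speciesversicolor") do not exist in the
# raw test.set, which is what the predict.randomForest error is about.
mm <- model.matrix(Sepal.Length ~ Sepal.Width + Petal.Length +
                     Petal.Width + Species, data = iris)
colnames(mm)
# "(Intercept)"  "Sepal.Width"  "Petal.Length"  "Petal.Width"
# "Speciesversicolor"  "Speciesvirginica"
```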

#Test model with predict()
testPred = predict(mod$finalModel, 
                   newData = test.set[, features])

And when I convert my factors to numeric contrasts using dummyVars(), extractPrediction() works fine:

#Train and test model using dummyVar
data.dummies = dummyVars(~ ., data = iris)
data = predict(data.dummies, newdata = iris)

set.seed(seed)
train.idx = createDataPartition(y = data[,y], p = .8, list = FALSE)
train.set = data[train.idx,]
test.set = data[-train.idx,]

features = c("Sepal.Width", "Petal.Length", "Petal.Width", "Species.setosa",
             "Species.versicolor", "Species.virginica")
mod.features = paste(features, collapse = " + ")

#Create formula
mod.formula = as.formula(paste(y, mod.features, sep = " ~ "))

train.set = data.frame(train.set)
test.set = data.frame(test.set)

mod <- train(mod.formula, data = train.set,
             method = "rf")

testPred = extractPrediction(models = list(mod),
                             testX = test.set[,features],
                             testY = test.set[,y])

Can anyone explain why this is? It would be great to get extractPrediction() working with factors for use in my multi-model testing pipeline. I suppose I could just convert everything with dummyVars() at the beginning, but I'm intrigued to know why extractPrediction() fails with factors in this case even when predict() works.


Solution

  • If you use the default (x/y) function interface instead of the formula interface, you should be in business.

    set.seed(1234)
    mod_formula <- train(
        Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
      , data = iris
      , method = "rf")
    
    test_formula <- extractPrediction(
        models = list(mod_formula)
    )
    
    set.seed(1234)
    mod_default <- train(
        y = iris$Sepal.Length
      , x = iris[, c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species')]
      , method = "rf")
    
    test_default <- extractPrediction(
      models = list(mod_default)
    )
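One way to see the difference the two interfaces make is to compare the predictor names each fit stored (a quick check: caret records these in the finalModel's xNames field, though exact names may vary by caret version):

```r
library(caret)  # assumes the caret and randomForest packages are installed

set.seed(1234)
mod_formula <- train(Sepal.Length ~ Sepal.Width + Petal.Length +
                       Petal.Width + Species,
                     data = iris, method = "rf")

set.seed(1234)
mod_default <- train(x = iris[, c("Sepal.Width", "Petal.Length",
                                  "Petal.Width", "Species")],
                     y = iris$Sepal.Length, method = "rf")

# The formula fit stores the expanded contrast names, so newdata that
# still contains the raw Species factor cannot be matched up; the
# default fit keeps the factor column itself:
mod_formula$finalModel$xNames  # includes e.g. "Speciesversicolor"
mod_default$finalModel$xNames  # includes "Species"
```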