R: Predicting with new factor levels in mlr with regr.svm task

I am using the mlr package to predict from an SVM. If my validation set contains factor levels not present in my training data, the prediction fails, regardless of how I set fix.factors.prediction when making the SVM learner.

What is the proper way to handle this? Using e1071::svm() will return a response for new factor levels, but how can I do the same with mlr methods?




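library(dplyr)  # sample_frac(), setdiff() on data frames, filter(), %>%
library(mlr)    # makeRegrTask(), makeLearner(), train()

# Split data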
train_set <- sample_frac(iris, 4/5)
valid_set <- setdiff(iris, train_set)

# Remove all "setosa" values from the training set
train_set[train_set$Species == "setosa", "Species"] <- 
  sample(c("virginica", "versicolor"), 
         sum(train_set$Species == "setosa"), replace = TRUE)    
# Fit model
iris_task <- makeRegrTask(data = train_set, target = "Petal.Width")

svm_lrn <- makeLearner("regr.svm", fix.factors.prediction = TRUE)

svm_mod <- train(svm_lrn, iris_task)

# Predict on new factor levels
predict(svm_mod, newdata = valid_set)

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 29, 20

When using makeLearner("regr.svm", fix.factors.prediction = FALSE), I get the following error from the call to predict:

Error in scale.default(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", : length of 'center' must equal the number of columns of 'x'

Things that do work

I can generate predictions when subsetting to factor levels in the training set:

predict(svm_mod, newdata = valid_set %>% 
          filter(Species %in% train_set$Species))

No error when using a different learner:

nnet_lrn <- makeLearner("regr.nnet", fix.factors.prediction = TRUE)
nnet_mod <- train(nnet_lrn, iris_task)
predict(nnet_mod, newdata = valid_set)

Or when using the same learner directly from the package:

e1071_mod <- 
  e1071::svm(Petal.Width ~ Sepal.Length + Sepal.Width +
               Petal.Length + Species, train_set)
predict(e1071_mod, newdata = valid_set)

  • Ok, this has been a little challenging. A few things upfront:

    • e1071::svm() cannot handle factor levels in newdata that were not present when the model was fitted (Error in predict.svm: test data does not match model)
    • The manual execution of your example only runs because you did not drop the unused factor levels in train_set: "setosa" is still a level of Species even though no row carries it any more
    • The argument fix.factors.prediction did not do what it is supposed to do. I posted a temporary fix in this branch (mlr-org/mlr@fix-factors). The fix is very dirty and just a proof of concept. I might clean it up.
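
The second point can be checked without fitting any model. The following is a minimal base R sketch (the object names `is_setosa` and `dropped` are illustrative, not from the question): relabeling the rows removes the values, but the level stays attached to the factor until droplevels() is called.

```r
# Base R only; iris ships with R.
set.seed(1)
train_set <- iris

# Relabel every "setosa" row as one of the two other species
is_setosa <- train_set$Species == "setosa"
train_set$Species[is_setosa] <-
  sample(c("virginica", "versicolor"), sum(is_setosa), replace = TRUE)

# No row is "setosa" any more, but the level is still registered
table(train_set$Species)
levels(train_set$Species)

# droplevels() removes levels that no row uses
dropped <- droplevels(train_set)
levels(dropped$Species)
```

A model fitted on train_set therefore still knows a "setosa" level, whereas a model fitted on dropped does not.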

    Proof of non-working manual execution:

    library(dplyr)  # sample_frac() and setdiff() for data frames
    # Split data
    train_set <- sample_frac(iris, 4 / 5)
    valid_set <- setdiff(iris, train_set)
    # Remove all "setosa" values from the training set
    train_set[train_set$Species == "setosa", "Species"] <-
      sample(c("virginica", "versicolor"),
        sum(train_set$Species == "setosa"), replace = TRUE)
    # This is important: drop the unused "setosa" level so the model is fitted without it
    train_set = droplevels(train_set)
    e1071_mod <- e1071::svm(Petal.Width ~ Sepal.Length + Sepal.Width +
      Petal.Length + Species, train_set)
    predict(e1071_mod, newdata = valid_set)
    #> Error in scale.default(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", : length of 'center' must equal the number of columns of 'x'

    Working example using the provided fix in mlr:

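    # Install the patched branch (assumes the remotes package is installed)
    remotes::install_github("mlr-org/mlr@fix-factors")
    #> Downloading GitHub repo mlr-org/mlr@fix-factors
    library(mlr)
    library(dplyr)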
    # Split data
    train_set <- sample_frac(iris, 4 / 5)
    valid_set <- setdiff(iris, train_set)
    # Remove all "setosa" values from the training set
    train_set[train_set$Species == "setosa", "Species"] <-
      sample(c("virginica", "versicolor"),
        sum(train_set$Species == "setosa"), replace = TRUE)
    # This is important: drop the unused "setosa" level so the model is fitted without it
    train_set = droplevels(train_set)
    # Fit model
    iris_task <- makeRegrTask(data = train_set, target = "Petal.Width")
    svm_lrn <- makeLearner("regr.svm", fix.factors.prediction = TRUE)
    svm_mod <- train(svm_lrn, iris_task)
    # Predict on new factor levels
    predict(svm_mod, newdata = valid_set)
    #> Prediction: 30 observations
    #> predict.type: response
    #> threshold: 
    #> time: 0.00
    #>   truth  response
    #> 1   0.3 0.2457751
    #> 2   0.1 0.2730398
    #> 3   0.2 0.2717464
    #> 4   0.1 0.2717748
    #> 5   0.1 0.2651599
    #> 6   0.4 0.2582568
    #> ... (#rows: 30, #cols: 2)
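
If installing the patched branch is not an option, the unseen levels can also be handled on the data side before calling predict(). The sketch below is only one possible workaround and is not necessarily what the fix in mlr does internally; `align_factor_levels` is a hypothetical helper, not a function from mlr or e1071. It recodes each factor column of the new data onto the training levels, so unseen values become NA and can be filtered out or treated separately. For brevity the toy training set here filters out the "setosa" rows instead of relabeling them.

```r
# Hypothetical helper (not part of mlr or e1071): recode every factor column
# of `newdata` onto the levels found in `train`. Values outside those levels
# become NA.
align_factor_levels <- function(train, newdata) {
  for (col in intersect(names(train), names(newdata))) {
    if (is.factor(train[[col]])) {
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]))
    }
  }
  newdata
}

# Toy data from iris: the training set has neither "setosa" rows nor the level
train_set <- droplevels(iris[iris$Species != "setosa", ])
valid_set <- iris[c(1, 2, 51, 101), ]  # setosa, setosa, versicolor, virginica

aligned <- align_factor_levels(train_set, valid_set)
levels(aligned$Species)   # now identical to levels(train_set$Species)
is.na(aligned$Species)    # marks the rows whose level was never seen

# Rows that a model fitted on train_set can be asked to predict
predictable <- aligned[!is.na(aligned$Species), ]
nrow(predictable)
```

This makes the same choice as filtering with Species %in% train_set$Species, but it also keeps the factor levels of the new data consistent with the training data, which is what the SVM's internal design matrix depends on.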

