R tidymodels - step_novel does not work when combined in a workflow with resampling


I am using step_novel() in my workflow to avoid errors caused by new factor levels. However, it does not appear to work during resampling: I keep getting an error about new levels.

Here is a reproducible example:

# Create sample data
library(tidymodels)
library(tidyverse)

set.seed(123)
num_samples <- 100
outcome <- rnorm(num_samples, mean = 50, sd = 10)
numeric_predictor <- rnorm(num_samples, mean = 30, sd = 5)
categorical_predictor <- as.factor(sample(letters[1:4], num_samples, replace = TRUE))

# Create dataframe
df <- data.frame(
  Outcome = outcome,
  Numeric_Predictor = numeric_predictor,
  Categorical_Predictor = categorical_predictor
)


new_row <- data.frame(
  Outcome = 55,
  Numeric_Predictor = NA,
  Categorical_Predictor = "problem"
)

# Add the new row to the dataframe
df <- rbind(df, new_row)


lr_full_preprocessing <- 
  recipe(Outcome ~ ., data = df) %>%
  # knn-imputation 
  step_impute_knn(all_predictors(), neighbors = 1, options = list(nthread = 8, eps = 1e-08)) %>%
  # novel categories
  step_novel(all_nominal_predictors())

lr_full <- linear_reg() %>%
  set_engine("lm")

lr_full_wf <- workflow() %>%
  add_recipe(lr_full_preprocessing, blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>%
  add_model(lr_full)

set.seed(123)  # for reproducibility
folds <- vfold_cv(df, v = 10, repeats = 5)

lr_full_fit_rs <- lr_full_wf %>% 
  fit_resamples(folds)

collect_metrics(lr_full_fit_rs)

I get the following error:

> lr_full_fit_rs <- lr_full_wf %>% 
+   fit_resamples(folds)
→ A | error:   factor Categorical_Predictor has new levels problem
There were issues with some computations   A: x5 <---------------------

After some research, I wonder whether this is related to this issue/bug: https://github.com/tidymodels/recipes/issues/1249


Solution

  • This happens because lm() doesn't handle unseen levels well. It only uses the levels it actually observed when fitting, even if the factor declares additional levels. I filed an issue at https://github.com/tidymodels/parsnip/issues/1084 in case we figure out a better way of handling this.

    So setting step_novel() by itself isn't enough. Since lm() will create dummy variables of the categorical predictors, you can "fix" this issue by doing that in the recipe instead.

    Adding step_dummy() and step_zv() will create the dummy variables and then remove any zero-variance predictor, which in our case is the dummy column for the novel level.

    lr_full_preprocessing <- 
      recipe(Outcome ~ ., data = df) %>%
      # knn-imputation 
      step_impute_knn(all_predictors(), neighbors = 1, options = list(nthread = 8, eps = 1e-08)) %>%
      # novel categories
      step_novel(all_nominal_predictors()) %>%
      step_dummy(all_nominal_predictors()) %>%
      # drop dummy columns that are constant in the training data
      step_zv(all_predictors())
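
The lm() behaviour described above can be reproduced without tidymodels. The following is a minimal base-R sketch, not the code that recipes or parsnip run internally. The toy data and the names train, test, fit, fit_d, X_train, X_test and keep are made up for illustration. The second half is only a rough base-R analogue of what step_dummy() followed by step_zv() achieves.

```r
# Toy data: "problem" is a declared level of g but has no rows in train.
set.seed(1)
lvls <- c("a", "b", "problem")
train <- data.frame(
  y = rnorm(20),
  g = factor(rep(c("a", "b"), each = 10), levels = lvls)
)
test <- data.frame(g = factor("problem", levels = lvls))

# lm() drops unused levels, so it only records "a" and "b".
fit <- lm(y ~ g, data = train)
fit$xlevels$g

# Predicting on a row with the unrecorded level raises an error;
# here the message is captured instead of stopping the script.
res <- tryCatch(
  predict(fit, newdata = test),
  error = function(e) conditionMessage(e)
)
res

# Rough analogue of step_dummy() + step_zv(): build 0/1 columns first,
# then drop the columns that are constant in the training data.
X_train <- model.matrix(~ g, data = train)[, -1, drop = FALSE]
keep <- apply(X_train, 2, function(col) length(unique(col)) > 1)
train_d <- data.frame(y = train$y, X_train[, keep, drop = FALSE])
fit_d <- lm(y ~ ., data = train_d)

# The same encoding applied to the new row now predicts without error.
X_test <- model.matrix(~ g, data = test)[, -1, drop = FALSE]
pred <- predict(fit_d, newdata = data.frame(X_test[, keep, drop = FALSE]))
pred
```

In this sketch the row with the unseen level gets zeros in every remaining dummy column, so its prediction equals the model intercept, i.e. it is treated like the reference level.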