Tags: r, xgboost, one-hot-encoding, tidymodels

Error in validate_column_names(): Missing required columns after applying recipe in Tidymodels workflow with XGBoost


I'm encountering an issue when using tidymodels with xgboost in a workflow. After applying a recipe that includes step_dummy() to convert categorical variables into dummy variables, I receive the following error when trying to make predictions:

Error in `validate_column_names()`:
! The following required columns are missing: 'A', 'B', 'C', 'D'.

Here's a simplified version of my code:

library(tidymodels)
library(xgboost)
library(dplyr)

set.seed(123)
datensatz <- tibble(
  outcome = rnorm(100, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
  B = factor(sample(c("e", "f", "g"), 100, replace = TRUE)),
  C = factor(sample(1:3, 100, replace = TRUE)),
  D = factor(sample(c("a", "b"), 100, replace = TRUE))
)

# splitting
data_split <- initial_split(datensatz, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)


# Recipe
recipe_obj <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%  
  step_zv(all_predictors()) %>%  
  step_normalize(all_numeric_predictors())  

prepared_recipe <- prep(recipe_obj)
test_data_prepared <- bake(prepared_recipe, new_data = test_data)

# XGBoost model specification
xgboost_spec <- boost_tree(
  trees = 1000,                    
  tree_depth = 6,                  
  min_n = 10,                      
  loss_reduction = 0.01,           
  sample_size = 0.8,               
  mtry = 0.8,                      
  learn_rate = 0.01                
) %>%
  set_mode("regression") %>%
  set_engine("xgboost", count = FALSE, colsample_bytree = 0.8)

# Workflow
workflow_obj <- workflow() %>%
  add_recipe(recipe_obj) %>%
  add_model(xgboost_spec)

# Train the model
xgboost_fit <- fit(workflow_obj, data = train_data)

# Model prediction on the prepared test data
predictions <- predict(xgboost_fit, new_data = test_data_prepared)
# Error occurs here

# Results
predictions

I suspect the issue is related to the fact that step_dummy() removes the original categorical columns (A, B, C, D) and replaces them with dummy variables. However, the workflow seems to expect the original columns when making predictions.

How can I resolve this issue and ensure that the prediction step correctly uses the dummy variables created by step_dummy()?

Additional Info:

I'm using the `xgboost` engine within the `tidymodels` framework.
The error message suggests that the workflow expects the original categorical variables, but these are no longer present after applying `step_dummy()`.

Solution

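  • If you are using a recipe in a workflow, you don't need to prep() and bake() the test data set yourself. predict() on a fitted workflow applies the trained recipe to new_data, so it expects the original columns A, B, C, and D. Passing already-baked data, in which step_dummy() has replaced those columns, is what triggers the validate_column_names() error. So you can delete the following lines: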

    prepared_recipe <- prep(recipe_obj)
    test_data_prepared <- bake(prepared_recipe, new_data = test_data)
    

    and predict with predict(xgboost_fit, new_data = test_data) instead of predict(xgboost_fit, new_data = test_data_prepared):

    library(tidymodels)
    library(xgboost)
    library(dplyr)
    
    set.seed(123)
    datensatz <- tibble(
      outcome = rnorm(100, mean = 60, sd = 10),
      A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
      B = factor(sample(c("e", "f", "g"), 100, replace = TRUE)),
      C = factor(sample(1:3, 100, replace = TRUE)),
      D = factor(sample(c("a", "b"), 100, replace = TRUE))
    )
    
    # splitting
    data_split <- initial_split(datensatz, prop = 0.75)
    train_data <- training(data_split)
    test_data <- testing(data_split)
    
    # Recipe
    recipe_obj <- recipe(outcome ~ ., data = train_data) %>%
      step_dummy(all_nominal(), -all_outcomes()) %>%  
      step_zv(all_predictors()) %>%  
      step_normalize(all_numeric_predictors())  
    
    # XGBoost model specification
    xgboost_spec <- boost_tree(
      trees = 1000,                    
      tree_depth = 6,                  
      min_n = 10,                      
      loss_reduction = 0.01,           
      sample_size = 0.8,               
      mtry = 0.8,                      
      learn_rate = 0.01                
    ) %>%
      set_mode("regression") %>%
      set_engine("xgboost", count = FALSE, colsample_bytree = 0.8)
    
    # Workflow
    workflow_obj <- workflow() %>%
      add_recipe(recipe_obj) %>%
      add_model(xgboost_spec)
    
    # Train the model
    xgboost_fit <- fit(workflow_obj, data = train_data)
    
    # Model prediction on the raw test data (the workflow applies the recipe)
    predictions <- predict(xgboost_fit, new_data = test_data)
    
    # Results
    predictions
    #> # A tibble: 25 × 1
    #>    .pred
    #>    <dbl>
    #>  1  62.9
    #>  2  58.2
    #>  3  57.8
    #>  4  59.5
    #>  5  60.0
    #>  6  61.9
    #>  7  58.2
    #>  8  61.4
    #>  9  60.7
    #> 10  54.9
    #> # ℹ 15 more rows
    

    Created on 2024-08-30 with reprex v2.1.1
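
If you still want to look at the dummy-encoded test data, or predict from it directly, one option is to pull the trained recipe and the underlying parsnip model out of the fitted workflow. The sketch below is a minimal illustration, not part of the original answer: the data set `toy`, the object names (`wf_fit`, `trained_recipe`, `baked_test`), and the small `trees = 50` setting are assumptions chosen to keep the example short.

```r
library(tidymodels)

set.seed(123)
# Hypothetical toy data, smaller than the data set in the question
toy <- tibble(
  outcome = rnorm(100, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
  D = factor(sample(c("a", "b"), 100, replace = TRUE))
)
data_split <- initial_split(toy, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

rec <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())

spec <- boost_tree(trees = 50) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec) %>%
  fit(data = train_data)

# The workflow stores the recipe as it was trained on train_data
trained_recipe <- extract_recipe(wf_fit)

# Bake the raw test data; only predictor columns are kept
baked_test <- bake(trained_recipe, new_data = test_data, all_predictors())
names(baked_test)  # dummy columns instead of the original factors A and D

# Route 1: the workflow takes raw data and bakes it internally
pred_workflow <- predict(wf_fit, new_data = test_data)

# Route 2: the bare parsnip model takes the already-baked predictors
pred_manual <- predict(extract_fit_parsnip(wf_fit), new_data = baked_test)

# Both routes should give the same predictions
all.equal(pred_workflow$.pred, pred_manual$.pred)
```

The point of the comparison is that baked data belongs with the extracted parsnip fit, while the workflow itself always wants the raw columns. For everyday use, Route 1 is simpler and harder to get wrong.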