Tags: r, time-series, xgboost, tidymodels

Recipe for XGBoost tidymodels. Error: unused argument (values)


I am currently experimenting with hyperparameter tuning for XGBoost regression on a time series, using a Latin hypercube sampling strategy. When I run the code below, all models fail during tune_grid(). The cause seems to be the recipe object: I used step_dummy() to transform the value column of my univariate time series. The .notes column shows the error message: preprocessor 1/1: Error: unused argument (values)

I found other posts where this issue came up, but none of the solutions helped in my case.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
library(timetk)
library(tidymodels)
library(modeltime)
library(tictoc)


dates <- ymd("2016-01-01")+ months(0:59)
fake_values <- 
  c(64,61, 90,138,240,141,123, 9,180,95,84,69,76,104,122,183,200,268,225,
    132,84,159,64,131,98,138,179,187,303,257,175,133,145,36,3,134,137,308,
    84,114,310,266,123,131,87,94,86,100,105,147,159,232,312,337,285,188,257,10,98,27
  )
df <- bind_cols(fake_values, dates) %>% 
  rename(c(values = ...1, dates = ...2)
  )

# training- and test set
data_splits <- initial_time_split(df, prop = 0.8)
data_train  <- training(data_splits)
data_test   <- testing(data_splits)

resampling_strategy <- 
  data_train %>%
  time_series_cv(
    initial = "12 months",
    assess = "3 months",
    skip = "3 months",
    cumulative  = TRUE,
    slice_limit = 3
)

# recipe
basic_rec <- recipe(values ~ ., data = data_train)  %>% 
  step_dummy(all_nominal(values), -all_outcomes()) 

basic_rec %>% prep()

Solution

  • It looks like the underlying problem is that the date predictor is never converted to numeric values, which xgboost needs. You did use step_dummy(), but dates are not factor/nominal variables, so all_nominal() does not select them. (The "unused argument (values)" message itself most likely comes from all_nominal(values): that selector takes no arguments.) If you select the date column explicitly, this is what happens:

    library(tidymodels)
    #> Registered S3 method overwritten by 'tune':
    #>   method                   from   
    #>   required_pkgs.model_spec parsnip
    library(lubridate)
    #> 
    #> Attaching package: 'lubridate'
    #> The following objects are masked from 'package:base':
    #> 
    #>     date, intersect, setdiff, union
    
    dates <- ymd("2016-01-01") + months(0:59)
    fake_values <- 
      c(64,61, 90,138,240,141,123, 9,180,95,84,69,76,104,122,183,200,268,225,
        132,84,159,64,131,98,138,179,187,303,257,175,133,145,36,3,134,137,308,
        84,114,310,266,123,131,87,94,86,100,105,147,159,232,312,337,285,188,257,10,98,27
      )
    df <- bind_cols(fake_values, dates) %>% 
      rename(c(values = ...1, dates = ...2)
      )
    #> New names:
    #> * NA -> ...1
    #> * NA -> ...2
    
    # training- and test set
    data_splits <- initial_time_split(df, prop = 0.8)
    data_train  <- training(data_splits)
    data_test   <- testing(data_splits)
    
    basic_rec <- recipe(values ~ ., data = data_train) %>% 
      step_dummy(dates) 
    
    basic_rec %>% prep() %>% bake(new_data = NULL)
    #> Warning: The following variables are not factor vectors and will be ignored:
    #> `dates`
    #> Error: The `terms` argument in `step_dummy` did not select any factor columns.
    

    Created on 2021-10-27 by the reprex package (v2.0.1)

    You probably want to handle dates with something like step_date().
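
To make the step_date() suggestion concrete, here is a minimal sketch rather than a definitive fix. It assumes month and year are the date features worth keeping, removes the raw date column with step_rm(), and only then dummy-encodes the resulting month factor. Those feature choices, the use of tibble() to build the data frame, and the object names rec and baked are illustrative assumptions, not part of the original answer.

```r
library(tidymodels)
library(lubridate)

dates <- ymd("2016-01-01") + months(0:59)
fake_values <-
  c(64,61, 90,138,240,141,123, 9,180,95,84,69,76,104,122,183,200,268,225,
    132,84,159,64,131,98,138,179,187,303,257,175,133,145,36,3,134,137,308,
    84,114,310,266,123,131,87,94,86,100,105,147,159,232,312,337,285,188,257,10,98,27
  )
# Build the data frame with explicit column names (avoids the rename() step)
df <- tibble(values = fake_values, dates = dates)

data_splits <- initial_time_split(df, prop = 0.8)
data_train  <- training(data_splits)

rec <- recipe(values ~ ., data = data_train) %>%
  # Derive a month factor and a numeric year from the Date column
  step_date(dates, features = c("month", "year")) %>%
  # Drop the raw Date column, which xgboost cannot use directly
  step_rm(dates) %>%
  # Note: all_nominal() is called without arguments
  step_dummy(all_nominal(), -all_outcomes())

baked <- rec %>% prep() %>% bake(new_data = NULL)
glimpse(baked)
```

After baking, every remaining column should be numeric, so the recipe can be passed to a workflow with an xgboost model. Whether month and year are enough for this series is a modelling decision; lagged values or other calendar features may be worth trying as well.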