
tidymodels: "following required column is missing from `new_data` in step..."


I'm creating and fitting a workflow for a lasso regression model in {tidymodels}. The model fits fine, but when I try to predict on the test set I get an error saying "the following required column is missing from `new_data`". That column ("price") is in both the train and test sets. Is this a bug? What am I missing?

Any help would be greatly appreciated.

library(tidymodels)

# split the data (target variable in house_sales_df is "price")
split <- initial_split(house_sales_df, prop = 0.8)
train <- split %>% training()
test <-  split %>% testing()

# create and fit workflow
lasso_prep_recipe <-
  recipe(price ~ ., data = train) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric())

lasso_model <- 
  linear_reg(penalty = 0.1, mixture = 1) %>% 
  set_engine("glmnet")

lasso_workflow <- workflow() %>% 
  add_recipe(lasso_prep_recipe) %>% 
  add_model(lasso_model)

lasso_fit <- lasso_workflow %>% 
  fit(data = train)

# predict test set
predict(lasso_fit, new_data = test)

predict() results in this error:

Error in `step_normalize()`:
! The following required column is missing from `new_data` in step 'normalize_MXQEf': price.
Backtrace:
  1. stats::predict(lasso_fit, new_data = test, type = "numeric")
  2. workflows:::predict.workflow(lasso_fit, new_data = test, type = "numeric")
  3. workflows:::forge_predictors(new_data, workflow)
  5. hardhat:::forge.data.frame(new_data, blueprint = mold$blueprint)
  7. hardhat:::run_forge.default_recipe_blueprint(...)
  8. hardhat:::forge_recipe_default_process(...)
 10. recipes:::bake.recipe(object = rec, new_data = new_data)
 12. recipes:::bake.step_normalize(step, new_data = new_data)
 13. recipes::check_new_data(names(object$means), object, new_data)
 14. cli::cli_abort(...)

Solution

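  • You are getting the error because all_numeric() in step_normalize() selects the outcome, price, along with the numeric predictors. The outcome isn't available to the recipe at predict time, so the step cannot find the column it was trained on. Use all_numeric_predictors() instead and the prediction should work.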

    library(tidymodels)

    # split the data (target variable in house_sales_df is "price")
    split <- initial_split(house_sales_df, prop = 0.8)
    train <- split %>% training()
    test <-  split %>% testing()
    
    # create and fit workflow
    lasso_prep_recipe <-
      recipe(price ~ ., data = train) %>%
      step_zv(all_predictors()) %>%
      step_normalize(all_numeric_predictors())
    
    lasso_model <- 
      linear_reg(penalty = 0.1, mixture = 1) %>% 
      set_engine("glmnet")
    
    lasso_workflow <- workflow() %>% 
      add_recipe(lasso_prep_recipe) %>% 
      add_model(lasso_model)
    
    lasso_fit <- lasso_workflow %>% 
      fit(data = train)
    
    # predict test set
    predict(lasso_fit, new_data = test)
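
If you want to check this yourself, here is a minimal sketch that compares the two selectors outside of a workflow. It assumes the built-in mtcars data as a stand-in for house_sales_df, with mpg playing the role of price; the object names (rec_all, rec_pred, new_obs, err) are made up for illustration. It lists the columns each trained step expects, then bakes data without the outcome column to imitate what happens at predict time.

```r
library(recipes)

# Assumption: mtcars stands in for house_sales_df, and mpg for price.
rec_all <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric())             # also selects the outcome

rec_pred <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())  # predictors only

# Columns each trained step_normalize() expects to find in new data
cols_all  <- unique(tidy(prep(rec_all),  number = 1)$terms)
cols_pred <- unique(tidy(prep(rec_pred), number = 1)$terms)

"mpg" %in% cols_all   # TRUE: the step was trained on the outcome
"mpg" %in% cols_pred  # FALSE: the outcome is left alone

# Imitate predict time: new data arrives without the outcome column
new_obs <- mtcars[, setdiff(names(mtcars), "mpg")]

# The all_numeric() recipe fails here; capture the message instead of stopping
err <- tryCatch(
  bake(prep(rec_all), new_data = new_obs),
  error = function(e) conditionMessage(e)
)

# The predictors-only recipe bakes the same data without complaint
baked <- bake(prep(rec_pred), new_data = new_obs)
```

The exact wording stored in err may differ between {recipes} versions, but the first recipe should fail and the second should not.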