Problem when scoring new data -- tidymodels


I'm learning tidymodels. The following code runs nicely:

library(tidyverse)
library(tidymodels)

# Draw a random sample of 2000 to try the models

set.seed(1234)

diamonds <- diamonds %>%    
  sample_n(2000)
  
diamonds_split <- initial_split(diamonds, prop = 0.80, strata="price")

diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

folds <- rsample::vfold_cv(diamonds_train, v = 10, strata="price")

metric <- metric_set(rmse,rsq,mae)

# Model KNN 

knn_spec <-
  nearest_neighbor(
    mode = "regression", 
    neighbors = tune("k"),
    engine = "kknn"
  ) 

knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(all_nominal_predictors())

knn_wflow <- 
  workflow() %>% 
  add_model(knn_spec) %>%
  add_recipe(knn_rec)

knn_grid = expand.grid(k=c(1,5,10,30))

knn_res <- 
  tune_grid(
    knn_wflow,
    resamples = folds,
    metrics = metric,
    grid = knn_grid
  )

collect_metrics(knn_res)
autoplot(knn_res)

show_best(knn_res,metric="rmse")

# Best KNN 

best_knn_spec <-
  nearest_neighbor(
    mode = "regression", 
    neighbors = 10,
    engine = "kknn"
  ) 

best_knn_wflow <- 
  workflow() %>% 
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

best_knn_fit <- last_fit(best_knn_wflow, diamonds_split)

collect_metrics(best_knn_fit)

But when I try to fit the best model on the training set and apply it to the test set, I run into problems. The following two lines give me this error: "Error in step_log(): ! The following required column is missing from new_data in step 'log_mUSAb': price. Run rlang::last_trace() to see where the error occurred."

# Predict Manually

f1 = fit(best_knn_wflow,diamonds_train)
p1 = predict(f1,new_data=diamonds_test)

Solution

  • This problem is related to the question "log transform outcome variable in tidymodels workflow".

    For log transformations of the outcome, we strongly recommend that the transformation be done before you pass the data to recipe(). This is because you are not guaranteed to have an outcome when predicting on new data (which is what happens when you last_fit() a workflow), and without it the recipe fails.

    You are seeing this here because when you predict() on a workflow() object, it only passes the predictors to the recipe, as they are all that it needs. That is why step_log() cannot find price.

    Since a log transformation isn't a learned transformation, you can safely do it beforehand.

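    # Log the outcome up front, and drop step_log(all_outcomes()) from the recipe
    diamonds_train$price <- log(diamonds_train$price)
    
    # New data may not contain the outcome at all
    if (!is.null(diamonds_test$price)) {
      diamonds_test$price <- log(diamonds_test$price)
    }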
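
As a fuller illustration of the advice above, here is a minimal sketch that assumes you log price once, before splitting, and remove step_log(all_outcomes()) from the recipe. The names diamonds_log, new_obs, and p1_dollars are made up for this example; everything else follows the question's code. It needs the kknn package installed.

```r
library(tidymodels)

set.seed(1234)

# Log the outcome up front, before splitting, so that the training set,
# the test set, and any resamples all carry price on the log scale.
# (diamonds_log is a hypothetical name, not from the question.)
diamonds_log <- ggplot2::diamonds %>%
  dplyr::slice_sample(n = 2000) %>%
  dplyr::mutate(price = log(price))

diamonds_split <- initial_split(diamonds_log, prop = 0.80, strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test  <- testing(diamonds_split)

# Same recipe as in the question, minus step_log(all_outcomes()).
knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

best_knn_spec <-
  nearest_neighbor(mode = "regression", neighbors = 10, engine = "kknn")

best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

f1 <- fit(best_knn_wflow, diamonds_train)

# The recipe no longer touches the outcome, so predict() works even when
# new_data has no price column at all.
new_obs <- dplyr::select(diamonds_test, -price)
p1 <- predict(f1, new_data = new_obs)

# Predictions are on the log scale; exp() maps them back to the original
# price scale.
p1_dollars <- dplyr::mutate(p1, .pred_price = exp(.pred))
```

Logging before the split also means that tune_grid() and last_fit(best_knn_wflow, diamonds_split) compute their metrics on the log scale, which matches the scale the model was trained on.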