I'm learning tidymodels. The following code runs nicely:
library(tidyverse)
library(tidymodels)
# Draw a random sample of 2000 to try the models
set.seed(1234)
diamonds <- diamonds %>%
  sample_n(2000)
diamonds_split <- initial_split(diamonds, prop = 0.80, strata="price")
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
folds <- rsample::vfold_cv(diamonds_train, v = 10, strata="price")
metric <- metric_set(rmse,rsq,mae)
# Model KNN
knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = tune("k"),
    engine = "kknn"
  )
knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
knn_wflow <-
  workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_rec)
knn_grid = expand.grid(k=c(1,5,10,30))
knn_res <-
  tune_grid(
    knn_wflow,
    resamples = folds,
    metrics = metric,
    grid = knn_grid
  )
collect_metrics(knn_res)
autoplot(knn_res)
show_best(knn_res,metric="rmse")
# Best KNN
best_knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = 10,
    engine = "kknn"
  )
best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)
best_knn_fit <- last_fit(best_knn_wflow, diamonds_split)
collect_metrics(best_knn_fit)
But when I try to fit the best model on the training set and apply it to the test set, I run into a problem. The following two lines give me this error:
Error in step_log():
! The following required column is missing from new_data in step 'log_mUSAb': price.
Run rlang::last_trace() to see where the error occurred.
# Predict Manually
f1 = fit(best_knn_wflow,diamonds_train)
p1 = predict(f1,new_data=diamonds_test)
This problem is related to the question "log transform outcome variable in tidymodels workflow".
For log transformations of the outcome, we strongly recommend that the transformation be done before you pass the data to recipe(). This is because you are not guaranteed to have an outcome when predicting on new data (which is what happens when you last_fit() a workflow), and the recipe then fails.
You are seeing this here because when you predict on a workflow() object, it only passes the predictors, as that is all it needs. The step_log() step then cannot find price, hence the error.
Since a log transformation isn't a learned transformation, you can safely apply it before the recipe:
diamonds_train$price <- log(diamonds_train$price)
if (!is.null(diamonds_test$price)) {
  diamonds_test$price <- log(diamonds_test$price)
}
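Putting this advice together with the question's code, here is a minimal sketch of how the whole flow might look. It assumes the kknn package is installed. The names diamonds_log and p1_dollars are made up for illustration, and slice_sample() stands in for sample_n(). Here the log is applied before initial_split() rather than to the training and test frames afterwards, so that diamonds_split itself holds the logged price and last_fit() would see the same scale.

```r
library(tidymodels)  # attaches dplyr, ggplot2 (for diamonds), recipes, parsnip, workflows, rsample

set.seed(1234)

# Log the outcome up front, outside the recipe (diamonds_log is an illustrative name)
diamonds_log <- ggplot2::diamonds %>%
  slice_sample(n = 2000) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_log, prop = 0.80, strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test  <- testing(diamonds_split)

# Same recipe as in the question, minus step_log(): only predictor steps remain
knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

best_knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = 10,
    engine = "kknn"
  )

best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

# Manual fit and predict no longer need the outcome column at predict time
f1 <- fit(best_knn_wflow, diamonds_train)
p1 <- predict(f1, new_data = diamonds_test)

# Predictions are on the log scale; exp() maps them back to the original price scale
p1_dollars <- exp(p1$.pred)
```

Any metrics computed from this workflow (for example via last_fit() or tune_grid()) are then on the log scale, because the model only ever sees the logged price.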