I have trained a churn model with tidymodels on customer data (more than 200 columns). I got fairly good metrics using xgboost, but the problem appears when I try to predict on new data.
The predict function asks for the target variable (churn), which confuses me: this variable is not supposed to be present in real-world data, since it is the variable I want to predict.
Sample code is below; maybe I missed the point of the procedure. Some questions arose:
Should I execute prep() at the end of the recipe?
Should I apply the recipe to my new data before calling predict()?
Why does removing the recipe lines that reference the target variable make predict() work?
Why is predict() asking for my target variable?
churn_recipe <- recipes::recipe(churn ~ ., data = churn_train) %>%
  recipes::step_naomit(everything(), skip = TRUE) %>%
  recipes::step_rm(c(v1, v2, v3, v4, v5, v6)) %>%
  # removing/commenting the next 2 lines makes predict() work
  recipes::step_string2factor(churn) %>%
  themis::step_downsample(churn) %>%
  recipes::step_dummy(all_nominal_predictors()) %>%
  recipes::step_novel(all_nominal(), -all_outcomes()) ### %>% prep()
xgboost_model <-
  parsnip::boost_tree(
    mode = "classification",
    trees = 100
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
xgboost_workflow <-
  workflows::workflow() %>%
  add_recipe(churn_recipe) %>%
  add_model(xgboost_model)
my_fit <- last_fit(xgboost_workflow, churn_split)
collect_metrics(my_fit)
churn_wf_model <- my_fit$.workflow[[1]]
predict(churn_wf_model, new_data[1,])
Error: Can't subset columns that don't exist.
x Column `churn` doesn't exist.
I am pretty sure I have some misconceptions, but I am unable to solve this issue.
I am stuck moving my model into production. The tidymodels documentation is sorely lacking on this topic.
You are getting this error because of recipes::step_string2factor(churn).
This step works fine when you are training the model. But when it is time to apply the same transformation to new data, step_string2factor() complains because it is asked to turn churn from a string into a factor, but the dataset doesn't include the churn variable. You can deal with this in two ways.
skip = TRUE in step_string2factor() (less favorable)
By setting skip = TRUE in step_string2factor(), you are telling the step to be applied only when prepping/training the recipe, so it is skipped when predicting on new data. This is less favorable because it can produce errors in certain resampling scenarios using {tune}, where the response is expected to be a factor instead of a string.
library(tidymodels)
data("mlc_churn")
set.seed(1234)
churn_split <- initial_split(mlc_churn)
churn_train <- training(churn_split)
churn_test <- testing(churn_split)
churn_recipe <- recipes::recipe(churn ~ ., data = churn_train) %>%
  recipes::step_naomit(everything(), skip = TRUE) %>%
  recipes::step_string2factor(churn, skip = TRUE) %>%
  themis::step_downsample(churn) %>%
  recipes::step_dummy(all_nominal_predictors()) %>%
  recipes::step_novel(all_nominal(), -all_outcomes())
xgboost_model <-
  parsnip::boost_tree(
    mode = "classification",
    trees = 100
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
xgboost_workflow <-
  workflows::workflow() %>%
  add_recipe(churn_recipe) %>%
  add_model(xgboost_model)
my_fit <- last_fit(xgboost_workflow, churn_split)
churn_wf_model <- my_fit$.workflow[[1]]
predict(churn_wf_model, churn_test)
#> # A tibble: 1,250 x 1
#> .pred_class
#> <fct>
#> 1 no
#> 2 no
#> 3 no
#> 4 no
#> 5 no
#> 6 no
#> 7 no
#> 8 no
#> 9 no
#> 10 yes
#> # … with 1,240 more rows
Created on 2021-06-10 by the reprex package (v2.0.0)
The recommended way to fix this issue is to make sure that your response churn is a factor before you pass it into {recipes}. I find it easiest to do this as I create the training/testing split with initial_split(), like so. Then you don't need to use step_string2factor() on your response in your recipe.
churn_split <- mlc_churn %>%
mutate(churn = factor(churn)) %>%
initial_split()
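For completeness, here is a minimal end-to-end sketch of this recommended approach, ending with a prediction on data that has no churn column at all. It makes a few assumptions that are not in the code above: the recipe is trimmed down to the steps relevant here, step_novel() is placed before step_dummy() (the usual ordering so that unseen levels are handled before dummy encoding), the fitted workflow is pulled out with extract_workflow() rather than my_fit$.workflow[[1]], and new_customers is a hypothetical stand-in for real production data. In mlc_churn the churn column is already a factor, so the mutate() call only matters for data where it arrives as a string.

```r
library(tidymodels)
library(themis)

data("mlc_churn")
set.seed(1234)

# Convert the response to a factor BEFORE splitting, outside the recipe
churn_split <- mlc_churn %>%
  mutate(churn = factor(churn)) %>%
  initial_split()
churn_train <- training(churn_split)
churn_test <- testing(churn_split)

# No step touches churn at prediction time:
# step_downsample() is skipped on new data by default (skip = TRUE)
churn_recipe <- recipe(churn ~ ., data = churn_train) %>%
  step_downsample(churn) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

xgboost_model <- boost_tree(trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgboost_workflow <- workflow() %>%
  add_recipe(churn_recipe) %>%
  add_model(xgboost_model)

my_fit <- last_fit(xgboost_workflow, churn_split)
churn_wf_model <- extract_workflow(my_fit)

# Hypothetical "production" data: same predictors, no churn column
new_customers <- churn_test %>% select(-churn)
preds <- predict(churn_wf_model, new_customers)
```

Because the fitted workflow bundles the trained recipe with the model, there should be no need to call prep() or bake() yourself before predict(); the workflow applies the preprocessing to new_customers internally.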