tidymodels error: predict() asks for the target variable


I have trained a churn model with tidymodels on customer data (more than 200 columns). I got fairly good metrics using xgboost, but the issue arises when trying to predict on new data.

The predict function asks for the target variable (churn), and I am a bit confused because this variable is not supposed to be present in real-world data: it is the variable I want to predict.

Sample code is below; maybe I missed a point in the procedure. Some questions arose:

  1. Should I execute prep() at the end of the recipe?

  2. Should I apply the recipe to my new data prior to predict()?

  3. Why does removing the recipe lines that involve the target variable make predict() work?

  4. Why is it asking for my target variable?

    churn_recipe <- recipes::recipe(churn ~ ., data = churn_train) %>%
      recipes::step_naomit(everything(), skip = TRUE) %>%
      recipes::step_rm(c(v1, v2, v3, v4, v5, v6)) %>%
      # removing/commenting the next 2 lines makes predict() work
      recipes::step_string2factor(churn) %>%
      themis::step_downsample(churn) %>%
      recipes::step_dummy(all_nominal_predictors()) %>%
      recipes::step_novel(all_nominal(), -all_outcomes())  ### %>% prep()

    xgboost_model <-
      parsnip::boost_tree(
        mode = "classification",
        trees = 100
      ) %>%
      set_engine("xgboost") %>%
      set_mode("classification")

    xgboost_workflow <-
      workflows::workflow() %>%
      add_recipe(churn_recipe) %>%
      add_model(xgboost_model)

    my_fit <- last_fit(xgboost_workflow, churn_split)

    collect_metrics(my_fit)

    churn_wf_model <- my_fit$.workflow[[1]]

    predict(churn_wf_model, new_data[1, ])
    #> Error: Can't subset columns that don't exist.
    #> x Column `churn` doesn't exist.
    

I am pretty sure there are some misconceptions on my side, but I am unable to solve this issue.

I am stuck moving my model into production. The tidymodels documentation is sorely lacking on this topic.


Solution

  • You are getting this error because of recipes::step_string2factor(churn)

    This step works fine when you are training the model. But when it is time to apply the same transformation to the testing set or to new data, step_string2factor() complains because it is asked to turn churn from a string to a factor but the dataset doesn't include the churn variable. You can deal with this in two ways.

    skip = TRUE in step_string2factor() (less favorable)

    By setting skip = TRUE in step_string2factor() you are telling the step to be applied only when prepping/training the recipe, not when baking new data. This is less favorable because it can produce errors in certain resampling scenarios using {tune}, where the response is expected to be a factor instead of a string.

    library(tidymodels)
    
    data("mlc_churn")
    
    set.seed(1234)
    churn_split <- initial_split(mlc_churn)
    
    churn_train <- training(churn_split)
    churn_test <- testing(churn_split)
    
    
    churn_recipe <- recipes::recipe(churn ~ ., data = churn_train) %>%
       recipes::step_naomit(everything(), skip = TRUE) %>% 
       recipes::step_string2factor(churn, skip = TRUE) %>%  
       themis::step_downsample(churn) %>%
       recipes::step_dummy(all_nominal_predictors()) %>% 
       recipes::step_novel(all_nominal(), -all_outcomes())
    
    xgboost_model <-
      parsnip::boost_tree(
        mode = "classification",
        trees = 100
      ) %>%
      set_engine("xgboost") %>% 
      set_mode("classification")
    
    xgboost_workflow <-
      workflows::workflow() %>%
      add_recipe(churn_recipe) %>% 
      add_model(xgboost_model) 
    
    my_fit <- last_fit(xgboost_workflow, churn_split)
    
    churn_wf_model <- my_fit$.workflow[[1]]
    
    predict(churn_wf_model, churn_test)
    #> # A tibble: 1,250 x 1
    #>    .pred_class
    #>    <fct>      
    #>  1 no         
    #>  2 no         
    #>  3 no         
    #>  4 no         
    #>  5 no         
    #>  6 no         
    #>  7 no         
    #>  8 no         
    #>  9 no         
    #> 10 yes        
    #> # … with 1,240 more rows
    

    Created on 2021-06-10 by the reprex package (v2.0.0)
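
To isolate what skip = TRUE changes, here is a minimal sketch that uses only {recipes}. The toy data frames `train` and `new_customers` and their columns `x` and `y` are hypothetical and are not part of the churn example. The first recipe should fail at bake time because the outcome column is missing. The second skips the step at bake time, so the same new data goes through.

```r
library(recipes)

# Hypothetical toy data: `y` is the outcome, stored as a string
train <- data.frame(
  x = c(1, 2, 3, 4),
  y = c("no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# New data as it would arrive in production: no outcome column
new_customers <- data.frame(x = c(5, 6))

# Without skip = TRUE the step also runs in bake(), where `y` does not exist
rec_no_skip <- recipe(y ~ x, data = train) %>%
  step_string2factor(y) %>%
  prep()

# Capture the error instead of stopping the script
no_skip <- tryCatch(
  bake(rec_no_skip, new_data = new_customers),
  error = function(e) e
)
inherits(no_skip, "error")

# With skip = TRUE the step runs during prep() but is skipped in bake()
rec_skip <- recipe(y ~ x, data = train) %>%
  step_string2factor(y, skip = TRUE) %>%
  prep()

baked <- bake(rec_skip, new_data = new_customers)
baked
```

A workflow does the same thing internally: predict() bakes the new data with the trained recipe, which is why a non-skipped step on the outcome fails there.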

    Make response a factor before splitting (recommended)

    The recommended way to fix this issue is to make sure that your response churn is a factor before you pass it into {recipes}. I find it easiest to do this when creating the training/testing split with initial_split(), like so. Then you don't need to use step_string2factor() on your response in your recipe.

    churn_split <- mlc_churn %>%
      mutate(churn = factor(churn)) %>%
      initial_split()
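
Putting the recommended approach together, the following is a sketch rather than a definitive pipeline. It assumes the mlc_churn data from {modeldata}, fits the workflow directly with fit() instead of last_fit(), and simulates production data by dropping churn from the test set (the name `new_customers` is hypothetical). It also places step_novel() before step_dummy(), which is the usual ordering so that unseen levels are handled before dummy variables are created.

```r
library(tidymodels)
library(themis)

data("mlc_churn", package = "modeldata")

set.seed(1234)
# Make sure the response is a factor before splitting
churn_split <- mlc_churn %>%
  mutate(churn = factor(churn)) %>%
  initial_split()

churn_train <- training(churn_split)
churn_test <- testing(churn_split)

# No step_string2factor() on the outcome is needed any more.
# step_downsample() has skip = TRUE by default, so it only runs in training.
churn_recipe <- recipe(churn ~ ., data = churn_train) %>%
  step_downsample(churn) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

xgboost_model <- boost_tree(trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# fit() preps the recipe and trains the model in one go
churn_wf_model <- workflow() %>%
  add_recipe(churn_recipe) %>%
  add_model(xgboost_model) %>%
  fit(data = churn_train)

# Simulated production data: the outcome column is absent
new_customers <- churn_test %>% select(-churn)

preds <- predict(churn_wf_model, new_customers)
```

This also covers the questions about prep(): when the recipe lives inside a workflow, fit() preps it and predict() bakes the new data, so neither prep() nor bake() has to be called by hand.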