r machine-learning tidyverse tidymodels

Problem Fitting Model in Tidymodels: Error: ! Can't use NA as column index with `[` at positions 3 and 4


I am writing a project in Tidymodels. I have created a training and test set, and set up a recipe and a model. When I call workflow(), add the recipe and model, and then call fit(data = df_train), I get the following error.

Error:
! Can't use NA as column index with `[` at positions 3 and 4.

I am on R version 4.1.3 and RStudio 2022.02.0 Build 443.

For reproducibility, here is the workflow. Note that the data is on GitHub so you will need an internet connection to load the data.

## Load package manager

if (!require(pacman)) {
  install.packages("pacman")
}

## Load required packages. Download them if they do not exist in my system.

pacman::p_load(tidyverse, kableExtra, skimr, knitr, glue, GGally,
               corrplot, tidymodels, themis, stargazer, rpart, rpart.plot,
               vip, patchwork, data.table)

The next step will load the data.

df <- fread('https://raw.githubusercontent.com/Karuitha/data_projects/master/employee_turnover/data/employee_churn_data.csv') %>%
  mutate(left = factor(left, levels = c("yes", "no")))

Next, I slice the data into a training and test set and create a recipe.

## Create a split object consisting of 75% of the data
split_object <- initial_split(df, prop = 0.75,
                              strata = left)

## Generate the training set
df_train <- split_object %>%
  training()

## Generate the testing set
df_test <- split_object %>%
  testing()

###############################################
## Create a recipe
df_recipe <- recipes::recipe(left ~ .,
                             data = df_train) %>%
  ## We upsample the data to balance the outcome variable
  themis::step_upsample(left,
                        over_ratio = 1,
                        seed = 500) %>%
  ## We make all character variables factors
  step_string2factor(all_nominal_predictors()) %>%
  ## We remove one in a pair of highly correlated variables
  ## The threshold for removal is 0.85 (absolute)
  ## The choice of threshold is subjective.
  step_corr(all_numeric_predictors(),
            threshold = 0.85) %>%
  ## Train these steps on the training data
  prep(training = df_train)

Next, I define a model and attempt to fit.

## Define a logistic model
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

Then fit it.

workflow() %>%
  add_recipe(df_recipe) %>%
  add_model(logistic_model) %>%
  fit(data = df_train)

This is where I get the error:

Error:
! Can't use NA as column index with `[` at positions 3 and 4.

I have checked and rechecked. Any help is welcome.


Solution

  • I am responding to my own question.

    One thing I have realised is that the problem is in the recipe step. When I replace step_string2factor with step_dummy, everything works fine.

    I still do not know why this is the case. Maybe I will need to study Tidymodels more keenly!!
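
The self-answer leaves the cause open, and the sketch below does not claim to settle it. It only illustrates the working arrangement described above (step_dummy in place of step_string2factor) on a small made-up data set, so that it runs without the GitHub download. One further change is an assumption on my part rather than something taken from the question: the recipe is passed to the workflow untrained, because the workflows documentation for add_recipe() says the recipe should not already have been trained with prep(). The names df_toy, train_toy, test_toy, toy_recipe, toy_fit, toy_preds and baked_toy, and the toy columns themselves, are hypothetical.

```r
library(tidymodels)
library(themis)

set.seed(500)

## Hypothetical toy data: an imbalanced two-level outcome,
## two character predictors and two numeric predictors.
n <- 400
df_toy <- tibble(
  left = factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.3, 0.7)),
                levels = c("yes", "no")),
  department = sample(c("sales", "ops", "it"), n, replace = TRUE),
  salary = sample(c("low", "medium", "high"), n, replace = TRUE),
  tenure = rnorm(n, mean = 6, sd = 2),
  satisfaction = runif(n)
)

split_toy <- initial_split(df_toy, prop = 0.75, strata = left)
train_toy <- training(split_toy)
test_toy <- testing(split_toy)

## The recipe is left untrained here: no prep() at the end of the pipe.
toy_recipe <- recipe(left ~ ., data = train_toy) %>%
  step_upsample(left, over_ratio = 1, seed = 500) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.85)

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

## fit() trains the recipe and the model together on the training set.
toy_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(logistic_model) %>%
  fit(data = train_toy)

## Predictions on the held-out rows; the workflow applies the recipe itself.
toy_preds <- predict(toy_fit, new_data = test_toy)

## To look at the preprocessed training data, prep() and bake() a copy
## of the recipe separately, leaving toy_recipe itself untrained.
baked_toy <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)
```

Calling count(baked_toy, left) afterwards is one way to confirm that the upsampling step balanced the outcome classes in the training data. Whether removing the trailing prep() alone would also clear the original error with step_string2factor is not something this sketch tests.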