I am writing a project in Tidymodels. I have created a train and test set, set out a recipe and a model. When I call workflow(), add the recipe and model, then call fit(data = df_train), I am getting the following error.
Error:
! Can't use NA as column index with `[` at positions 3 and 4.
I am on R version 4.1.3 and RStudio 2022.02.0 Build 443.
For reproducibility, here is the workflow. Note that the data is on GitHub so you will need an internet connection to load the data.
## Load package manager
if(!require(pacman)){
  install.packages("pacman")
}
## Load required packages. Download them if they do not exist in my system.
pacman::p_load(tidyverse, kableExtra, skimr, knitr, glue, GGally,
               corrplot, tidymodels, themis, stargazer, rpart, rpart.plot,
               vip, patchwork, data.table)
The next step will load the data.
df <- fread('https://raw.githubusercontent.com/Karuitha/data_projects/master/employee_turnover/data/employee_churn_data.csv') %>%
  mutate(left = factor(left, levels = c("yes", "no")))
Next, I slice the data into a training and test set and create a recipe.
## Create a split object with 75% of the data in the training set
split_object <- initial_split(df, prop = 0.75,
                              strata = left)
## Generate the training set
df_train <- split_object %>%
  training()
## Generate the testing set
df_test <- split_object %>%
  testing()
###############################################
## Create a recipe
df_recipe <- recipes::recipe(left ~ .,
                             data = df_train) %>%
  ## We upsample the data to balance the outcome variable
  themis::step_upsample(left,
                        over_ratio = 1,
                        seed = 500) %>%
  ## We make all character variables factors
  step_string2factor(all_nominal_predictors()) %>%
  ## We remove one in a pair of highly correlated variables
  ## The threshold for removal is 0.85 (absolute)
  ## The choice of threshold is subjective.
  step_corr(all_numeric_predictors(),
            threshold = 0.85) %>%
  ## Train these steps on the training data
  prep(training = df_train)
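As an aside on the split step above: initial_split() samples rows at random, so a seed is needed for the same split to come back on every run. Below is a minimal sketch of checking a stratified split, assuming a small made-up data frame (toy_df and its columns are hypothetical stand-ins, not the GitHub data).

```r
library(rsample)

# Fix the seed so the random split is the same on every run
set.seed(123)

# Hypothetical toy data: 200 rows, 25% of them "yes"
toy_df <- data.frame(
  left   = factor(rep(c("yes", "no"), times = c(50, 150)),
                  levels = c("yes", "no")),
  tenure = rnorm(200)
)

# Same call pattern as in the question: 75% training, stratified on the outcome
toy_split <- initial_split(toy_df, prop = 0.75, strata = left)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row should land in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy_df)

# Share of each outcome class in the two sets
prop.table(table(toy_train$left))
prop.table(table(toy_test$left))
```

With strata = left, the class shares in both sets should stay close to those in the full data.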
Next, I define a model and attempt to fit.
## Define a logistic model
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
Then fit it.
workflow() %>%
  add_recipe(df_recipe) %>%
  add_model(logistic_model) %>%
  fit(data = df_train)
This is where I get the error:
Error:
! Can't use NA as column index with `[` at positions 3 and 4.
I have checked and rechecked. Any help is welcome.
I am responding to my own question.
One thing I have realised is that the problem is in the recipe step. When I replace step_string2factor with step_dummy, then everything works fine.
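A minimal sketch of that substitution, with step_dummy in place of the string-to-factor step, is shown below. It assumes a small made-up tibble (toy_df and its columns are hypothetical, not the GitHub data) with the nominal predictors already stored as factors. The recipe is also left unprepped here, which is my own assumption rather than something confirmed above: fit() on a workflow trains the recipe itself, so a separate prep() call is not needed for this path.

```r
library(tidymodels)
library(themis)

set.seed(500)

# Hypothetical toy data standing in for the employee churn data
n <- 200
toy_df <- tibble(
  left = factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.3, 0.7)),
                levels = c("yes", "no")),
  department   = factor(sample(c("sales", "it", "ops"), n, replace = TRUE)),
  salary       = factor(sample(c("low", "medium", "high"), n, replace = TRUE)),
  tenure       = rnorm(n, mean = 6, sd = 1.5),
  satisfaction = runif(n)
)

# Same steps as in the question, but with step_dummy() for the nominal
# predictors and no prep() at the end
toy_recipe <- recipe(left ~ ., data = toy_df) %>%
  step_upsample(left, over_ratio = 1, seed = 500) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.85)

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# The workflow trains the recipe and then fits the model in one call
toy_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(logistic_model) %>%
  fit(data = toy_df)

toy_fit
```

This only shows that the step_dummy variant runs end to end on toy data; it does not explain why the original recipe raised the error.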
I still do not know why this is the case. Maybe I will need to study Tidymodels more keenly!!