R Tidymodels classification with Random forest: ERROR while predicting target variable


I have a data set with 90 variables and 200,000 observations. It is unbalanced: the target variable is 1 in only 4% of cases and 0 in all others.

I split it into 2 sets: a fitting sample (185,000 obs.) and a holdout sample "df_holdout" (15,000 obs.). For model fitting I took from the fitting sample all cases where the target variable = 1 and the same number of cases where the target variable = 0 (in total the set "df" included 25,000 obs.).

Variables have names var_01, var_02, var_03, ... var_90, where var_90 was renamed to "target".

I have a stack of workflows.

This is the code that I use for model fitting:

rf_tune <- parsnip::rand_forest(mode = "classification",
                                mtry = tune(),
                                trees = 1000,
                                min_n = tune()) %>%
  set_engine("ranger", importance = "impurity")

svm_tune <- parsnip::svm_poly(mode = "classification",
                              engine = "kernlab",
                              cost = tune(),
                              degree = tune(),
                              scale_factor = tune(),
                              margin = tune())

  

  # Create data split object
  df_split <- initial_split(df, prop = 0.75,
                            strata = target)
  
  # Create the training data
  df_train <- df_split %>% 
    training()
 
  df_test <- df_split %>% 
    testing()
   
  # create a recipe
  df_recipe <- recipe(target ~., data = df_train) %>% 
    step_zv(all_predictors()) %>%
    step_normalize(all_numeric()) %>% 
    step_corr(all_numeric_predictors(), threshold = 0.7) %>% 
    step_dummy(all_nominal_predictors(), -all_outcomes())
  
  df_recipe %>% 
    prep(df_train) %>% 
    bake(df_train)

all_models_set <- 
    workflow_set(preproc = list(df_recipe = df_recipe),
                 models =  list(rf_tune,
                                svm_tune),
                 cross = TRUE)

set.seed(123)

  cv <- vfold_cv(df_train, v = 5, repeats = 1, strata = target) 
  
  df_metr <- metric_set(accuracy, roc_auc,sens,spec)
  
  
  all_models <-
    all_models_set %>%
    workflow_map("tune_grid",
                 resamples = cv,
                 grid = 10,
                 control = control_grid(save_pred = TRUE, save_workflow = TRUE, verbose = TRUE),
                 metrics = df_metr
    )
  
 
 # Get the workflow ID for the top model from our workflow set
  best_tuned_workflow <-
    rank_results(all_models, rank_metric = "roc_auc", select_best = TRUE) %>% 
    filter(.metric=="roc_auc" & rank==1)
  
  
  final_model <-
    extract_workflow_set_result(all_models, pull(best_tuned_workflow, wflow_id)) %>% 
    select_best(metric = "roc_auc") 
  
  
  # Fit final model on Train and predict on Test set
  final_model_pred <- 
    extract_workflow(all_models, pull(best_tuned_workflow, wflow_id)) %>% # extract the workflow
    finalize_workflow(final_model) %>% 
    last_fit(df_split) # fit the model on Train and score on Test
  
  # final workflow extraction
  wf_final_model <- extract_workflow(final_model_pred)

After I created a model and trained the workflow (wf_final_model), I saved it and wanted to use it for prediction on a holdout sample. However, when I tried to do so, I got an error message:

predict(wf_final_model, df_holdout)

Error: Missing data in columns: var_02_X4, var_02_X7, var_02_X9, var_02_X10, var_02_X11, var_02_X12, var_02_X13, var_02_X15, var_02_X17, var_02_X18, var_02_X20, var_02_X21, var_02_X22, var_02_X23, var_02_X24, var_02_X25, var_02_X26, var_02_X27, var_02_X28, var_02_X29, var_02_X30, var_02_X31, var_02_X33, var_02_X34, var_30_X2, var_30_X3, var_30_X6, var_30_X7, var_30_X9, var_30_X11, var_30_X13, var_30_X14, var_30_X15, var_30_X16, var_30_X17, var_30_X18, var_30_X19, var_30_X20, var_30_X22, var_30_X23, var_30_X24, var_30_X25, var_30_X26, var_30_X27, var_30_X33, var_30_X43, var_30_X46, var_30_X48, var_30_X49, var_30_X51, var_30_X56, var_30_X57, var_30_X60, var_36_X14, var_36_X18, var_36_X21, var_36_X24, var_36_X28, var_36_X29, var_36_X32, var_36_X44, var_36_X57, var_36_X61, var_36_X63, var_36_X85, var_36_X125, var_36_X130, var_36_X136, var_36_X144, var_36_X147, var_36_X148, var_36_X166, var_36_X169, var_36_X171, var_89_X3, var_89_X4, var_89_X5, var_89_X6, var_89_X7, var_89_X8, var_89_X9, va
In addition: Warning messages:
1: Novel levels found in column 'var_02': '2', '5'. The levels have been removed, and values have been coerced to 'NA'. 
2: Novel levels found in column 'var_30': '39', '41', '42', '47', '54'. The levels have been removed, and values have been coerced to 'NA'. 
3: Novel levels found in column 'var_36': '118'. The levels have been removed, and values have been coerced to 'NA'. 
4: Novel levels found in column 'var_89': '2'. The levels have been removed, and values have been coerced to 'NA'. 
5: There are new levels in a factor: NA 
6: There are new levels in a factor: NA 
7: There are new levels in a factor: NA 
8: There are new levels in a factor: NA 

I don't have any variables with such names in the training, test or holdout set. As I understand it, such variables depict interactions, but I am not sure how to handle this. Can you please help me fix the error so that I can get the predictions?


Solution

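  • The variable names you are seeing, such as var_02_X4, var_02_X7, var_02_X9 and var_02_X10, were created by step_dummy(), i.e. var_02 had the levels 4, 7, 9, 10 and so on.

    One way to deal with this issue is to add step_unknown() before step_dummy(). As the warnings show, levels that were not seen during training are coerced to NA; step_unknown() assigns those missing values to a dedicated "unknown" level, so step_dummy() no longer produces missing values.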

      # create a recipe
      df_recipe <- recipe(target ~., data = df_train) %>% 
        step_zv(all_predictors()) %>%
        step_normalize(all_numeric()) %>% 
        step_corr(all_numeric_predictors(), threshold = 0.7) %>% 
        step_unknown(all_nominal_predictors()) %>%
        step_dummy(all_nominal_predictors()) 
    

    You don't need -all_outcomes(), as all_nominal_predictors() doesn't select outcomes.
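
    To illustrate the effect of step_unknown(), here is a minimal sketch on hypothetical toy data (the data frames `train` and `holdout` are made up for illustration and are not the asker's data). It assumes, as the warnings in the question indicate, that a level unseen during training has already been coerced to NA by the time the recipe is applied, and mimics that by placing an NA in the new data directly.

```r
library(recipes)

# Hypothetical toy training data: one factor predictor, two-class outcome
train <- data.frame(
  var_02 = factor(c("4", "7", "9", "4", "7", "9")),
  target = factor(c("0", "1", "0", "1", "0", "1"))
)

# Hypothetical new data: the second row stands in for a novel level
# that has already been coerced to NA (assumption, see lead-in)
holdout <- data.frame(
  var_02 = factor(c("4", NA), levels = c("4", "7", "9")),
  target = factor(c("0", "1"), levels = c("0", "1"))
)

rec <- recipe(target ~ ., data = train) %>%
  step_unknown(all_nominal_predictors()) %>%  # NA -> "unknown" level
  step_dummy(all_nominal_predictors())        # factor -> indicator columns

# Estimate the recipe on the training data, then apply it to the new data
baked <- rec %>%
  prep() %>%
  bake(new_data = holdout)

names(baked)
baked
```

    In this sketch the NA row is flagged in a var_02_unknown column instead of producing missing values in the other dummy columns, which is the situation the "Missing data in columns" error complains about.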