Tags: r, machine-learning, regression, k-fold, tidymodels

Tidymodels: Tuning models with 10-fold cross-validation using tune_grid() in R


Overview

I have produced four models using the tidymodels package with the data frame FID (see below):

  1. General Linear Model
  2. Bagged Tree
  3. Random Forest
  4. Boosted Trees

The data frame contains three predictors:

  1. Year (numeric)
  2. Month (factor)
  3. Days (numeric)

The dependent variable is Frequency (numeric).

I originally set the regularization penalty to 0.1, a value I picked somewhat arbitrarily. My aim is to find the best value for this penalty. Because a hyperparameter cannot be learned during model training itself, I am trying to estimate the best penalty by training many models on resampled data sets and comparing how well they perform. Consequently, I am building a new model specification for model tuning.

I am following this tutorial:

https://smltar.com/mlregression.html#firstregressionevaluation

I am getting this error message:

Error: A `model` action has already been added to this workflow.

# Output of rlang::last_error()

<error/rlang_error>
A `model` action has already been added to this workflow.
Backtrace:
  1. tune::tune_grid(...)
 10. workflows::add_model(., tune_spec_glm)
 11. workflows:::add_action(x, action, "model")
 13. workflows:::add_action_impl.action_fit(x, action, name)
 14. workflows:::check_singleton(x$fit$actions, name)
 15. workflows:::glubort("A `{name}` action has already been added to this workflow.")
Run `rlang::last_trace()` to see the full context.

If anyone can help me solve this issue, I would be deeply appreciative.

Many thanks.

R-code

##Load the required packages
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # data manipulation
library(skimr)     # summary statistics
library(baguette)  # bagged trees
library(future)    # parallel processing to reduce computation time
library(xgboost)   # boosted trees
library(ranger)
library(yardstick)
library(purrr)
library(forcats)


#Split the dataset into a training set and a testing set
data_split <- initial_split(FID)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# Resample the training data with 10-fold cross-validation
cv <- vfold_cv(train_data, v = 10)

###########################################################
##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
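
To sanity-check what the recipe produces, you can prep() it on the training data and bake() it back out. This is an optional inspection step, not part of the modelling pipeline; bake(new_data = NULL) returns the processed training set:

##Optional: inspect the preprocessed training data
rec %>%
  prep(training = train_data) %>%
  bake(new_data = NULL)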
  
##########################################################
##Produce Models
##########################################################
##General Linear Models
##########################################################

##Produce the glm model; mixture = 1 corresponds to a pure lasso penalty
mod_glm <- linear_reg(mode = "regression",
                      penalty = 0.1,
                      mixture = 1) %>%
  set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_glm)

##########################################################################
## MODEL EVALUATION
## Fit the glm model once to each of the resampled folds and evaluate it
## on the held-out part of each fold to estimate how well it performs.
##########################################################################
plan(multisession)

fit_glm <- fit_resamples(
  wflow_glm,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE)
)

##Collect model predictions for each fold for the number of blue whale sightings

Predictions <- fit_glm %>%
  collect_predictions()
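
If you also want the performance summarised across the ten folds, collect_metrics() aggregates the RMSE and R-squared values over the resamples (a minimal follow-on sketch):

##Summarise RMSE and R-squared across the ten folds
fit_glm %>%
  collect_metrics()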

#######Tuning hyperparameters

##Estimate the best regularization penalty by training many models on
##resampled data sets and exploring how well those models perform

tune_spec_glm <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

tune_spec_glm

##Create a regular grid of penalty values to try, using the dials
##convenience function penalty()

lambda_grid <- grid_regular(penalty(), levels = 30)

lambda_grid
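
By default penalty() spans 10^-10 to 10^0 on a log10 scale. If you wanted to concentrate the grid on a narrower range, you could pass one explicitly; the -5 to 0 range below is purely illustrative, not a recommendation:

##Optional: a narrower grid, penalties between 1e-5 and 1 (log10 scale)
lambda_grid_narrow <- grid_regular(penalty(range = c(-5, 0)), levels = 30)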

####

tune_rs <- tune_grid(
  wflow_glm %>% add_model(tune_spec_glm),
  cv,
  grid = lambda_grid,
  control = control_resamples(save_pred = TRUE)
)

##Error message

Error: A `model` action has already been added to this workflow.
Run `rlang::last_error()` to see where the error occurred.

Data Frame - FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", 
"April", "May", "June", "July", "August", "September", "October", 
"November", "December"), class = "factor"), Frequency = c(36, 
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, 
-36L), class = "data.frame")

Solution

  • You should use update_model() instead of add_model(): wflow_glm already contains a model (mod_glm), so add_model() refuses to attach a second one, which is exactly what the error says.

    tune_rs <- tune_grid(
      wflow_glm %>% update_model(tune_spec_glm),
      cv,
      grid = lambda_grid,
      control = control_resamples(save_pred = TRUE)
    )
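
    Once tune_grid() runs, here is a hedged sketch of how you could pick the winning penalty and finalize the workflow (choosing rmse as the selection metric is my assumption; show_best(), select_best(), finalize_workflow(), and last_fit() are the standard tune/workflows helpers):

    ## Inspect the best penalties and finalize the workflow
    show_best(tune_rs, metric = "rmse")
    best_penalty <- select_best(tune_rs, metric = "rmse")

    final_wflow <- finalize_workflow(
      wflow_glm %>% update_model(tune_spec_glm),
      best_penalty
    )

    ## Fit on the full training set, evaluate on the test set
    final_fit <- last_fit(final_wflow, data_split)
    collect_metrics(final_fit)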
    

    Let me also add some general comments about your example:

    1. I modified the lines below
    train_data <- training(FID)
    test_data  <- testing(FID)
    

    to

    train_data <- training(data_split)
    test_data  <- testing(data_split)
    

    I guess this was a typo made while preparing the example for this question, because as posted it throws an error.

    2. The recipe should be trained on the training split; otherwise there would be data leakage.
       In your code it actually won't matter, because the training step, prep(), is performed inside the workflow, which uses the training data:
    rec <- recipe(Frequency ~ ., data = train_data) %>% 
    
    3. You could use Poisson regression for your problem, since the outcome is a non-negative count. In tidymodels, this is available via poissonreg::poisson_reg(): https://poissonreg.tidymodels.org/
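
    A minimal sketch of what that swap could look like, assuming the glm engine (illustrative only, not tested on your data):

    library(poissonreg)

    ## Poisson regression specification for a count outcome
    mod_pois <- poisson_reg() %>%
      set_engine("glm")

    ## Swap the model in the existing workflow
    wflow_pois <- wflow_glm %>%
      update_model(mod_pois)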