Search code examples
rregressiondecision-treeglmtidymodels

Tidymodel Package: General linear models (glm) and decision tree (bagged trees, boosted trees, and random forest) models in R


Issue

I am attempting to undertake an analysis using the Tidymodels Package in R. I am following this tutorial below regarding decision tree learning in R:-

Tutorial

https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/

I have a data frame called FID (see below) where the dependent variable is the frequency (numeric), and the predictor variables are:- Year (numeric), Month (factor), Monsoon (factor), and Days (numeric).

I believe I have successfully followed the tutorial named "Tidymodels: Decision Tree Learning in R" by building a bagged tree, random forest, and boosted tree model.

For this analysis, I would also like to construct a general linear model (glm) in order to make model comparisons between all models (i.e the random forest, bagged tree, boosted tree, and general linear models) to establish the best model fit. All models are subject to 10-fold cross-validation to decrease the bias of overfitting.

Problem

Subsequently, I have attempted to adapt the code (please see below) from the tutorial to fit a glm model, but I am confused whether I have tuned the model appropriately. I am unsure if this element of glm R-code is producing the error message below when I am attempting to produce the rmse values after the models have all been fit:-

Error message

Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.

In addition, I am unsure if the R code expressed in the recipe() function for these models is adequate or correct, which is very important during the processing steps before fitting each model. From my perspective, I was wondering if the recipe for the models could be improved.

If this is possible, I was wondering if anyone could please help me regarding the error message when fitting the glm model, in conjunction with correcting the recipe (if this is necessary).

I am not an advanced R coder, and I have made a thorough attempt to try and find a solution by researching other Tidymodel tutorials; but, I do not understand what this error message means. Therefore, if anyone is able to help, I would like to express my deepest appreciation.

Many thanks in advance.

R-Code

##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)

###########################################################
# Put 3/4 of the data into the training set
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(Tidy_df, prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe
##Preprocessing
############################################################

rec <- recipe(Frequency ~ ., data = fid_df) %>% 
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>% 
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>% 
            bind_rows(collect_metrics(fit_glm)) %>% 
              dplyr::filter(.metric == "rmse") %>% 
                dplyr::mutate(model = c("bag", "rf", "boost")) %>% 
                 dplyr::select(model, everything()) %>% 
                    knitr::kable()

####Error message

Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.
Run `rlang::last_error()` to see where the error occurred.

#####################################################
##Out-of-sample performance
#####################################################

# bagged trees
final_fit_bag <- last_fit(
                     wflow_bag,
                       split = split)
# random forest
final_fit_rf <- last_fit(
                  wflow_rf,
                    split = split)
# boosted trees
final_fit_boost <- last_fit(
                      wflow_boost,
                          split = split)

Data Frame - FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Monsoon = structure(c(2L,
2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon",
"Second_Inter_Monsoon", "South_Monsson"), class = "factor"),
    Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8,
    33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37,
    41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31,
    28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30,
    7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26,
    29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")

Solution

  • I believe the error from fitting the linear model is coming from how Month and Monsoon are related to each other. They are perfectly correlated:

    library(tidyverse) 
    
    fid_df <- structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
                                      2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
                                      2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
                                      2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 
                                                                                                     2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 
                                                                                                     5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
                                                                                                     8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", 
                                                                                                                                        "April", "May", "June", "July", "August", "September", "October", 
                                                                                                                                        "November", "December"), class = "factor"), Monsoon = structure(c(2L, 
                                                                                                                                                                                                          2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 
                                                                                                                                                                                                          4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 
                                                                                                                                                                                                          3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon", 
                                                                                                                                                                                                                                  "Second_Inter_Monsoon", "South_Monsson"), class = "factor"), 
                             Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 
                                           33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37, 
                                           41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31, 
                                                                                           28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30, 
                                                                                           7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26, 
                                                                                           29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")
    
    
    fid_df %>%
      count(Month, Monsoon)
    #>        Month              Monsoon n
    #> 1    January        North_Monsoon 3
    #> 2   February        North_Monsoon 3
    #> 3      March First_Inter_Monssoon 3
    #> 4      April First_Inter_Monssoon 3
    #> 5        May        South_Monsson 3
    #> 6       June        South_Monsson 3
    #> 7       July        South_Monsson 3
    #> 8     August        South_Monsson 3
    #> 9  September        South_Monsson 3
    #> 10   October Second_Inter_Monsoon 3
    #> 11  November Second_Inter_Monsoon 3
    #> 12  December        North_Monsoon 3
    

    If you use variables like this in a linear model, the model cannot find estimates for both sets of coefficients:

    lm(Frequency ~ ., data = fid_df) %>% summary()
    #> 
    #> Call:
    #> lm(formula = Frequency ~ ., data = fid_df)
    #> 
    #> Residuals:
    #>      Min       1Q   Median       3Q      Max 
    #> -15.0008  -3.9357   0.6564   2.9769  12.7681 
    #> 
    #> Coefficients: (3 not defined because of singularities)
    #>                               Estimate Std. Error t value Pr(>|t|)  
    #> (Intercept)                 -7286.9226  3443.9292  -2.116   0.0459 *
    #> Year                            3.6155     1.7104   2.114   0.0461 *
    #> MonthFebruary                  -3.2641     6.6172  -0.493   0.6267  
    #> MonthMarch                      0.1006     6.5125   0.015   0.9878  
    #> MonthApril                      0.4843     6.5213   0.074   0.9415  
    #> MonthMay                       -4.0308    11.0472  -0.365   0.7187  
    #> MonthJune                       1.0135    15.5046   0.065   0.9485  
    #> MonthJuly                      -2.6910    13.6106  -0.198   0.8451  
    #> MonthAugust                    -4.9307     6.6172  -0.745   0.4641  
    #> MonthSeptember                 -1.7105     7.1126  -0.240   0.8122  
    #> MonthOctober                   -7.6981     6.5685  -1.172   0.2538  
    #> MonthNovember                  -8.7484     6.5493  -1.336   0.1953  
    #> MonthDecember                  -1.6981     6.5685  -0.259   0.7984  
    #> MonsoonNorth_Monsoon                NA         NA      NA       NA  
    #> MonsoonSecond_Inter_Monsoon         NA         NA      NA       NA  
    #> MonsoonSouth_Monsson                NA         NA      NA       NA  
    #> Days                            1.1510     0.4540   2.535   0.0189 *
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> Residual standard error: 7.968 on 22 degrees of freedom
    #> Multiple R-squared:  0.8135, Adjusted R-squared:  0.7033 
    #> F-statistic: 7.381 on 13 and 22 DF,  p-value: 2.535e-05
    

    Created on 2020-11-18 by the reprex package (v0.3.0.9001)

    Since you have this info, I would recommend using some domain knowledge to decide whether to use Month or Monsoon in the model but not both.