Search code examples
rggplot2regressiondecision-treetidymodels

Tidymodels Package: Visualising a random forest model using ggplot() to show the most important predictors


Overview

I am following a tutorial (see below) to find the best fit models from bagged trees, random forests, boosted trees, and general linear models.

Tutorial (see examples below)

https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/

Issue

In this case, I would like to explore the data further and visualise the most important predictors (see diagram below) for my data in the random forest model.

My data frame is called FID and the predictors in the random forest model involve:

  1. Year (numeric)
  2. Month (Factor)
  3. Days (numeric)

The dependent variable is Frequency (numeric)

When I try to run the plot to visualise the most important predictor, I keep on getting this error message:-

Error: Problem with `mutate()` input `oob_rmse`.
x non-numeric argument to mathematical function
ℹ Input `oob_rmse` is `map_dbl(fit, ~sqrt(.x$prediction.error))`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)

If anyone has any advice on how to fix the error message, I would be deeply appreciative.

Many thanks in advance

Examples of how to produce the plot from the R-code in the tutorial

enter image description here enter image description here

Visualise the model

enter image description here

Plot to show the most important predictors from the R-code in the tutorial

enter image description here

My R-code

##Open libraries
library(tidymodels)
library(parsnip)
library(forcats)
library(ranger)
library(baguette)

###########################################################
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

 # resample the data with 10-fold cross-validation (10-fold by default)
  cv <- vfold_cv(train_data, v=3)
###########################################################

##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###################################################################################


    ###################################################
    ##Random forests
    ###################################################
    
    mod_rf <-rand_forest(trees = 1e3) %>%
                                  set_engine("ranger",
                                  num.threads = parallel::detectCores(), 
                                  importance = "permutation", 
                                  verbose = TRUE) %>% 
                                  set_mode("regression") 
                                  
    ##Create Workflow
    
    wflow_rf <- workflow() %>% 
                   add_model(mod_rf) %>% 
                         add_recipe(rec)
    
    ##Fit the model
    
    plan(multisession)
    
    fit_rf<-fit_resamples(
                 wflow_rf,
                 cv,
                 metrics = metric_set(rmse, rsq),
                 control = control_resamples(save_pred = TRUE,
                                             extract = function(x) extract_model(x)))
    
    
    # extract roots
    rf_tree_roots <- function(x){
                         map_chr(1:1000, 
                            ~ranger::treeInfo(x, tree = .)[1, "splitvarName"])
                                }
    
    rf_roots <- function(x){
                           x %>% 
                            dplyr::select(.extracts) %>% 
                            unnest(cols = c(.extracts)) %>% 
                            dplyr::mutate(fit = map(.extracts,
                            ~.x$fit$fit$fit),
                            oob_rmse = map_dbl(fit,
                                  ~sqrt(.x$prediction.error)),
                             roots = map(fit, 
                            ~rf_tree_roots(.))
                                   ) %>% 
                            dplyr::select(roots) %>% 
                            unnest(cols = c(roots))
                            }
    
    ##Open a plotting window
    dev.new()
    
    # plot
    rf_roots(fit_rf) %>% 
                group_by(roots) %>% 
                count() %>% 
                dplyr::arrange(desc(n)) %>% 
                dplyr::filter(n > 75) %>% 
                ggplot(aes(fct_reorder(roots, n), n)) +
                geom_col() + 
                coord_flip() + 
                labs(x = "root", y = "count")

##Error message

Error: Problem with `mutate()` input `oob_rmse`.
x non-numeric argument to mathematical function
ℹ Input `oob_rmse` is `map_dbl(fit, ~sqrt(.x$prediction.error))`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)

Data frame - FID

  structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", 
"April", "May", "June", "July", "August", "September", "October", 
"November", "December"), class = "factor"), Frequency = c(36, 
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, 
-36L), class = "data.frame")

Solution

  • It was not about to extract the errors correctly, if you look at the tibble containing all the models:

    fit_rf$.extracts
    [[1]]
    # A tibble: 1 x 1
      .extracts
      <list>   
    1 <ranger> 
    

    It's embedded within a list or list, but there's no names:

    names(fit_rf$.extracts[[1]][[1]])
    NULL
    

    Hence this part will fail:

    map(fit_rf$.extracts,~.x$fit$fit$fit)
    

    If you look at the structure after the first unnest, this is already the fit:

    fit_rf %>% dplyr::select(.extracts) %>% unnest(cols = c(.extracts)) 
    # A tibble: 3 x 1
      .extracts
      <list>   
    1 <ranger> 
    2 <ranger> 
    3 <ranger> 
    

    So we can do:

    rf_roots <- function(x){
                           x %>% 
                           select(.extracts) %>% 
                           unnest(cols = c(.extracts)) %>% 
                           mutate(oob_rmse = map_dbl(.extracts,
                                      ~sqrt(.x$prediction.error)),
                                  roots = map(.extracts, 
                                      ~rf_tree_roots(.))
                                   ) %>% 
                            dplyr::select(roots) %>% 
                            unnest(cols = c(roots))
                            }
    

    This will work now:

    rf_roots(fit_rf)
    # A tibble: 3,000 x 1
       roots          
       <chr>          
     1 Month_August   
     2 Year           
     3 Month_July     
     4 Month_September
     5 Month_December 
     6 Month_March    
     7 Month_July     
     8 Month_September
     9 Month_December 
    10 Days        
    

    Add-on: If the objective is to get the root variable for each tree in each model, one can simply do:

    root_vars = unnest(fit_rf,.extracts) %>% 
    pull(.extracts) %>% 
    map(rf_tree_roots)
    

    Or in base R:

    lapply(fit_rf$.extracts,function(i)rf_tree_roots(i[[1]][[1]]))
    

    And you can easily unlist this to make a barplot.