r, cross-validation, data-fitting, tidymodels, r-ranger

fit_resamples with ranger package fails


I am trying to use cross-validation resampling to fit a random forest from the ranger package. The fit without resampling works, but as soon as I try a resampled fit it fails with the error below.

Consider the following df:

df<-structure(list(a = c(1379405931, 732812609, 18614430, 1961678341, 
2362202769, 55687714, 72044715, 236503454, 61988734, 2524712675, 
98081131, 1366513385, 48203585, 697397991, 28132854), b = structure(c(1L, 
6L, 2L, 5L, 7L, 8L, 8L, 1L, 3L, 4L, 3L, 5L, 7L, 2L, 2L), .Label = c("CA", 
"IA", "IL", "LA", "MA", "MN", "TX", "WI"), class = "factor"), 
    c = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 
    2L, 2L, 2L, 1L), .Label = c("R", "U"), class = "factor"), 
    d = structure(c(3L, 3L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 
    3L, 2L, 3L, 1L), .Label = c("CAH", "LTCH", "STH"), class = "factor"), 
    e = structure(c(3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 2L, 
    2L, 3L, 3L, 3L), .Label = c("cancer", "general long term", 
    "psychiatric", "rehabilitation"), class = "factor")), row.names = c(NA, 
-15L), class = c("tbl_df", "tbl", "data.frame"))

The following simple fit works without issues:

library(tidymodels)
library(ranger)

rf_spec <- rand_forest(mode = 'regression') %>% 
  set_engine('ranger')


rf_spec %>% 
  fit(a ~. , data = df)

But as soon as I run the cross-validation via

rf_folds <- vfold_cv(df, strata = c)

fit_resamples(a ~ . ,
              rf_spec,
              rf_folds)

I get the following error:

model: Error in parse.formula(formula, data, env = parent.frame()): Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.


Solution

  • The commenter above is correct that the source of the issue here is the spaces in the levels of a factor column (for example, the level "general long term" in column e). The functions for resampling and the functions for plain fitting currently handle such levels differently, and we are actively looking into how to solve this problem for users. Thank you for your patience!

    In the meantime, I would recommend setting up a simple workflow() plus a recipe(), which together will handle all the necessary dummy variable munging for you.

    library(tidymodels)
    
    rf_spec <- rand_forest(mode = "regression") %>% 
      set_engine("ranger")
    
    rf_wf <- workflow() %>%
      add_model(rf_spec) %>%
      add_recipe(recipe(a ~ ., data = df))
    
    
    fit(rf_wf, data = df)
    #> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════════════════════
    #> Preprocessor: Recipe
    #> Model: rand_forest()
    #> 
    #> ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────
    #> 0 Recipe Steps
    #> 
    #> ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────
    #> Ranger result
    #> 
    #> Call:
    #>  ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
    #> 
    #> Type:                             Regression 
    #> Number of trees:                  500 
    #> Sample size:                      15 
    #> Number of independent variables:  4 
    #> Mtry:                             2 
    #> Target node size:                 5 
    #> Variable importance mode:         none 
    #> Splitrule:                        variance 
    #> OOB prediction error (MSE):       4.7042e+17 
    #> R squared (OOB):                  0.4341146
    
    rf_folds <- vfold_cv(df, strata = c)
    
    fit_resamples(rf_wf,
                  rf_folds)
    #> #  10-fold cross-validation using stratification 
    #> # A tibble: 9 x 4
    #>   splits         id    .metrics         .notes          
    #>   <list>         <chr> <list>           <list>          
    #> 1 <split [13/2]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
    #> 2 <split [13/2]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
    #> 3 <split [13/2]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
    #> 4 <split [13/2]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
    #> 5 <split [13/2]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>
    #> 6 <split [13/2]> Fold6 <tibble [2 × 3]> <tibble [0 × 1]>
    #> 7 <split [14/1]> Fold7 <tibble [2 × 3]> <tibble [0 × 1]>
    #> 8 <split [14/1]> Fold8 <tibble [2 × 3]> <tibble [0 × 1]>
    #> 9 <split [14/1]> Fold9 <tibble [2 × 3]> <tibble [0 × 1]>
    

    Created on 2020-03-20 by the reprex package (v0.3.0)
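
To see why spaces in factor levels can lead to "illegal column names", here is a minimal base-R sketch. It is only an illustration under assumptions: `toy` is a made-up data frame mirroring column e from the question, and base `model.matrix()` stands in for whatever dummy-variable expansion the resampling code performs internally, which may differ in detail. It also shows one possible alternative workaround, cleaning the levels with `make.names()` before modelling.

```r
# Hypothetical toy data mirroring factor column `e` from the question.
toy <- data.frame(
  e = factor(c("psychiatric", "general long term", "cancer", "rehabilitation"))
)

# Expand the factor into dummy columns (assumption: a stand-in for the
# formula preprocessing done during resampling). Drop the intercept column.
dummy_names <- colnames(model.matrix(~ e, data = toy))[-1]
print(dummy_names)

# Dummy column names that are not syntactically valid R names.
bad_names <- dummy_names[dummy_names != make.names(dummy_names)]
print(bad_names)

# Possible workaround: make the factor levels syntactic up front.
levels(toy$e) <- make.names(levels(toy$e))
clean_names <- colnames(model.matrix(~ e, data = toy))[-1]
print(clean_names)
```

On the data from the question, the same idea would be `levels(df$e) <- make.names(levels(df$e))` before calling `vfold_cv()`. I have not verified that this alone makes the formula version of `fit_resamples()` run on every package version, so the workflow plus recipe approach above remains the safer route.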