Tags: r, lme4, tidymodels

R: using a lmer model in fit_resamples() fails with "Error: Assigned data `factor(lvl[1], levels = lvl)` must be compatible with existing data."


I am trying to use the tidymodels package to build a linear mixed model. It looks like I'm specifying the formula correctly, as I can run fit() on the workflow. However, when I try to run it on resamples of my data using fit_resamples(), I get an error related to missing factor levels.

It is not clear to me whether I'm doing something wrong or whether the "multilevelmod" and "tune" packages are simply not compatible in this way, so I would greatly appreciate any advice.

I have included a reprex using the "mpg" dataset.

EDIT: After looking further into the problem, and especially at this question: Prediction with lme4 on new levels, I found out how to make lmer models predict on new level combinations: use the allow.new.levels = TRUE argument of the predict() function. My question is then: how can I specify this inside workflow() or fit_resamples()?
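For reference, a minimal sketch of that lme4 mechanism (the rows are split arbitrarily here just to create manufacturer/model combinations that are unseen during fitting; the actual call appears at the end of the reprex below):

library(lme4)
data(mpg, package = "ggplot2")
# fit on the first 150 rows; mpg is sorted by manufacturer, so the
# remaining rows contain new manufacturer/model combinations
m = lmer(cty ~ year + (1|manufacturer/model), data = mpg[1:150, ])
predict(m, newdata = mpg[151:234, ], allow.new.levels = TRUE)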

I guessed that adding step_novel() to the recipe and passing default_recipe_blueprint(allow_novel_levels = TRUE) to the add_recipe() call would be sufficient (as in tidymodels Novel levels found in column), but it still doesn't work.

library(tidyverse)
library(tidymodels)
library(multilevelmod)

set.seed(1243)

data(mpg, package = "ggplot2")

training = mpg %>%
  initial_split() %>%
  training() %>% 
  mutate(manufacturer = manufacturer %>% as_factor(),
         model = model %>% as_factor())

training_folds = training %>%
  validation_split()

lmm_model = linear_reg() %>% 
  set_engine("lmer")

lmm_recipe = recipe(cty ~ year + manufacturer + model, data = training) %>%
  step_novel(manufacturer, model)

lmm_formula = cty ~ year + (1|manufacturer/model)

lmm_workflow = workflow() %>% 
  # see: https://stackoverflow.com/questions/68183077/tidymodels-novel-levels-found-in-column
  add_recipe(lmm_recipe, blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>% 
  add_model(lmm_model, formula = lmm_formula)

# A simple fit works
fit(lmm_workflow, training)
#> == Workflow [trained] ==========================================================
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> -- Preprocessor ----------------------------------------------------------------
#> 1 Recipe Step
#> 
#> * step_novel()
#> 
#> -- Model -----------------------------------------------------------------------
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: cty ~ year + (1 | manufacturer/model)
#>    Data: data
#> REML criterion at convergence: 864.9986
#> Random effects:
#>  Groups             Name        Std.Dev.
#>  model:manufacturer (Intercept) 2.881   
#>  manufacturer       (Intercept) 2.675   
#>  Residual                       2.181   
#> Number of obs: 175, groups:  model:manufacturer, 38; manufacturer, 15
#> Fixed Effects:
#> (Intercept)         year  
#>    -8.06202      0.01228

# A fit with resamples doesn't work:
fit_resamples(lmm_workflow, resamples = training_folds)
#> Warning: package 'lme4' was built under R version 4.1.1
#> x validation: preprocessor 1/1, model 1/1 (predictions): Error:
#> ! Assigned data `facto...
#> Warning: All models failed. See the `.notes` column.
#> # Resampling results
#> # Validation Set Split (0.75/0.25)  
#> # A tibble: 1 x 4
#>   splits           id         .metrics .notes          
#>   <list>           <chr>      <list>   <list>          
#> 1 <split [131/44]> validation <NULL>   <tibble [1 x 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Error(s) x1:  ! Assigned data `factor(lvl[1], levels = lvl)` must be compatibl...
#> 
#> Use `collect_notes(object)` for more information.

# It seems that the problem is that the combinations of factor levels differ
# between the analysis and assessment sets
analysis_set = analysis(training_folds$splits[[1]])
assessment_set = assessment(training_folds$splits[[1]])

identical(
  analysis_set %>% distinct(manufacturer, model),
  assessment_set %>% distinct(manufacturer, model)
)
#> [1] FALSE
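
# An anti-join sketch (not part of the recorded reprex output) listing
# the combinations present in the assessment set but absent from the
# analysis set:
assessment_set %>%
  distinct(manufacturer, model) %>%
  anti_join(analysis_set %>% distinct(manufacturer, model),
            by = c("manufacturer", "model"))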

# directly fitting the model on the analysis set
analysis_fit = lmer(formula = lmm_formula, data = analysis_set)

# predicting the values for the missing combinations of levels is actually possible
# see: https://stackoverflow.com/questions/29259750/prediction-with-lme4-on-new-levels
assessment_predict = analysis_fit %>%
  predict(assessment_set, allow.new.levels = TRUE)

Created on 2022-05-24 by the reprex package (v2.0.1)


Solution

  • I believe the issue is that you are ending up with different factor levels in the training vs. testing sets (which, for resamples, are called the analysis and assessment sets in tidymodels):

    library(tidymodels)
    data(mpg, package = "ggplot2")
    
    training <- mpg %>%
      initial_split() %>%
      training()
    
    training_folds <- training %>%
      validation_split()
    
    training %>% count(manufacturer, model)
    #> # A tibble: 38 × 3
    #>    manufacturer model                  n
    #>    <chr>        <chr>              <int>
    #>  1 audi         a4                     5
    #>  2 audi         a4 quattro             5
    #>  3 audi         a6 quattro             3
    #>  4 chevrolet    c1500 suburban 2wd     4
    #>  5 chevrolet    corvette               4
    #>  6 chevrolet    k1500 tahoe 4wd        3
    #>  7 chevrolet    malibu                 5
    #>  8 dodge        caravan 2wd            6
    #>  9 dodge        dakota pickup 4wd      8
    #> 10 dodge        durango 4wd            5
    #> # … with 28 more rows
    analysis(training_folds$splits[[1]]) %>% count(manufacturer, model)
    #> # A tibble: 37 × 3
    #>    manufacturer model                  n
    #>    <chr>        <chr>              <int>
    #>  1 audi         a4                     1
    #>  2 audi         a4 quattro             4
    #>  3 audi         a6 quattro             3
    #>  4 chevrolet    c1500 suburban 2wd     4
    #>  5 chevrolet    corvette               3
    #>  6 chevrolet    k1500 tahoe 4wd        2
    #>  7 chevrolet    malibu                 4
    #>  8 dodge        caravan 2wd            5
    #>  9 dodge        dakota pickup 4wd      5
    #> 10 dodge        durango 4wd            4
    #> # … with 27 more rows
    assessment(training_folds$splits[[1]]) %>% count(manufacturer, model)
    #> # A tibble: 28 × 3
    #>    manufacturer model                   n
    #>    <chr>        <chr>               <int>
    #>  1 audi         a4                      4
    #>  2 audi         a4 quattro              1
    #>  3 chevrolet    corvette                1
    #>  4 chevrolet    k1500 tahoe 4wd         1
    #>  5 chevrolet    malibu                  1
    #>  6 dodge        caravan 2wd             1
    #>  7 dodge        dakota pickup 4wd       3
    #>  8 dodge        durango 4wd             1
    #>  9 dodge        ram 1500 pickup 4wd     1
    #> 10 ford         expedition 2wd          3
    #> # … with 18 more rows
    

    Created on 2022-05-23 by the reprex package (v2.0.1)

    When you fit the model one time with all the training data, you have 38 combinations of manufacturer and model. When you fit using your validation split, you have 37 combinations in the training/analysis set and 28 combinations in the testing/assessment set. Some of these don't overlap, so the model can't predict for observations whose combination appears in the assessment set but not in the analysis set.
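
    Since lme4 itself can predict on new levels (as the question's edit notes), one possible workaround is to skip fit_resamples() and loop over the rsample splits by hand, passing allow.new.levels = TRUE to predict(). A minimal sketch, assuming the lmm_formula and training_folds objects from the question and using yardstick::rmse_vec() for the metric:

    library(lme4)

    # sketch: fit on each analysis set, predict its assessment set with
    # allow.new.levels = TRUE, and compute an RMSE per resample
    manual_rmse <- purrr::map2_dfr(
      training_folds$splits, training_folds$id,
      function(split, id) {
        fit <- lmer(lmm_formula, data = analysis(split))
        holdout <- assessment(split)
        preds <- predict(fit, newdata = holdout, allow.new.levels = TRUE)
        tibble::tibble(id = id,
                       rmse = yardstick::rmse_vec(truth = holdout$cty,
                                                  estimate = preds))
      }
    )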