I am trying to use the tidymodels package to build a linear mixed model. It looks like I'm specifying the formula correctly, as I can run fit() on the workflow. However, when I try to fit it on resamples of my data with fit_resamples(), I get an error about missing factor levels.
It is not clear to me whether I'm doing something wrong or whether the multilevelmod and tune packages are simply not compatible in this way, so I would greatly appreciate any advice.
I have included a reprex using the "mpg" dataset.
EDIT: After looking more into the problem, and especially at this question: Prediction with lme4 on new levels, I found out how to make lmer models predict on new level combinations by using the allow.new.levels = TRUE argument of predict(). My question then is: how can I specify this inside workflow() or fit_resamples()?
I guessed that adding step_novel() to the recipe and default_recipe_blueprint(allow_novel_levels = TRUE) to the add_recipe() call would be sufficient (as in tidymodels Novel levels found in column), but it still doesn't seem to work.
library(tidyverse)
library(tidymodels)
library(multilevelmod)
set.seed(1243)
data(mpg, package = "ggplot2")
training = mpg %>%
  initial_split() %>%
  training() %>%
  mutate(manufacturer = manufacturer %>% as_factor(),
         model = model %>% as_factor())
training_folds = training %>%
  validation_split()
lmm_model = linear_reg() %>%
  set_engine("lmer")
lmm_recipe = recipe(cty ~ year + manufacturer + model, data = training) %>%
  step_novel(manufacturer, model)
lmm_formula = cty ~ year + (1|manufacturer/model)
lmm_workflow = workflow() %>%
  # see: https://stackoverflow.com/questions/68183077/tidymodels-novel-levels-found-in-column
  add_recipe(lmm_recipe, blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>%
  add_model(lmm_model, formula = lmm_formula)
# A simple fit works
fit(lmm_workflow, training)
#> == Workflow [trained] ==========================================================
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> -- Preprocessor ----------------------------------------------------------------
#> 1 Recipe Step
#>
#> * step_novel()
#>
#> -- Model -----------------------------------------------------------------------
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: cty ~ year + (1 | manufacturer/model)
#> Data: data
#> REML criterion at convergence: 864.9986
#> Random effects:
#> Groups Name Std.Dev.
#> model:manufacturer (Intercept) 2.881
#> manufacturer (Intercept) 2.675
#> Residual 2.181
#> Number of obs: 175, groups: model:manufacturer, 38; manufacturer, 15
#> Fixed Effects:
#> (Intercept) year
#> -8.06202 0.01228
# A fit with resamplings doesn't work:
fit_resamples(lmm_workflow, resamples = training_folds)
#> Warning: package 'lme4' was built under R version 4.1.1
#> x validation: preprocessor 1/1, model 1/1 (predictions): Error:
#> ! Assigned data `facto...
#> Warning: All models failed. See the `.notes` column.
#> # Resampling results
#> # Validation Set Split (0.75/0.25)
#> # A tibble: 1 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [131/44]> validation <NULL> <tibble [1 x 3]>
#>
#> There were issues with some computations:
#>
#> - Error(s) x1: ! Assigned data `factor(lvl[1], levels = lvl)` must be compatibl...
#>
#> Use `collect_notes(object)` for more information.
# It seems that the problem is that the combinations of factor levels differ
# between analysis and assessment set
analysis_set = analysis(training_folds$splits[[1]])
assessment_set = assessment(training_folds$splits[[1]])
identical(
  analysis_set %>% distinct(manufacturer, model),
  assessment_set %>% distinct(manufacturer, model)
)
#> [1] FALSE
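To see exactly which level combinations differ, an anti_join can list the combinations present in the assessment set but absent from the analysis set (a small sketch using dplyr, assuming the analysis_set and assessment_set objects defined above):

```r
library(dplyr)

# Level combinations that appear in the assessment set but not in the
# analysis set: these are the "new" levels the model would have to predict on
assessment_set %>%
  distinct(manufacturer, model) %>%
  anti_join(analysis_set %>% distinct(manufacturer, model),
            by = c("manufacturer", "model"))
```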
# directly fitting the model on the analysis set
# (lme4 must be attached to call lmer() directly)
library(lme4)
analysis_fit = lmer(formula = lmm_formula, data = analysis_set)
# predicting the values for the missing combinations of levels is actually possible
# see: https://stackoverflow.com/questions/29259750/prediction-with-lme4-on-new-levels
assessment_predict = analysis_fit %>%
  predict(training_folds$splits[[1]] %>% assessment(),
          allow.new.levels = TRUE)
Created on 2022-05-24 by the reprex package (v2.0.1)
I believe the issue is that you end up with different factor levels in the training vs. testing sets (which, for resamples, are called the analysis and assessment sets in tidymodels):
library(tidymodels)
data(mpg, package = "ggplot2")
training <- mpg %>%
  initial_split() %>%
  training()
training_folds <- training %>%
  validation_split()
training %>% count(manufacturer, model)
#> # A tibble: 38 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 5
#> 2 audi a4 quattro 5
#> 3 audi a6 quattro 3
#> 4 chevrolet c1500 suburban 2wd 4
#> 5 chevrolet corvette 4
#> 6 chevrolet k1500 tahoe 4wd 3
#> 7 chevrolet malibu 5
#> 8 dodge caravan 2wd 6
#> 9 dodge dakota pickup 4wd 8
#> 10 dodge durango 4wd 5
#> # … with 28 more rows
analysis(training_folds$splits[[1]]) %>% count(manufacturer, model)
#> # A tibble: 37 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 1
#> 2 audi a4 quattro 4
#> 3 audi a6 quattro 3
#> 4 chevrolet c1500 suburban 2wd 4
#> 5 chevrolet corvette 3
#> 6 chevrolet k1500 tahoe 4wd 2
#> 7 chevrolet malibu 4
#> 8 dodge caravan 2wd 5
#> 9 dodge dakota pickup 4wd 5
#> 10 dodge durango 4wd 4
#> # … with 27 more rows
assessment(training_folds$splits[[1]]) %>% count(manufacturer, model)
#> # A tibble: 28 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 4
#> 2 audi a4 quattro 1
#> 3 chevrolet corvette 1
#> 4 chevrolet k1500 tahoe 4wd 1
#> 5 chevrolet malibu 1
#> 6 dodge caravan 2wd 1
#> 7 dodge dakota pickup 4wd 3
#> 8 dodge durango 4wd 1
#> 9 dodge ram 1500 pickup 4wd 1
#> 10 ford expedition 2wd 3
#> # … with 18 more rows
Created on 2022-05-23 by the reprex package (v2.0.1)
When you fit the model once with all the training data, you have 38 combinations of manufacturer and model. When you fit using your validation split, you have 37 combinations in the training/analysis set and 28 combinations in the testing/assessment set. Some of these don't overlap, so the model can't predict for observations whose level combinations appear in the assessment set but not in the analysis set.
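Since you already found that predict() on an lmer fit accepts allow.new.levels = TRUE, one workaround is to do the resampling loop by hand instead of through fit_resamples(): fit lmer() on each analysis set, predict on the matching assessment set with allow.new.levels = TRUE, and compute the metric yourself. This is only a sketch using plain lme4/rsample/yardstick calls (not a supported tune API), reusing the training_folds object from above:

```r
library(lme4)      # lmer()
library(rsample)   # analysis(), assessment()
library(yardstick) # rmse()
library(purrr)
library(dplyr)

# Fit on each analysis set and score the held-out assessment set,
# passing allow.new.levels = TRUE so new level combinations are allowed
manual_results <- map_dfr(training_folds$splits, function(split) {
  fit <- lmer(cty ~ year + (1 | manufacturer/model), data = analysis(split))
  holdout <- assessment(split)
  holdout %>%
    mutate(.pred = predict(fit, newdata = holdout, allow.new.levels = TRUE)) %>%
    rmse(truth = cty, estimate = .pred)
})
manual_results
```

For new level combinations, predict() falls back to the population-level (fixed-effect) prediction, which is exactly what allow.new.levels = TRUE permits.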