I am trying to obtain model fitted values using the predict
function for a glmer
model with splines::bs
terms. The predict
function throws this error:
Error in `[.data.frame`(fr, vars) : undefined columns selected
Here is a reproducible example using simulated data:
# Reproducible example of glmer spline predict problem
library(tibble)
library(dplyr)
library(splines)
library(lme4)
## Simulate data
set.seed(1234L)
n_groups <- 100L
num_per_group <- rpois(n_groups, 50L) + 1L
group_mean <- rnorm(n_groups, sd = 0.2)
x_vals <- c(-20:19)
simulated_data <- tibble(x = sample(x_vals,
sum(num_per_group),
replace = TRUE),
group_no = rep(c(1:n_groups),
num_per_group),
group_mean = rep(group_mean,
num_per_group)) %>%
arrange(x, group_no) %>%
mutate(lin_pred = 3 - 0.05*abs(x) + group_mean,
y = exp(lin_pred + rnorm(sum(num_per_group),
mean = 0, sd = 0.1)),
y = round(y))
# Fit model
glmer_spline <- glmer(y ~ bs(x,
knots = c(0),
degree = 1L) +
(1 | group_no),
data = simulated_data,
family = poisson(link = "log"),
control = glmerControl(optimizer="bobyqa")
)
summary(glmer_spline)
# This throws an error:
# Error in `[.data.frame`(fr, vars) : undefined columns selected
prediction_output_splines <- predict(glmer_spline,
newdata = simulated_data,
type = "response",
allow.new.levels = TRUE,
re.form=~0)
I am using R version 4.2.3 (2023-03-15), tibble 3.2.1, dplyr 1.1.2, lme4 1.1-34, in RStudio "Mountain Hydrangea" Release (547dcf86, 2023-07-06) for macOS on a 2019 MacBook Pro running Ventura 13.4.1.
I would like to know whether there is a mistake in my code, whether I should be using a different function to obtain predictions, whether what I'm trying to do is somehow flawed, or whether there is a bug in one of the packages involved. Any help much appreciated.
TL;DR
The problem comes from non-standard evaluation, which is tripped up by the fact that you are passing 1L
instead of 1
to the degree
argument of bs
Explanation
The error is actually thrown from inside model.frame.mermod
, where the model frame has column names captured from your formula completely as-is:
[1] "y" "bs(x, knots = 0, degree = 1L)"
[3] "group_no"
Whereas the function tries to subset with variable names which have ultimately been obtained from stats::terms.formula
. This puts the arguments through a C function, and in the process the L
is removed from 1L
, resulting in:
[1] "y" "bs(x, knots = 0, degree = 1)"
Since the second string doesn't match the second column name (due to the missing L
), you get the "unselected column names selected" error.
I guess you could call this a bug, but it would be quite a task to rewrite the code simply to cover the possibility that a user writes an unnecessary L
after one of the numbers in the right hand side of their formula.
Solution
All you need to do is change 1L
to 1
, and everything works as expected:
glmer_spline <- glmer(y ~ bs(x,
knots = 0,
degree = 1) +
(1 | group_no),
data = simulated_data,
family = poisson(link = "log"),
control = glmerControl(optimizer="bobyqa")
)
Then you get no errors when you run
prediction_output_splines <- predict(glmer_spline,
newdata = simulated_data,
type = "response",
allow.new.levels = TRUE,
re.form=~0)
Giving you
head(prediction_output_splines)
#> 1 2 3 4 5 6
#> 7.495069 7.495069 7.495069 7.495069 7.495069 7.495069