I am fitting a large multilevel model using bam()
with discrete=TRUE
to speed computation. I then want to predict from that model while including or ignoring certain random effect terms, which I know can be done with the terms
argument to predict.bam()
. But I am finding inconsistent results when I change the discrete
option in predict.bam()
and I am not sure which is correct.
Everything looks good when using all smooth terms. But when selecting only some smooth terms with discrete = TRUE
, the predicted values are the same as a model fitted with gam()
but the standard errors are different. This sometimes inflates them and sometimes reduces the standard errors. Using discrete=FALSE
produces results in line with a model fitted with gam()
. So is there a bug in predict.bam()
somewhere? Which method is calculating things right?
Here is a reproducible example with outputs:
library(lme4)
library(mgcv)
data(sleepstudy)
model <- gam(Reaction ~ Days + s(Subject, bs = "re") + s(Days, Subject, bs = "re"),
data = sleepstudy,
method = "fREML"
)
model_d <- bam(Reaction ~ Days + s(Subject, bs = "re") + s(Days, Subject, bs = "re"),
data = sleepstudy,
method = "fREML"
,discrete=TRUE
)
## including all smooth terms is fine
head(
data.frame(
gam1 = predict(model),
gam2 = predict(model_d, discrete=TRUE),
gam3 = predict(model_d, discrete=FALSE)
)
)
# gam1 gam2 gam3
# 1 252.9178 252.9178 252.9178
# 2 272.7086 272.7086 272.7086
# 3 292.4994 292.4994 292.4994
# 4 312.2901 312.2901 312.2901
# 5 332.0809 332.0809 332.0809
# 6 351.8717 351.8717 351.8717
head(
data.frame(
gam1 = predict(model,se.fit=TRUE)$se.fit,
gam2 = predict(model_d, discrete=TRUE, se.fit=TRUE)$se.fit,
gam3 = predict(model_d, discrete=FALSE, se.fit=TRUE)$se.fit
)
)
# gam1 gam2 gam3
# 1 12.410215 12.410215 12.410215
# 2 10.660886 10.660886 10.660886
# 3 9.191220 9.191220 9.191220
# 4 8.153867 8.153867 8.153867
# 5 7.724996 7.724996 7.724996
# 6 8.003034 8.003034 8.003034
## ---- selecting only some smooth terms
## with discrete = TRUE, predicted values are the same but
## standard errors returned are the same as those with all smooths included.
## This sometimes inflates them and sometimes reduces them.
head(
data.frame(
gam1 = predict(model, terms=c("s(Subject)")),
gam2 = predict(model_d, terms=c("s(Subject)")),
gam3 = predict(model_d, terms=c("s(Subject)"))
)
)
# gam1 gam2 gam3
# 1 252.9178 252.9178 252.9178
# 2 263.3851 272.7086 272.7086
# 3 273.8524 292.4994 292.4994
# 4 284.3197 312.2901 312.2901
# 5 294.7869 332.0809 332.0809
# 6 305.2542 351.8717 351.8717
head(
data.frame(
gam1 = predict(model, terms=c("s(Subject)"),se.fit=TRUE)$se.fit,
gam2 = predict(model_d, terms=c("s(Subject)"), discrete=TRUE, se.fit=TRUE)$se.fit,
gam3 = predict(model_d, terms=c("s(Subject)"), discrete=FALSE, se.fit=TRUE)$se.fit
)
)
# gam1 gam2 gam3
# 1 12.41021 12.410215 12.41021
# 2 12.34846 10.660886 12.34846
# 3 12.48280 9.191220 12.48280
# 4 12.80704 8.153867 12.80704
# 5 13.30733 7.724996 13.30733
# 6 13.96474 8.003034 13.96474
head(
data.frame(
gam1 = predict(model, terms=c("s(Days, Subject)"),se.fit=TRUE)$se.fit,
gam2 = predict(model_d, terms=c("s(Days, Subject)"), discrete=TRUE, se.fit=TRUE)$se.fit,
gam3 = predict(model_d, terms=c("s(Days, Subject)"), discrete=FALSE, se.fit=TRUE)$se.fit
)
)
# gam1 gam2 gam3
# 1 6.885381 12.410215 6.885381
# 2 6.773449 10.660886 6.773449
# 3 7.015357 9.191220 7.015357
# 4 7.577292 8.153867 7.577292
# 5 8.395234 7.724996 8.395234
# 6 9.402609 8.003034 9.402609
I did some more digging and may have answered my own question but would still appreciate anyone who can weigh in with more expertise.
I tried manually computing things using predict(..., method="lpmatrix")
following this helpful blog post by Gavin Simpson.
It looks like the discrete=TRUE
outputs are just wrong and that this is some kind of bug.
This code continues from the prior code:
### ---- manual computation with simulation via lpmatrix
mvrnorm <- MASS::mvrnorm
lp <- predict(model_d, type = "lpmatrix")
coefs <- coef(model_d)
vc <- vcov(model_d)
set.seed(123)
sim <- mvrnorm(5e4, mu = coefs, Sigma = vc)
fits <- lp %*% t(sim)
se.fit <- apply(fits, 1, sd)
## with all effects
head(
data.frame(
gam1 = predict(model,se.fit=TRUE)$se.fit,
gam2 = predict(model_d, discrete=TRUE, se.fit=TRUE)$se.fit,
gam3 = predict(model_d, discrete=FALSE, se.fit=TRUE)$se.fit,
man = se.fit
)
)
# gam1 gam2 gam3 man
# 1 12.410220 12.410215 12.410215 12.453005
# 2 10.660891 10.660886 10.660886 10.704449
# 3 9.191224 9.191220 9.191220 9.235621
# 4 8.153871 8.153867 8.153867 8.198276
# 5 7.724998 7.724996 7.724996 7.767261
# 6 8.003034 8.003034 8.003034 8.040678
## ---- with only s(Subject) random effects
want <- c(c(1,2), grep("s\\(Subject\\)", colnames(lp))) # regex is obnoxious here
fits <- lp[, want] %*% t(sim[, want])
se.fit <- apply(fits, 1, sd)
head(
data.frame(
gam1 = predict(model, terms=c("s(Subject)"),se.fit=TRUE)$se.fit,
gam2 = predict(model_d, terms=c("s(Subject)"), discrete=TRUE, se.fit=TRUE)$se.fit,
gam3 = predict(model_d, terms=c("s(Subject)"), discrete=FALSE, se.fit=TRUE)$se.fit,
man = se.fit
)
)
# gam1 gam2 gam3 man
# 1 12.41022 12.410215 12.41021 12.45300
# 2 12.34847 10.660886 12.34846 12.39594
# 3 12.48280 9.191220 12.48280 12.53395
# 4 12.80704 8.153867 12.80704 12.86074
# 5 13.30733 7.724996 13.30733 13.36248
# 6 13.96474 8.003034 13.96474 14.02039
This is a bug in the discrete prediction code. Fixed for mgcv_1.8-32. Thanks! Simon