Search code examples
rstatisticspredictgammgcv

mgcv: standard errors differ in predict.bam() with discrete = true


I am fitting a large multilevel model using bam() with discrete=TRUE to speed computation. I then want to predict from that model while including or ignoring certain random effect terms, which I know can be done with the terms argument to predict.bam(). But I am finding inconsistent results when I change the discrete option in predict.bam() and I am not sure which is correct.

Everything looks good when using all smooth terms. But when selecting only some smooth terms with discrete = TRUE, the predicted values are the same as a model fitted with gam() but the standard errors are different. This sometimes inflates them and sometimes reduces the standard errors. Using discrete=FALSE produces results in line with a model fitted with gam(). So is there a bug in predict.bam() somewhere? Which method is calculating things right?

Here is a reproducible example with outputs:

library(lme4)
library(mgcv)

data(sleepstudy)

model <- gam(Reaction ~ Days + s(Subject, bs = "re") + s(Days, Subject, bs = "re"),
                data = sleepstudy,
                method = "fREML"
)

model_d <- bam(Reaction ~ Days + s(Subject, bs = "re") + s(Days, Subject, bs = "re"),
                 data = sleepstudy,
                 method = "fREML"
                 ,discrete=TRUE
)

## including all smooth terms is fine
head(
    data.frame(
        gam1 = predict(model),
        gam2 = predict(model_d, discrete=TRUE),
        gam3 = predict(model_d, discrete=FALSE)
    )
)
# gam1     gam2     gam3
# 1 252.9178 252.9178 252.9178
# 2 272.7086 272.7086 272.7086
# 3 292.4994 292.4994 292.4994
# 4 312.2901 312.2901 312.2901
# 5 332.0809 332.0809 332.0809
# 6 351.8717 351.8717 351.8717

head(
    data.frame(
        gam1 = predict(model,se.fit=TRUE)$se.fit,
        gam2 = predict(model_d, discrete=TRUE, se.fit=TRUE)$se.fit,
        gam3 = predict(model_d, discrete=FALSE, se.fit=TRUE)$se.fit
    )
)
# gam1      gam2      gam3
# 1 12.410215 12.410215 12.410215
# 2 10.660886 10.660886 10.660886
# 3  9.191220  9.191220  9.191220
# 4  8.153867  8.153867  8.153867
# 5  7.724996  7.724996  7.724996
# 6  8.003034  8.003034  8.003034

## ---- selecting only some smooth terms
## with discrete = TRUE, predicted values are the same but 
## standard errors returned are the same as those with all smooths included.
## This sometimes inflates them and sometimes reduces them.

head(
    data.frame(
        gam1 = predict(model, terms=c("s(Subject)")),
        gam2 = predict(model_d, terms=c("s(Subject)")),
        gam3 = predict(model_d, terms=c("s(Subject)"))
    )
)

# gam1     gam2     gam3
# 1 252.9178 252.9178 252.9178
# 2 263.3851 272.7086 272.7086
# 3 273.8524 292.4994 292.4994
# 4 284.3197 312.2901 312.2901
# 5 294.7869 332.0809 332.0809
# 6 305.2542 351.8717 351.8717

head(
    data.frame(
        gam1 = predict(model, terms=c("s(Subject)"),se.fit=TRUE)$se.fit,
        gam2 = predict(model_d, terms=c("s(Subject)"), discrete=TRUE, se.fit=TRUE)$se.fit,
        gam3 = predict(model_d, terms=c("s(Subject)"), discrete=FALSE, se.fit=TRUE)$se.fit
    )
)

# gam1      gam2     gam3
# 1 12.41021 12.410215 12.41021
# 2 12.34846 10.660886 12.34846
# 3 12.48280  9.191220 12.48280
# 4 12.80704  8.153867 12.80704
# 5 13.30733  7.724996 13.30733
# 6 13.96474  8.003034 13.96474

head(
    data.frame(
        gam1 = predict(model, terms=c("s(Days, Subject)"),se.fit=TRUE)$se.fit,
        gam2 = predict(model_d, terms=c("s(Days, Subject)"), discrete=TRUE, se.fit=TRUE)$se.fit,
        gam3 = predict(model_d, terms=c("s(Days, Subject)"), discrete=FALSE, se.fit=TRUE)$se.fit
    )
)

# gam1      gam2     gam3
# 1 6.885381 12.410215 6.885381
# 2 6.773449 10.660886 6.773449
# 3 7.015357  9.191220 7.015357
# 4 7.577292  8.153867 7.577292
# 5 8.395234  7.724996 8.395234
# 6 9.402609  8.003034 9.402609

Updated

I did some more digging and may have answered my own question but would still appreciate anyone who can weigh in with more expertise.

I tried manually computing things using predict(..., method="lpmatrix") following this helpful blog post by Gavin Simpson.

It looks like the discrete=TRUE outputs are just wrong and that this is some kind of bug.

This code continues from the prior code:


### ---- manual computation with simulation via lpmatrix
mvrnorm <- MASS::mvrnorm

lp <- predict(model_d, type = "lpmatrix")

coefs <- coef(model_d)
vc <- vcov(model_d)

set.seed(123)
sim <- mvrnorm(5e4, mu = coefs, Sigma = vc)

fits <- lp %*% t(sim)

se.fit <- apply(fits, 1, sd)

## with all effects
head(
    data.frame(
        gam1 = predict(model,se.fit=TRUE)$se.fit,
        gam2 = predict(model_d, discrete=TRUE, se.fit=TRUE)$se.fit,
        gam3 = predict(model_d, discrete=FALSE, se.fit=TRUE)$se.fit,
        man = se.fit
    )
)
# gam1      gam2      gam3       man
# 1 12.410220 12.410215 12.410215 12.453005
# 2 10.660891 10.660886 10.660886 10.704449
# 3  9.191224  9.191220  9.191220  9.235621
# 4  8.153871  8.153867  8.153867  8.198276
# 5  7.724998  7.724996  7.724996  7.767261
# 6  8.003034  8.003034  8.003034  8.040678


## ---- with only s(Subject) random effects
want <- c(c(1,2), grep("s\\(Subject\\)", colnames(lp))) # regex is obnoxious here
fits <- lp[, want] %*% t(sim[, want])

se.fit <- apply(fits, 1, sd)

head(
    data.frame(
        gam1 = predict(model, terms=c("s(Subject)"),se.fit=TRUE)$se.fit,
        gam2 = predict(model_d, terms=c("s(Subject)"), discrete=TRUE, se.fit=TRUE)$se.fit,
        gam3 = predict(model_d, terms=c("s(Subject)"), discrete=FALSE, se.fit=TRUE)$se.fit,
        man = se.fit
    )
)

# gam1      gam2     gam3      man
# 1 12.41022 12.410215 12.41021 12.45300
# 2 12.34847 10.660886 12.34846 12.39594
# 3 12.48280  9.191220 12.48280 12.53395
# 4 12.80704  8.153867 12.80704 12.86074
# 5 13.30733  7.724996 13.30733 13.36248
# 6 13.96474  8.003034 13.96474 14.02039

Solution

  • This is a bug in the discrete prediction code. Fixed for mgcv_1.8-32. Thanks! Simon