Search code examples
rplotlypredictgamrandom-effects

How to predict gam model with random effect in R?


I am working on predicting gam model with random effect to produce 3D surface plot by plot_ly.

Here is my code;

x <- runif(100)
y <- runif(100)
z <- x^2 + y + rnorm(100)
r <- rep(1,times=100) # random effect
r[51:100] <- 2 # replace 1 into 2, making two groups
df <- data.frame(x, y, z, r)

gam_fit <- gam(z ~ s(x) + s(y) + s(r,bs="re"), data = df) # fit

#create matrix data for `add_surface` function in `plot_ly`
newx <- seq(0, 1, len=20)
newy <- seq(0, 1, len=30)
newxy <- expand.grid(x = newx, y = newy)
z <- matrix(predict(gam_fit, newdata = newxy), 20, 30) # predict data as matrix

However, the last line results in error;

Error in model.frame.default(ff, data = newdata, na.action = na.act) : 
   variable lengths differ (found for 'r')
In addition: Warning message:
In predict.gam(gam_fit, newdata = newxy) :
  not all required variables have been supplied in  newdata!

Thanks to the previous answer, I am sure that above codes work without random effect, as in here.

How can I predict gam models with random effect?


Solution

  • Assuming you want the surface conditional upon the random effects (but not for a specific level of the random effect), there are two ways.

    The first is to provide a level for the random effect but exclude that term from the predicted values using the exclude argument to predict.gam(). The second is to again use exclude but this time to not provide any data for the random effect and instead stop predict.gam() from checking the newdata using the argument newdata.guaranteed = TRUE.

    Option 1:

    newxy1 <- with(df, expand.grid(x = newx, y = newy, r = 2))
    z1 <- predict(gam_fit, newdata = newxy1, exclude = 's(r)')
    z1 <- matrix(z1, 20, 30)
    

    Option 2:

    z2 <- predict(gam_fit, newdata = newxy, exclude = 's(r)',
                  newdata.guaranteed=TRUE)
    z2 <- matrix(z2, 20, 30)
    

    These produce the same result:

    > all.equal(z1, z2)
    [1] TRUE
    

    A couple of notes:

    1. Which you use will depend on how complex the rest of you model is. I would generally use the first option as it provides an extra check against me doing something stupid when creating the data. But in this instance, with a simple model and set of covariates it seems safe enough to trust that newdata is OK.

    2. Your example uses a random slope (was that intended?), not a random intercept as r is not a factor. If your real example uses a factor random effect then you'll need to be a little more careful when creating the newdata as you need to get the levels of the factor right. For example:

      expand.grid(x = newx, y = newy,
                  r = with(df, factor(2, levels = levels(r))))
      

      should get the right set-up for a factor r