Search code examples
rgammgcv

error in mgcv "incorrect number of dimensions"


I'm running a fairly hefty gam im mgcv (R version 4.3.1 on linux, ubuntu 22.0), and i've been running across this error. here is what the gam looks like:

model_ndvi <-
   gam(list(
      # mean predictor
         NDVI ~ cutblock +
            s(long, lat, bs = 'ds', k = 100, by = cutblock) + #observe trends geographically
            s(year, bs = 'cr', k = 7, by = cutblock) + #view trends over time
            s(doy, bs = 'cc', k = 8, by = cutblock) + #view yearly trends
            s(cb_age, bs = "cr", k = 3) + #view trends by cutblock age
            ti(year, doy, bs = 'cc') + #view how yearly trends changed over time
            ti(cb_age, year, bs = 'cr'), #see how the cutblock trends changed over time
        # variance predictor
          ~ cutblock +
            s(long, lat, bs = 'ds', k = 50, by = cutblock) +
            s(year, bs = 'cr', k = 5, by = cutblock) +
            s(doy, bs = 'cc', k = 4, by = cutblock)+
            s(cb_age, bs = "cr", k = 3) +
            ti(year, doy, bs = 'cc') +
            ti(cb_age, year, bs = 'cr')),
        family = betals(), #beta location scale distribution for the data
        data = babyd,
        method = 'REML',
        knots = list(doy = c(0.5, 366.5)),
        control = gam.control(nthreads = 30), trace = TRUE)

i've been running across the error Error in backsolve(L, forwardsolve(t(L), (D * A)[piv, ]))[ipiv, , drop = FALSE] : incorrect number of dimensions when trying to run the model. given that my dataset is about 87 GB large, i have been testing the model on a much smaller subset of data (about 90 MB), obtained from the original dataset using the slice function. from what i have seen, this error has to do with subsetting data that doesn't exist. i was worried that when i extracted a smaller amount of data from the original data set, there wouldn't be enough unique values and that would cause this error, but after inspecting the data by running unique on each variable i have seen that, as far as i know, there are more than enough unique values in each variable for the model to run.

here is a snippet of what the data looks like (using head):

  long      lat        NDVI    cutblock harvestyr  date    dec_date year doy cb_age
1 -127.2368 51.60493 -0.0275    FALSE        NA 2000-02-18 2000.131 2000  49     NA
2 -127.2368 51.60712 -0.0053    FALSE        NA 2000-02-18 2000.131 2000  49     NA
3 -127.2368 51.60932  0.1173    FALSE        NA 2000-02-18 2000.131 2000  49     NA
4 -127.2368 51.61151  0.2855    FALSE        NA 2000-02-18 2000.131 2000  49     NA
5 -127.2368 51.61371  0.2417    FALSE        NA 2000-02-18 2000.131 2000  49     NA
6 -127.2368 51.61591  0.4240    FALSE        NA 2000-02-18 2000.131 2000  49     NA

and here is what it looks like when i run str on the data:

'data.frame':   1474926 obs. of  10 variables:
 $ long     : num  -127 -127 -127 -127 -127 ...
 $ lat      : num  51.6 54 53.4 52.9 52.1 ...
 $ NDVI     : num  -0.0275 0.4184 -0.0103 -0.0292 0.5711 ...
 $ cutblock : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
 $ harvestyr: num  NA NA NA NA NA NA NA NA NA NA ...
 $ date     : Date, format: "2000-02-18" "2000-02-18" "2000-02-18" "2000-02-18" ...
 $ dec_date : num  2000 2000 2000 2000 2000 ...
 $ year     : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ doy      : int  49 49 49 49 49 49 49 49 49 49 ...
 $ cb_age   : int  NA NA NA NA NA NA NA NA NA NA ...

for a bit more context about the model in case anyone needs more information - i'm looking at ndvi trends over the past 20 or so years within and outside of clearcut areas in forests (called cutblocks in the data) and trying to determine at what point these clearcut areas recover back to full productivity, taking into account the changes in both mean and variance. the betals function is a beta location scale distribution function, written for this project by simon wood (author of mgcv).

i've tried looking at the structure of the data to see if that is the issue, but all information online is very simple in relation to what i am trying to do. i'm wondering if anyone has come across this error before in mgcv! any help is appreciated :))


Solution

  • the solution to this ended up being quite simple: the location scale family wasn't automatically selecting the minimum possible number of knots for one of the terms. while i specified k = 3, the minimum was 5, so changing it fixed the error.

    this seems to be an issue for other location scale distribution families in mgcv as well - we tested the model using gaulss and had the same error, but changing the knots fixed it.