Search code examples
warningsvariancegamgratia

Variance calculation warning: longer object length is not a multiple


I'm getting this warning after checking the variance components of my model. I don't have NA's and the warning goes away when I remove this s(CYR.std, fSite, bs = "fs") term. Is it possibly related to having only 1 observation associated with a fSite in a year (CYR.std)? I get the same warning with the full dataset.

gratia::variance_comp(m1)
# A tibble: 4 × 5
  component         variance   std_dev   lower_ci    upper_ci
  <chr>                <dbl>     <dbl>      <dbl>       <dbl>
1 s(sal)            4.47e-10 0.0000211 0.00000968   0.0000462
2 s(CYR.std,fSite)1 3.92e- 2 0.198     0.112        0.352    
3 s(CYR.std,fSite)2 6.13e- 1 0.783     0          Inf        
4 s(CYR.std,fSite)3 3.43e- 7 0.000586  0.000268     0.00128  

Warning messages:
1: In lsd - crit * sd.lsd :
  longer object length is not a multiple of shorter object length
2: In lsd + crit * sd.lsd :
  longer object length is not a multiple of shorter object length

Random subset of data recreates the error:

MyVar <- c("fCYR", "CYR.std", "fSeason", "fSite", "area_sampled", "num", "sal")

x <- shrimp2[MyVar][sample(nrow(shrimp2[MyVar]), 100), ]

na_count <-sapply(x, function(y) sum(length(which(is.na(y)))))

na_count <- data.frame(na_count)
na_count
             na_count
fCYR                0
CYR.std             0
fSeason             0
fSite               0
area_sampled        0
num                 0
sal                 0

library(mgcv)    

m1 <- bam(num ~ s(sal) + 
            
            CYR.std*fSeason +
            
            s(CYR.std, fSite, bs = "fs"), 
          offset(log(area_sampled)),
          method = "fREML",
          discrete = TRUE,
          family = nb(link = "log"),
          data = x)

UPDATE:

I even get the same error with the most basic of random effects...

m <- bam(num ~ s(sal) + 

            CYR.std + 
           
            fSeason +
         
            s(fSite, bs = "re") + 

            s(fCYR, bs = "re"), 
          offset(log(area_sampled)), 
          method = "fREML",
          discrete = TRUE,
          family = nb(link = "log"),
          data = x)

> mgcv::gam.vcomp(m)

Standard deviations and 0.95 confidence intervals:

            std.dev     lower      upper
s(sal)   0.01623266 0.0105176 0.02505317
s(fSite) 0.83502554 0.3549272 1.96453684
s(fCYR)  0.54846495 0.3553660 0.84649014

Rank: 2/2
Warning messages:
1: In lsd - crit * sd.lsd :
  longer object length is not a multiple of shorter object length
2: In lsd + crit * sd.lsd :
  longer object length is not a multiple of shorter object length

Full dataset (n=1327):

> table(shrimp2$CYR.std, shrimp2$fSite)
    
     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
  0  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  1  2  2  2  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  1  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2 2 2 2 2 2 1 1 1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  2  2  2  2  2  2  2  2  2  2  2  2
  3  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  4  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  5  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  2  2  2
  6  2 1 1 1 1 1 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1
  7  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  8  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  9  2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  1  2  2  1  2  2  2  2  2  2  1  2  1  2  2  2  2  2  2  2  2  2  2  2
  10 2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  11 2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  12 2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  13 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  14 2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
    
     42 43 44 45 46 47
  0   2  2  2  1  2  2
  1   2  2  2  2  2  2
  2   2  1  1  1  1  1
  3   2  2  2  2  2  2
  4   2  2  2  2  2  2
  5   2  2  2  2  2  2
  6   1  1  1  1  1  1
  7   2  2  2  2  2  2
  8   2  2  2  2  2  2
  9   2  2  2  2  2  2
  10  2  2  2  2  2  2
  11  2  2  2  2  2  2
  12  2  2  2  2  2  2
  13  1  1  1  1  1  1
  14  2  2  2  2  2  2

Smaller dataset:

> dput(x)
structure(list(fCYR = structure(c(9L, 9L, 12L, 15L, 11L, 11L, 
8L, 11L, 2L, 9L, 1L, 14L, 13L, 10L, 3L, 15L, 14L, 6L, 7L, 1L, 
9L, 8L, 6L, 8L, 2L, 4L, 3L, 6L, 7L, 15L, 11L, 1L, 2L, 9L, 5L, 
5L, 2L, 6L, 11L, 6L, 13L, 12L, 9L, 6L, 6L, 6L, 7L, 5L, 1L, 12L, 
1L, 14L, 14L, 5L, 8L, 2L, 4L, 1L, 15L, 3L, 8L, 12L, 12L, 15L, 
4L, 11L, 5L, 1L, 9L, 6L, 4L, 7L, 15L, 7L, 3L, 6L, 11L, 14L, 3L, 
10L, 11L, 13L, 7L, 4L, 11L, 3L, 10L, 10L, 9L, 1L, 10L, 10L, 1L, 
10L, 9L, 9L, 7L, 15L, 11L, 14L), levels = c("2008", "2009", "2010", 
"2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", 
"2019", "2020", "2021", "2022"), class = "factor"), CYR.std = c(8L, 
8L, 11L, 14L, 10L, 10L, 7L, 10L, 1L, 8L, 0L, 13L, 12L, 9L, 2L, 
14L, 13L, 5L, 6L, 0L, 8L, 7L, 5L, 7L, 1L, 3L, 2L, 5L, 6L, 14L, 
10L, 0L, 1L, 8L, 4L, 4L, 1L, 5L, 10L, 5L, 12L, 11L, 8L, 5L, 5L, 
5L, 6L, 4L, 0L, 11L, 0L, 13L, 13L, 4L, 7L, 1L, 3L, 0L, 14L, 2L, 
7L, 11L, 11L, 14L, 3L, 10L, 4L, 0L, 8L, 5L, 3L, 6L, 14L, 6L, 
2L, 5L, 10L, 13L, 2L, 9L, 10L, 12L, 6L, 3L, 10L, 2L, 9L, 9L, 
8L, 0L, 9L, 9L, 0L, 9L, 8L, 8L, 6L, 14L, 10L, 13L), fSeason = structure(c(2L, 
1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 
1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 
1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
2L, 2L, 2L), levels = c("DRY", "WET"), class = "factor"), fSite = structure(c(18L, 
33L, 27L, 15L, 47L, 8L, 3L, 29L, 34L, 6L, 44L, 8L, 11L, 14L, 
4L, 43L, 19L, 39L, 15L, 47L, 24L, 36L, 19L, 39L, 39L, 33L, 16L, 
35L, 40L, 11L, 42L, 31L, 40L, 25L, 6L, 35L, 38L, 16L, 34L, 42L, 
47L, 9L, 27L, 11L, 29L, 35L, 36L, 46L, 11L, 21L, 1L, 31L, 10L, 
32L, 6L, 22L, 42L, 25L, 44L, 28L, 15L, 20L, 8L, 14L, 23L, 6L, 
8L, 32L, 2L, 45L, 31L, 20L, 7L, 37L, 46L, 39L, 36L, 35L, 4L, 
1L, 19L, 16L, 2L, 4L, 23L, 20L, 10L, 27L, 15L, 34L, 40L, 8L, 
13L, 26L, 39L, 43L, 34L, 39L, 22L, 46L), levels = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", 
"15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", 
"26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", 
"37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47"
), class = "factor"), area_sampled = c(3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), num = c(0L, 
1L, 0L, 0L, 2L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 2L, 0L, 0L, 2L, 0L, 
0L, 0L, 0L, 0L, 3L, 0L, 19L, 3L, 2L, 0L, 0L, 0L, 0L, 1L, 1L, 
2L, 0L, 0L, 0L, 0L, 0L, 1L, 2L, 1L, 2L, 0L, 0L, 1L, 0L, 1L, 0L, 
3L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 
6L, 0L, 0L, 1L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 
1L, 0L, 0L, 0L, 2L, 1L, 0L, 1L, 0L, 1L, 0L, 3L, 0L, 0L, 0L, 1L, 
0L, 2L, 2L, 1L), sal = c(4.78, 18.55, 19.72, 25.08, 16.48, 15.47, 
24.88, 15.12, 25.69, 23.64, 33.58, 24.37, 31.77, 28.98, 21.14, 
25.37, 20, 23.03, 27.01, 27.46, 8.06, 29.7, 27.44, 30.41, 21.13, 
22.47, 16.8, 23.54, 37.1, 30.16, 29.26, 22.08, 31.68, 10.56, 
19.1, 14.73, 29.34, 17.77, 20.07, 32.71, 30.47, 27.76, 11.26, 
31.8, 26.85, 10.79, 17.31, 26.85, 20.62, 26.58, 30.46, 23.9, 
23.01, 22.3, 20.49, 18.19, 29.74, 21.27, 23.74, 25.42, 23.46, 
26.33, 30.99, 29.99, 29.77, 21, 29, 21.37, 15.83, 32.36, 33.52, 
24.76, 28.48, 17.23, 33.79, 19.89, 2.79, 21.75, 29.98, 15.18, 
13.48, 29.94, 27.39, 37.13, 4.04, 21.7, 11.29, 6.3, 11.37, 21.44, 
19.82, 10.03, 22.73, 21.73, 8.76, 13.49, 18.21, 35.25, 4.15, 
25.88)), row.names = c(268512L, 258208L, 329600L, 387560L, 312120L, 
309912L, 253424L, 298136L, 109720L, 264096L, 75128L, 381304L, 
341008L, 290224L, 128304L, 383880L, 373760L, 198776L, 235208L, 
65376L, 258576L, 238336L, 203192L, 245328L, 89664L, 142104L, 
134560L, 202088L, 227112L, 393080L, 296112L, 65744L, 110824L, 
258760L, 195280L, 187736L, 110456L, 211656L, 299056L, 199328L, 
337144L, 335120L, 266304L, 204664L, 200984L, 210552L, 220488L, 
172096L, 83040L, 323528L, 62800L, 375968L, 379280L, 173200L, 
254160L, 107512L, 152592L, 68872L, 384064L, 132720L, 252136L, 
323344L, 317456L, 395840L, 164552L, 309360L, 177248L, 65928L, 
271456L, 199880L, 162344L, 231344L, 386088L, 220672L, 138792L, 
207792L, 310096L, 376704L, 122048L, 275320L, 299976L, 342296L, 
225272L, 166024L, 315248L, 120392L, 275504L, 282864L, 267960L, 
66296L, 286544L, 276608L, 83408L, 288200L, 256736L, 264832L, 
220120L, 402096L, 315064L, 378912L), class = "data.frame")

Solution

  • I think you'll need to speak to Simon Wood about this one; the warning is raised in his mgcv::gam.vcomp(), which gratia calls (all variance_comp() does is put some sugar on top of the hard work mgcv does; returns the output as a tibble, adds the variance as a column).

    I think there is something weird going on. The warnings come from these two lines in gam.vcomp:

    ll <- lsd - crit * sd.lsd
    ul <- lsd + crit * sd.lsd
    

    and the problem comes from the sd.lsd vector. This isn't of length 4:

    > lsd
               s(sal) s(CYR.std,fSite)1 s(CYR.std,fSite)2 s(CYR.std,fSite)3 
           -3.4839424        -2.8555567        -0.4410871        -1.3827881
    > sd.lsd
    [1] 2.5010946 0.3471288 1.1410316
    

    and this is because the hessian from the outer iteration is not of the same rank when fitted with bam() as it is with gam(), and I think the current code either assumes something that is gam()-specific or fitting with bam() is highlight a potential issue with your data/model. For example, fitting with discrete = FALSE (on the subset you provided) resulted in a warning that something didn't converge (even though the outer iteration seemingly converged).

    In this model, fitted with gam(), the hessian (of the smoothing parameters) is a 5x5 matrix because the outer iteration is also over $\theta$ parameter for the negative binomial. The code in gam.vcomp() checks to see if there are any extra parameters outwith the smoothing parameters being estimated and if there are it drops those columns from the hessian. In the case of the bam() fit, the hessian is actually a 4x4 matrix - which seems odd - even though there should be a $\theta$ being estimated.

    I think that there's a column missing with bam() is actually telling you something about the model; it looks like it is over-fitted to the data. If I look at gam.vcomp() on the model fitted with gam() I see:

    > gam.vcomp(m1) 
    
    Standard deviations and 0.95 confidence intervals:
    
                           std.dev         lower         upper
    s(sal)            0.0102080074  2.083020e-06  5.002517e+01
    s(CYR.std,fSite)1 0.0008840011  1.561116e-82  5.005766e+75
    s(CYR.std,fSite)2 0.6726926834  3.229118e-01  1.401359e+00
    s(CYR.std,fSite)3 0.0008083213 1.717862e-201 3.803468e+194
    
    Rank: 4/4
    

    where at least two of the parameters have effectively 0 variance and a confidence interval that goes from (effectively) 0 to (again effectively) infinity, which is a pretty strong indication that you are asking too much from this model and data combination.

    The smooths estimated by s(CYR.std,fSite) are linear when fitted with gam():

    enter image description here

    while with bam() with discrete set to TRUE we see

    enter image description here

    and with discrete = FALSE we see

    enter image description here

    but I doubt I'd believe the two latter fits, especially in light of the model with discrete = FALSE emitting a convergence failure warning, and the problems highlighted with gam.vcomp() on the gam() version.

    While I think this is ultimately a data issue — you're asking too much from a these data — there is a problem with gam.vcomp() as it is dropping a column from the hessian for the theta but the hessian is 1 rank lower than it should be. And I think it should be 5x5 because there is $\theta$, 1 smoothing parameter for s(sal) and three for s(CYR.std, fSite, bs = "fs"). But I'm not familiar enough with bam() to know why it is returning a smaller hessian than gam(). Unless it is the result of the rank-deficiency of the model given the data causing some problem earlier in the chain — bam() does a lot of fancy matrix operations so something could be getting lost in one of those operations that don't happen when fitting with gam().

    When you contact Simon, feel free to link to this Q&A, but do send him the subset of data as a CSV file and the complete code needed to load the data and fit the model, raise the warning. You want a minimal working example:

    # load mgcv
    library("mgcv")
    # load data either via a read.csv() call or your `structure()`
    # !!!! Note I have truncated it here for this example
    x <- structure(list(fCYR = structure(c(9L, 9L, 12L, 15L, 11L, 11L, 
    8L, 11L, 2L, 9L, 1L, 14L, 13L, 10L, 3L, 15L, 14L, 6L, 7L, 1L,
    ....
    66296L, 286544L, 276608L, 83408L, 288200L, 256736L, 264832L, 
    220120L, 402096L, 315064L, 378912L), class = "data.frame")
    
    # fit model
    m_bam <- bam(num ~ s(sal) + CYR.std*fSeason + s(CYR.std, fSite, bs = "fs"), 
              offset(log(area_sampled)),
              method = "fREML",
              discrete = TRUE,
              family = nb(link = "log"),
              data = x)
    
    # variance comps, raises warning
    gam.vcomp(m_bam)
    
    # but with `gam()` everything OK
    m_gam <- bam(num ~ s(sal) + CYR.std*fSeason + s(CYR.std, fSite, bs = "fs"), 
              offset(log(area_sampled)),
              method = "REML",
              family = nb(link = "log"),
              data = x)
    gam.vcomp(m_gam)
    
    # note different-sized hessians
    # bam()
    m_bam$outer.info$hess
    # gam()
    m_gam$outer.info$hess
    

    Show the pertinent output in your message (that from gam.vcomp() and the hessians, plus the bam() call) but make sure Simon and just run the code himself (so don't interleve output).

    I will admit to getting very confused about your modelling process with all your questions. It seems like non of your questions fit the same model structure, yet are seemingly being fitted to the same data. I'm a little worried you are torturing these data to the point of death and their death throws are higlighting issues. Do you really need FS smooths in the full data?