Search code examples
rpackagepredictiongammgcv

Environmental problems while predicting from gaulss-gams with a custom variance function inside a package


A variance function available within an R package is not found by the predict function while doing predictions from a gam-object constructed previously (mgcv 1.8-31).

The R package shall (amongst other things) predict from gam-objects. All of the previously constructed models use the gaulss-family and have their own variance function. Some variance functions are just a linear effect of a variable, others use more sophisticated custom functions. The models and variance functions were stored in a 'sysdata.rda' file to include them in the package. The package was documented with devtools and roxygen2.

Consider the following minimal example of two GAM's:

library(mgcv)

set.seed(123)

var.fun <- function(x){x^2}

x <- runif(100)
y <- x + rnorm(100, 0, var.fun(x))

mod.gam.1 <- gam(formula = list(y ~ x,
                                ~ var.fun(x)),
                 family = gaulss(link = list("log", "logb")))

mod.gam.2 <- gam(formula = list(y ~ x,
                                ~ I(x^2)),
                 family = gaulss(link = list("log", "logb")))

The first model uses a custom variance function. The second model has the variance formula hardcoded in the model call.

The models and the variance function are then stored in a 'sysdata.rda' file, which is included in a package named "gamvarfun" (I know, naming things... ), as follows:

save(mod.gam.1, var.fun, mod.gam.2,
     file = "~/gamvarfun/R/sysdata.rda")

Now two functions are added to the package to retrieve predictions from the corresponding models:

pred.fun.1 <- function(x){
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}

pred.fun.2 <- function(x){
  predict(mod.gam.2,
          newdata = data.frame("x" = x))
}

To illustrate the problem I've uploaded a very rudimentary demo-package to github, so the following code should work:

library(devtools)
install_github("jan-schick/gam_issue")
library(gamvarfun)

pred.fun.2(1)
# 0.3135275 0.5443409

pred.fun.1(1)
# Error in var.fun(x) : could not find function "var.fun"

# Writing var.fun to global environment
var.fun <- gamvarfun:::var.fun
pred.fun.1(1)
# 0.3135275 0.5443409

When using pred.fun.1 (containing a custom variance function), an error is displayed. However, pred.fun.2 (hardcoded variance function) works perfectly well. This error does not occur, when devtools::load_all() is used instead of a 'proper' install of the package. I suspected that the problem is due to different environments when using predict.gam. I tested this assumption by writing the custom variance function to the global environment before calling pred.fun.1 (see above), which worked. However, this is obviously no solution for a package.

In earlier attempts I tried to declare the function inside the package, e.g. by placing the code written above directly inside the prediction functions. Which also didn't work when the package was installed.

pred.fun.1 <- function(x){
  var.fun <- function(x){x^2}
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}

I've also tried with and attach at the same place (inside the prediction functions) without any success.

The only working solution I found, was to export the variance functions and make it part of the package namespace/API. This is no feasible solution in this case as it would lead to numerous visible variance functions, which have no practical benefit for the user.

Then there is the obvious workaround: replacing the variance functions by the original formulas in the model call, i.e. using mod.gam.2 instead of mod.gam.1. However, this is not a proper solution either.

Various search engines and colleagues have been consulted to no avail.

Thus, I would be grateful for any hints how to tackle this problem

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] gamvarfun_0.1.0 usethis_1.5.0   devtools_2.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2        rstudioapi_0.10   magrittr_1.5      splines_3.6.2
 [5] pkgload_1.0.2     lattice_0.20-38   R6_2.4.0          rlang_0.4.0
 [9] tools_3.6.2       grid_3.6.2        pkgbuild_1.0.3    nlme_3.1-143
[13] mgcv_1.8-31       sessioninfo_1.1.1 cli_1.1.0         withr_2.1.2
[17] remotes_2.0.4     assertthat_0.2.1  digest_0.6.20     rprojroot_1.3-2
[21] crayon_1.3.4      Matrix_1.2-18     processx_3.3.1    callr_3.2.0
[25] fs_1.3.1          ps_1.3.0          curl_4.0          testthat_2.1.1
[29] memoise_1.1.0     glue_1.3.1        compiler_3.6.2    desc_1.2.0
[33] backports_1.1.4   prettyunits_1.0.2

Solution

  • The problem is that formulas have environments associated with them, and you are saving the association with globalenv(), but putting the function somewhere else.

    This isn't easy to fix if you stick with sysdata.rda, because the mod.gam.1 object copies that environment into multiple locations. It's not enough to patch environment(mod.gam.1$formula).

    I think the only feasible approach is to include the source that produced the mod.gam.1 object within the package source. If the formulas are defined in a context that inherits from the package namespace, then the variance function defined there will be found.

    Edited to add: I tried this strategy, and couldn't get it to work. It looks as though there's a bug in the mgcv package, specifically in the code that interprets the special dual formula that the gualss model uses. I'll see if I can find a workaround.

    Second edit: Running the code below appears to fix the functions in mgcv that have the bug. They are based on mgcv version 1.8-31. I don't recommend running this script unless you have exactly that version installed. I've sent a message to the package maintainer; maybe these will be incorporated into a future release.

    interpret.gam0 <- function(gf,textra=NULL,extra.special=NULL)
    # interprets a gam formula of the generic form:
    #   y~x0+x1+x3*x4 + s(x5)+ s(x6,x7) ....
    # and returns:
    # 1. a model formula for the parametric part: pf (and pfok indicating whether it has terms)
    # 2. a list of descriptors for the smooths: smooth.spec
    # this is function does the work, and is called by in interpret.gam
    # 'textra' is optional text to add to term labels
    # 'extra.special' is label of extra smooth within formula.
    { p.env <- environment(gf) # environment of formula
      tf <- terms.formula(gf,specials=c("s","te","ti","t2",extra.special)) # specials attribute indicates which terms are smooth
    
      terms <- attr(tf,"term.labels") # labels of the model terms 
      nt <- length(terms) # how many terms?
    
      if (attr(tf,"response") > 0) {  # start the replacement formulae
        response <- as.character(attr(tf,"variables")[2])
      } else { 
        response <- NULL
      }
      sp <- attr(tf,"specials")$s     # array of indices of smooth terms 
      tp <- attr(tf,"specials")$te    # indices of tensor product terms
      tip <- attr(tf,"specials")$ti   # indices of tensor product pure interaction terms
      t2p <- attr(tf,"specials")$t2   # indices of type 2 tensor product terms
      zp <- if (is.null(extra.special)) NULL else attr(tf,"specials")[[extra.special]]
      off <- attr(tf,"offset") # location of offset in formula
    
      ## have to translate sp, tp, tip, t2p (zp) so that they relate to terms,
      ## rather than elements of the formula...
      vtab <- attr(tf,"factors") # cross tabulation of vars to terms
      if (length(sp)>0) for (i in 1:length(sp)) {
        ind <- (1:nt)[as.logical(vtab[sp[i],])]
        sp[i] <- ind # the term that smooth relates to
      }
      if (length(tp)>0) for (i in 1:length(tp)) {
        ind <- (1:nt)[as.logical(vtab[tp[i],])]
        tp[i] <- ind # the term that smooth relates to
      } 
      if (length(tip)>0) for (i in 1:length(tip)) {
        ind <- (1:nt)[as.logical(vtab[tip[i],])]
        tip[i] <- ind # the term that smooth relates to
      } 
      if (length(t2p)>0) for (i in 1:length(t2p)) {
        ind <- (1:nt)[as.logical(vtab[t2p[i],])]
        t2p[i] <- ind # the term that smooth relates to
      }
      if (length(zp)>0) for (i in 1:length(zp)) {
        ind <- (1:nt)[as.logical(vtab[zp[i],])]
        zp[i] <- ind # the term that smooth relates to
      } ## re-referencing is complete
    
      k <- kt <- kti <- kt2 <- ks <- kz <- kp <- 1 # counters for terms in the 2 formulae
      len.sp <- length(sp)
      len.tp <- length(tp)
      len.tip <- length(tip)
      len.t2p <- length(t2p)
      len.zp <- length(zp)
      ns <- len.sp + len.tp + len.tip + len.t2p + len.zp# number of smooths
      pav <- av <- rep("",0)
      smooth.spec <- list()
      #mgcvat <- "package:mgcv" %in% search() ## is mgcv in search path?
      mgcvns <- loadNamespace('mgcv')
      if (nt) for (i in 1:nt) { # work through all terms
        if (k <= ns&&((ks<=len.sp&&sp[ks]==i)||(kt<=len.tp&&tp[kt]==i)||(kz<=len.zp&&zp[kz]==i)||
                      (kti<=len.tip&&tip[kti]==i)||(kt2<=len.t2p&&t2p[kt2]==i))) { # it's a smooth
          ## have to evaluate in the environment of the formula or you can't find variables 
          ## supplied as smooth arguments, e.g. k <- 5;gam(y~s(x,k=k)), fails,
          ## but if you don't specify namespace of mgcv then stuff like 
          ## loadNamespace('mgcv'); k <- 10; mgcv::interpret.gam(y~s(x,k=k)) fails (can't find s)
          ## eval(parse(text=terms[i]),envir=p.env,enclos=loadNamespace('mgcv')) fails??
          ## following may supply namespace of mgcv explicitly if not on search path...
          ## If 's' etc are masked then we can fail even if mgcv on search path, hence paste
          ## of explicit mgcv reference into first attempt...
    
          st <- try(eval(parse(text=paste("mgcv::",terms[i],sep="")),envir=p.env),silent=TRUE)
          if (inherits(st,"try-error")) st <- 
                eval(parse(text=terms[i]),enclos=p.env,envir=mgcvns)
    
          if (!is.null(textra)) { ## modify the labels on smooths with textra
            pos <- regexpr("(",st$lab,fixed=TRUE)[1]
            st$label <- paste(substr(st$label,start=1,stop=pos-1),textra,
                        substr(st$label,start=pos,stop=nchar(st$label)),sep="")
          }
          smooth.spec[[k]] <- st
          if (ks<=len.sp&&sp[ks]==i) ks <- ks + 1 else # counts s() terms
          if (kt<=len.tp&&tp[kt]==i) kt <- kt + 1 else # counts te() terms
          if (kti<=len.tip&&tip[kti]==i) kti <- kti + 1 else # counts ti() terms
          if (kt2<=len.t2p&&t2p[kt2]==i) kt2 <- kt2 + 1 # counts t2() terms
          else kz <- kz + 1
          k <- k + 1      # counts smooth terms 
        } else {          # parametric
          av[kp] <- terms[i] ## element kp on rhs of parametric
          kp <- kp+1    # counts parametric terms
        }
      }    
      if (!is.null(off)) { ## deal with offset 
        av[kp] <- as.character(attr(tf,"variables")[1+off])
        kp <- kp+1          
      }
    
      pf <- paste(response,"~",paste(av,collapse=" + "))
      if (attr(tf,"intercept")==0) {
        pf <- paste(pf,"-1",sep="")
        if (kp>1) pfok <- 1 else pfok <- 0
      } else { 
        pfok <- 1;if (kp==1) { 
          pf <- paste(pf,"1"); 
        }
      }
    
      fake.formula <- pf
    
      if (length(smooth.spec)>0) 
      for (i in 1:length(smooth.spec)) {
        nt <- length(smooth.spec[[i]]$term)
        ff1 <- paste(smooth.spec[[i]]$term[1:nt],collapse="+")
        fake.formula <- paste(fake.formula,"+",ff1)
        if (smooth.spec[[i]]$by!="NA") {
          fake.formula <- paste(fake.formula,"+",smooth.spec[[i]]$by)
          av <- c(av,smooth.spec[[i]]$term,smooth.spec[[i]]$by)
        } else av <- c(av,smooth.spec[[i]]$term)
      }
      fake.formula <- as.formula(fake.formula,p.env)
      if (length(av)) {
        pred.formula <- as.formula(paste("~",paste(av,collapse="+")))
        pav <- all.vars(pred.formula) ## trick to strip out 'offset(x)' etc...
        pred.formula <- reformulate(pav) 
      environment(pred.formula) <- environment(gf)
      } else  pred.formula <- ~1
      ret <- list(pf=as.formula(pf,p.env),pfok=pfok,smooth.spec=smooth.spec,
                fake.formula=fake.formula,response=response,fake.names=av,
                pred.names=pav,pred.formula=pred.formula)
      class(ret) <- "split.gam.formula"
      ret
    } ## interpret.gam0
    
    interpret.gam <- function(gf,extra.special=NULL) {
    ## wrapper to allow gf to be a list of formulae or 
    ## a single formula. This facilitates general penalized 
    ## likelihood models in which several linear predictors 
    ## may be involved...
    ##
    ## The list syntax is as follows. The first formula must have a response on
    ## the lhs, rather than labels. For m linear predictors, there 
    ## must be m 'base formulae' in linear predictor order. lhs labels will 
    ## be ignored in a base formula. Empty base formulae have '-1' on rhs.
    ## Further formulae have labels up to m labels 1,...,m on the lhs, in a 
    ## syntax like this: 3 + 5 ~ s(x), which indicates that the same s(x) 
    ## should be added to both linear predictors 3 and 5. 
    ## e.g. A bivariate normal model with common expected values might be
    ## list(y1~-1,y2~-1,1+2~s(x)), whereas if the second component was contaminated 
    ## by something else we might have list(y1~-1,y2~s(v)-1,1+2~s(x)) 
    ## 
    ## For a list argument, this routine returns a list of split.formula objects 
    ## with an extra field "lpi" indicating the linear predictors to which each 
    ## contributes...
      if (is.list(gf)) {
        d <- length(gf)
    
        ## make sure all formulae have a response, to avoid
        ## problems with parametric sub formulae of the form ~1
        #if (length(gf[[1]])<3) stop("first formula must specify a response")
        resp <- gf[[1]][2]
    
        ret <- list()
        pav <- av <- rep("",0)
        nlp <- 0 ## count number of linear predictors (may be different from number of formulae)
        for (i in 1:d) {
          textra <- if (i==1) NULL else paste(".",i-1,sep="") ## modify smooth labels to identify to predictor  
    
          lpi <- getNumericResponse(gf[[i]]) ## get linear predictors to which this applies, if explicit
          if (length(lpi)==1) warning("single linear predictor indices are ignored")
          if (length(lpi)>0) gf[[i]][[2]] <- NULL else { ## delete l.p. labels from formula response 
            nlp <- nlp + 1;lpi <- nlp ## this is base formula for l.p. number nlp       
          }
          ret[[i]] <- interpret.gam0(gf[[i]],textra,extra.special=extra.special)
          ret[[i]]$lpi <- lpi ## record of the linear predictors to which this applies
    
          ## make sure all parametric formulae have a response, to avoid
          ## problems with parametric sub formulae of the form ~1
          respi <- rep("",0) ## no extra response terms
          if (length(ret[[i]]$pf)==2) { 
            ret[[i]]$pf[3] <- ret[[i]]$pf[2];ret[[i]]$pf[2] <- resp
            respi <- rep("",0)
          } else if (i>1) respi <- ret[[i]]$response ## extra response terms
          av <- c(av,ret[[i]]$fake.names,respi) ## accumulate all required variable names 
          pav <- c(pav,ret[[i]]$pred.names) ## predictors only 
        } 
        av <- unique(av) ## strip out duplicate variable names
        pav <- unique(pav)
        if (length(av)>0) {
          ## work around - reformulate with response = "log(x)" will treat log(x) as a name,
          ## not the call it should be... 
          fff <- formula(paste(ret[[1]]$response,"~ ."))
          ret$fake.formula <- reformulate(av,response=ret[[1]]$response) 
          environment(ret$fake.formula) <- environment(gf[[1]]) 
          ret$fake.formula[[2]] <- fff[[2]] ## fix messed up response
        } else ret$fake.formula <- ret[[1]]$fake.formula ## create fake formula containing all variables
        ret$pred.formula <- if (length(pav)>0) reformulate(pav) else ~1 ## predictor only formula
        environment(ret$pred.formula) <- environment(gf[[1]])
        ret$response <- ret[[1]]$response 
        ret$nlp <- nlp ## number of linear predictors
        for (i in 1:d) if (max(ret[[i]]$lpi)>nlp||min(ret[[i]]$lpi)<1) stop("linear predictor labels out of range")
        class(ret) <- "split.gam.formula"
        return(ret)
      } else interpret.gam0(gf,extra.special=extra.special)  
    } ## interpret.gam
    
    
    ## Now some test code.
    
    environment(interpret.gam) <- environment(mgcv::interpret.gam)
    environment(interpret.gam0) <- environment(mgcv:::interpret.gam0)
    assignInNamespace("interpret.gam", interpret.gam, "mgcv")
    assignInNamespace("interpret.gam0", interpret.gam0, "mgcv")
    
    set.seed(123)
    
    mod.gam.1 <- local({
        var.fun <- function(x){x^2}
    
        x <- runif(100)
        y <- x + rnorm(100, 0, var.fun(x))
    
        gam(formula = list(y ~ x,
                                             ~ var.fun(x)),
                family = gaulss(link = list("log", "logb")))
    })
    
    pred.fun.1 <- function(x){
        predict(mod.gam.1,
                      newdata = data.frame("x" = x))
    }
    
    pred.fun.1(1)