A variance function available within an R package is not found by the predict function when making predictions from a previously constructed gam object (mgcv 1.8-31).
The R package is supposed to (among other things) predict from gam objects. All of the previously constructed models use the gaulss family and have their own variance function. Some variance functions are just a linear effect of a variable; others use more sophisticated custom functions. The models and variance functions were stored in a 'sysdata.rda' file to include them in the package. The package was documented with devtools and roxygen2.
Consider the following minimal example of two GAMs:
library(mgcv)
set.seed(123)
var.fun <- function(x){x^2}
x <- runif(100)
y <- x + rnorm(100, 0, var.fun(x))
mod.gam.1 <- gam(formula = list(y ~ x,
                                ~ var.fun(x)),
                 family = gaulss(link = list("log", "logb")))
mod.gam.2 <- gam(formula = list(y ~ x,
                                ~ I(x^2)),
                 family = gaulss(link = list("log", "logb")))
The first model uses a custom variance function. The second model has the variance formula hardcoded in the model call.
The models and the variance function are then stored in a 'sysdata.rda' file, which is included in a package named "gamvarfun" (I know, naming things...), as follows:
save(mod.gam.1, var.fun, mod.gam.2,
     file = "~/gamvarfun/R/sysdata.rda")
Now two functions are added to the package to retrieve predictions from the corresponding models:
pred.fun.1 <- function(x){
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}
pred.fun.2 <- function(x){
  predict(mod.gam.2,
          newdata = data.frame("x" = x))
}
To illustrate the problem I've uploaded a very rudimentary demo package to GitHub, so the following code should work:
library(devtools)
install_github("jan-schick/gam_issue")
library(gamvarfun)
pred.fun.2(1)
# 0.3135275 0.5443409
pred.fun.1(1)
# Error in var.fun(x) : could not find function "var.fun"
# Writing var.fun to global environment
var.fun <- gamvarfun:::var.fun
pred.fun.1(1)
# 0.3135275 0.5443409
When using pred.fun.1 (which uses the custom variance function), an error is displayed. However, pred.fun.2 (hardcoded variance formula) works perfectly well. The error does not occur when devtools::load_all() is used instead of a 'proper' install of the package. I suspected that the problem is due to different environments being used by predict.gam. I tested this assumption by writing the custom variance function to the global environment before calling pred.fun.1 (see above), which worked. However, this is obviously no solution for a package.
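Since a formula carries a reference to the environment in which it was created, the hypothesis can be checked directly. A small diagnostic sketch (assuming, as in mgcv's gam objects, that the fitted object keeps the original formulas in $formula and the prediction formula in $pred.formula):
environment(gamvarfun:::mod.gam.1$formula[[2]]) # environment of ~ var.fun(x)
# <environment: R_GlobalEnv>
environment(gamvarfun:::mod.gam.1$pred.formula) # environment used by predict.gam
# <environment: R_GlobalEnv>
Both point at the global environment of the running session, so predict() looks for var.fun there (and up its search path) rather than in the package namespace.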
In earlier attempts I tried to declare the function inside the package, e.g. by placing the code written above directly inside the prediction functions, which also didn't work when the package was installed:
pred.fun.1 <- function(x){
  var.fun <- function(x){x^2}
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}
I've also tried with() and attach() at the same place (inside the prediction functions) without any success.
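For illustration, the with() attempt might have looked like this (a hypothetical reconstruction, not code from the package). It cannot work because predict.gam evaluates the variance formula in the environment stored with the formula, not in the data mask that with() supplies:
pred.fun.1 <- function(x){
  # hypothetical attempt: expose var.fun through a data mask -- no effect,
  # because the formula's stored environment is searched instead
  with(list(var.fun = function(x){x^2}),
       predict(mod.gam.1, newdata = data.frame("x" = x)))
}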
The only working solution I found was to export the variance functions and make them part of the package namespace/API (sketched below). This is not a feasible solution in this case, as it would lead to numerous visible variance functions that have no practical benefit for the user.
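For completeness, that exporting approach is just the usual roxygen2 route, roughly:
#' Variance function used by mod.gam.1
#' @export
var.fun <- function(x){x^2}
This works because the formula's environment (the global environment) inherits from all attached packages, so an exported var.fun is found on the search path once the package is attached with library().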
Then there is the obvious workaround: replacing the variance functions with the original formulas in the model call, i.e. using mod.gam.2 instead of mod.gam.1. However, this is not a proper solution either.
Various search engines and colleagues have been consulted to no avail. Thus, I would be grateful for any hints on how to tackle this problem.
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gamvarfun_0.1.0 usethis_1.5.0 devtools_2.0.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 rstudioapi_0.10 magrittr_1.5 splines_3.6.2
[5] pkgload_1.0.2 lattice_0.20-38 R6_2.4.0 rlang_0.4.0
[9] tools_3.6.2 grid_3.6.2 pkgbuild_1.0.3 nlme_3.1-143
[13] mgcv_1.8-31 sessioninfo_1.1.1 cli_1.1.0 withr_2.1.2
[17] remotes_2.0.4 assertthat_0.2.1 digest_0.6.20 rprojroot_1.3-2
[21] crayon_1.3.4 Matrix_1.2-18 processx_3.3.1 callr_3.2.0
[25] fs_1.3.1 ps_1.3.0 curl_4.0 testthat_2.1.1
[29] memoise_1.1.0 glue_1.3.1 compiler_3.6.2 desc_1.2.0
[33] backports_1.1.4 prettyunits_1.0.2
The problem is that formulas have environments associated with them, and you are saving the association with globalenv(), but putting the function somewhere else.
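The mechanism is easy to demonstrate without mgcv. A formula remembers where it was created, and save()/load() restores a global-environment reference as the loading session's global environment, not as a snapshot of the original one:
var.fun <- function(x){x^2}
f <- ~ var.fun(x)
environment(f)
# <environment: R_GlobalEnv>
save(f, file = "f.rda")
# A fresh session that load()s "f.rda" gets a formula whose environment is
# its own global environment, where var.fun does not exist.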
This isn't easy to fix if you stick with sysdata.rda, because the mod.gam.1 object copies that environment into multiple locations. It's not enough to patch environment(mod.gam.1$formula).
I think the only feasible approach is to include the source that produced the mod.gam.1 object within the package source. If the formulas are defined in a context that inherits from the package namespace, then the variance function defined there will be found.
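A sketch of what that could look like, using a hypothetical file R/models.R that builds the model when the package is built, so the formulas capture an environment inheriting from the namespace (as the edits below explain, with mgcv 1.8-31 this alone turned out not to be enough):
# R/models.R (hypothetical layout)
var.fun <- function(x){x^2}  # internal helper, deliberately not exported

mod.gam.1 <- local({
  set.seed(123)
  x <- runif(100)
  y <- x + rnorm(100, 0, var.fun(x))
  mgcv::gam(formula = list(y ~ x,
                           ~ var.fun(x)),
            family = mgcv::gaulss(link = list("log", "logb")))
})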
Edited to add: I tried this strategy, and couldn't get it to work. It looks as though there's a bug in the mgcv package, specifically in the code that interprets the special dual formula that the gaulss model uses. I'll see if I can find a workaround.
Second edit: Running the code below appears to fix the functions in mgcv that have the bug. The patched versions are based on mgcv version 1.8-31; I don't recommend running this script unless you have exactly that version installed. I've sent a message to the package maintainer; maybe these fixes will be incorporated into a future release.
interpret.gam0 <- function(gf,textra=NULL,extra.special=NULL)
# interprets a gam formula of the generic form:
# y~x0+x1+x3*x4 + s(x5)+ s(x6,x7) ....
# and returns:
# 1. a model formula for the parametric part: pf (and pfok indicating whether it has terms)
# 2. a list of descriptors for the smooths: smooth.spec
# this function does the work, and is called by interpret.gam
# 'textra' is optional text to add to term labels
# 'extra.special' is label of extra smooth within formula.
{ p.env <- environment(gf) # environment of formula
tf <- terms.formula(gf,specials=c("s","te","ti","t2",extra.special)) # specials attribute indicates which terms are smooth
terms <- attr(tf,"term.labels") # labels of the model terms
nt <- length(terms) # how many terms?
if (attr(tf,"response") > 0) { # start the replacement formulae
response <- as.character(attr(tf,"variables")[2])
} else {
response <- NULL
}
sp <- attr(tf,"specials")$s # array of indices of smooth terms
tp <- attr(tf,"specials")$te # indices of tensor product terms
tip <- attr(tf,"specials")$ti # indices of tensor product pure interaction terms
t2p <- attr(tf,"specials")$t2 # indices of type 2 tensor product terms
zp <- if (is.null(extra.special)) NULL else attr(tf,"specials")[[extra.special]]
off <- attr(tf,"offset") # location of offset in formula
## have to translate sp, tp, tip, t2p (zp) so that they relate to terms,
## rather than elements of the formula...
vtab <- attr(tf,"factors") # cross tabulation of vars to terms
if (length(sp)>0) for (i in 1:length(sp)) {
ind <- (1:nt)[as.logical(vtab[sp[i],])]
sp[i] <- ind # the term that smooth relates to
}
if (length(tp)>0) for (i in 1:length(tp)) {
ind <- (1:nt)[as.logical(vtab[tp[i],])]
tp[i] <- ind # the term that smooth relates to
}
if (length(tip)>0) for (i in 1:length(tip)) {
ind <- (1:nt)[as.logical(vtab[tip[i],])]
tip[i] <- ind # the term that smooth relates to
}
if (length(t2p)>0) for (i in 1:length(t2p)) {
ind <- (1:nt)[as.logical(vtab[t2p[i],])]
t2p[i] <- ind # the term that smooth relates to
}
if (length(zp)>0) for (i in 1:length(zp)) {
ind <- (1:nt)[as.logical(vtab[zp[i],])]
zp[i] <- ind # the term that smooth relates to
} ## re-referencing is complete
k <- kt <- kti <- kt2 <- ks <- kz <- kp <- 1 # counters for terms in the 2 formulae
len.sp <- length(sp)
len.tp <- length(tp)
len.tip <- length(tip)
len.t2p <- length(t2p)
len.zp <- length(zp)
ns <- len.sp + len.tp + len.tip + len.t2p + len.zp # number of smooths
pav <- av <- rep("",0)
smooth.spec <- list()
#mgcvat <- "package:mgcv" %in% search() ## is mgcv in search path?
mgcvns <- loadNamespace('mgcv')
if (nt) for (i in 1:nt) { # work through all terms
if (k <= ns&&((ks<=len.sp&&sp[ks]==i)||(kt<=len.tp&&tp[kt]==i)||(kz<=len.zp&&zp[kz]==i)||
(kti<=len.tip&&tip[kti]==i)||(kt2<=len.t2p&&t2p[kt2]==i))) { # it's a smooth
## have to evaluate in the environment of the formula or you can't find variables
## supplied as smooth arguments, e.g. k <- 5;gam(y~s(x,k=k)), fails,
## but if you don't specify namespace of mgcv then stuff like
## loadNamespace('mgcv'); k <- 10; mgcv::interpret.gam(y~s(x,k=k)) fails (can't find s)
## eval(parse(text=terms[i]),envir=p.env,enclos=loadNamespace('mgcv')) fails??
## following may supply namespace of mgcv explicitly if not on search path...
## If 's' etc are masked then we can fail even if mgcv on search path, hence paste
## of explicit mgcv reference into first attempt...
st <- try(eval(parse(text=paste("mgcv::",terms[i],sep="")),envir=p.env),silent=TRUE)
if (inherits(st,"try-error")) st <-
eval(parse(text=terms[i]),enclos=p.env,envir=mgcvns)
if (!is.null(textra)) { ## modify the labels on smooths with textra
pos <- regexpr("(",st$lab,fixed=TRUE)[1]
st$label <- paste(substr(st$label,start=1,stop=pos-1),textra,
substr(st$label,start=pos,stop=nchar(st$label)),sep="")
}
smooth.spec[[k]] <- st
if (ks<=len.sp&&sp[ks]==i) ks <- ks + 1 else # counts s() terms
if (kt<=len.tp&&tp[kt]==i) kt <- kt + 1 else # counts te() terms
if (kti<=len.tip&&tip[kti]==i) kti <- kti + 1 else # counts ti() terms
if (kt2<=len.t2p&&t2p[kt2]==i) kt2 <- kt2 + 1 # counts t2() terms
else kz <- kz + 1
k <- k + 1 # counts smooth terms
} else { # parametric
av[kp] <- terms[i] ## element kp on rhs of parametric
kp <- kp+1 # counts parametric terms
}
}
if (!is.null(off)) { ## deal with offset
av[kp] <- as.character(attr(tf,"variables")[1+off])
kp <- kp+1
}
pf <- paste(response,"~",paste(av,collapse=" + "))
if (attr(tf,"intercept")==0) {
pf <- paste(pf,"-1",sep="")
if (kp>1) pfok <- 1 else pfok <- 0
} else {
pfok <- 1;if (kp==1) {
pf <- paste(pf,"1");
}
}
fake.formula <- pf
if (length(smooth.spec)>0)
for (i in 1:length(smooth.spec)) {
nt <- length(smooth.spec[[i]]$term)
ff1 <- paste(smooth.spec[[i]]$term[1:nt],collapse="+")
fake.formula <- paste(fake.formula,"+",ff1)
if (smooth.spec[[i]]$by!="NA") {
fake.formula <- paste(fake.formula,"+",smooth.spec[[i]]$by)
av <- c(av,smooth.spec[[i]]$term,smooth.spec[[i]]$by)
} else av <- c(av,smooth.spec[[i]]$term)
}
fake.formula <- as.formula(fake.formula,p.env)
if (length(av)) {
pred.formula <- as.formula(paste("~",paste(av,collapse="+")))
pav <- all.vars(pred.formula) ## trick to strip out 'offset(x)' etc...
pred.formula <- reformulate(pav)
environment(pred.formula) <- environment(gf)
} else pred.formula <- ~1
ret <- list(pf=as.formula(pf,p.env),pfok=pfok,smooth.spec=smooth.spec,
fake.formula=fake.formula,response=response,fake.names=av,
pred.names=pav,pred.formula=pred.formula)
class(ret) <- "split.gam.formula"
ret
} ## interpret.gam0
interpret.gam <- function(gf,extra.special=NULL) {
## wrapper to allow gf to be a list of formulae or
## a single formula. This facilitates general penalized
## likelihood models in which several linear predictors
## may be involved...
##
## The list syntax is as follows. The first formula must have a response on
## the lhs, rather than labels. For m linear predictors, there
## must be m 'base formulae' in linear predictor order. lhs labels will
## be ignored in a base formula. Empty base formulae have '-1' on rhs.
## Further formulae have labels up to m labels 1,...,m on the lhs, in a
## syntax like this: 3 + 5 ~ s(x), which indicates that the same s(x)
## should be added to both linear predictors 3 and 5.
## e.g. A bivariate normal model with common expected values might be
## list(y1~-1,y2~-1,1+2~s(x)), whereas if the second component was contaminated
## by something else we might have list(y1~-1,y2~s(v)-1,1+2~s(x))
##
## For a list argument, this routine returns a list of split.formula objects
## with an extra field "lpi" indicating the linear predictors to which each
## contributes...
if (is.list(gf)) {
d <- length(gf)
## make sure all formulae have a response, to avoid
## problems with parametric sub formulae of the form ~1
#if (length(gf[[1]])<3) stop("first formula must specify a response")
resp <- gf[[1]][2]
ret <- list()
pav <- av <- rep("",0)
nlp <- 0 ## count number of linear predictors (may be different from number of formulae)
for (i in 1:d) {
textra <- if (i==1) NULL else paste(".",i-1,sep="") ## modify smooth labels to identify to predictor
lpi <- getNumericResponse(gf[[i]]) ## get linear predictors to which this applies, if explicit
if (length(lpi)==1) warning("single linear predictor indices are ignored")
if (length(lpi)>0) gf[[i]][[2]] <- NULL else { ## delete l.p. labels from formula response
nlp <- nlp + 1;lpi <- nlp ## this is base formula for l.p. number nlp
}
ret[[i]] <- interpret.gam0(gf[[i]],textra,extra.special=extra.special)
ret[[i]]$lpi <- lpi ## record of the linear predictors to which this applies
## make sure all parametric formulae have a response, to avoid
## problems with parametric sub formulae of the form ~1
respi <- rep("",0) ## no extra response terms
if (length(ret[[i]]$pf)==2) {
ret[[i]]$pf[3] <- ret[[i]]$pf[2];ret[[i]]$pf[2] <- resp
respi <- rep("",0)
} else if (i>1) respi <- ret[[i]]$response ## extra response terms
av <- c(av,ret[[i]]$fake.names,respi) ## accumulate all required variable names
pav <- c(pav,ret[[i]]$pred.names) ## predictors only
}
av <- unique(av) ## strip out duplicate variable names
pav <- unique(pav)
if (length(av)>0) {
## work around - reformulate with response = "log(x)" will treat log(x) as a name,
## not the call it should be...
fff <- formula(paste(ret[[1]]$response,"~ ."))
ret$fake.formula <- reformulate(av,response=ret[[1]]$response)
environment(ret$fake.formula) <- environment(gf[[1]])
ret$fake.formula[[2]] <- fff[[2]] ## fix messed up response
} else ret$fake.formula <- ret[[1]]$fake.formula ## create fake formula containing all variables
ret$pred.formula <- if (length(pav)>0) reformulate(pav) else ~1 ## predictor only formula
environment(ret$pred.formula) <- environment(gf[[1]])
ret$response <- ret[[1]]$response
ret$nlp <- nlp ## number of linear predictors
for (i in 1:d) if (max(ret[[i]]$lpi)>nlp||min(ret[[i]]$lpi)<1) stop("linear predictor labels out of range")
class(ret) <- "split.gam.formula"
return(ret)
} else interpret.gam0(gf,extra.special=extra.special)
} ## interpret.gam
## Now some test code. First, install the patched functions into the mgcv namespace.
environment(interpret.gam) <- environment(mgcv::interpret.gam)
environment(interpret.gam0) <- environment(mgcv:::interpret.gam0)
assignInNamespace("interpret.gam", interpret.gam, "mgcv")
assignInNamespace("interpret.gam0", interpret.gam0, "mgcv")
set.seed(123)
mod.gam.1 <- local({
var.fun <- function(x){x^2}
x <- runif(100)
y <- x + rnorm(100, 0, var.fun(x))
gam(formula = list(y ~ x,
~ var.fun(x)),
family = gaulss(link = list("log", "logb")))
})
pred.fun.1 <- function(x){
predict(mod.gam.1,
newdata = data.frame("x" = x))
}
pred.fun.1(1)
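If the patched functions behave as intended, this final call returns predictions for both linear predictors instead of failing with 'could not find function "var.fun"': the formulas built inside local() carry the environment where var.fun is defined, and the patched interpret.gam propagates that environment to the derived formulas it constructs.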