Search code examples
rfunctionloopslme4

R commands behave incorrectly when converted to a function: failed evaluation of variable in for loop


I have the following reproducible example, which calculates a simple neighbourhood cross validation on a mixed regression model. As shown, if I turn the group of commands into a function, they behave incorrectly, with a fitted lmer model not updating at each for loop instance, instead always taking the value of the final for loop instance.

This seems to be an R programming quirk, and I'd be curious to know what's happening as it's something that could easily slip under the radar in future.

Reproducible example:

library(lme4)

set.seed(2002)
dataf <- data.frame( x = rnorm(15), p = rep(1:5, each = 3), rp = rep(rnorm(5),each = 3))
dataf$y <- rnorm(15, dataf$x + dataf$rp)
mod <- lmer( y ~ x + (1|p), data = dataf) 

# Inner commands for function:
dataf <- getData(mod)
pvec <- unique(dataf$p)
r2vec  <- numeric(0)
for ( pnow in pvec ) { 
  selp <- dataf$p == pnow
  newmod <- update(mod, data = dataf[!selp,])
  print(paste(pnow, fixef(newmod),collapse = ", "))
  r2vec <- c( r2vec,
    (dataf[selp,]$y - predict(newmod,newdata = dataf[selp,], allow.new.levels = T))^2
  )
}
# [1] "1 0.285049274214266, 1 0.95048020397442"
# [1] "2 0.185721150305479, 2 0.901716963753631"
# [1] "3 0.810224187608013, 3 0.615735520886265"
# [1] "4 0.869392151461034, 4 0.968302207873863"
# [1] "5 0.454297693044713, 5 0.914047431881036"
mean(r2vec) # 2.241611

# Function with inner commands as above:
lopofun <- function(mod) {
  dataf <- getData(mod)
  pvec <- unique(dataf$p)
  r2vec  <- numeric(0)
  for ( pnow in pvec ) { 
    selp <- dataf$p == pnow
    newmod <- update(mod, data = dataf[!selp,])
    print(paste(pnow, fixef(newmod),collapse = ", "))
    r2vec <- c( r2vec,
      (dataf[selp,]$y - predict(newmod,newdata = dataf[selp,], allow.new.levels = T))^2
    )
  }
  mean(r2vec)
}
lopofun(mod) # 0.3388175
# [1] "1 0.454297693044713, 1 0.914047431881036"
# [1] "2 0.454297693044713, 2 0.914047431881036"
# [1] "3 0.454297693044713, 3 0.914047431881036"
# [1] "4 0.454297693044713, 4 0.914047431881036"
# [1] "5 0.454297693044713, 5 0.914047431881036"

As the inserted print function shows, newmod is not updating, but instead taking what should be the final value in every instance of the for loop.

I'd like to continue using update(), as it would be clumsy to try to extract model features and refit manually. No doubt there's a nicer way to code this without the for loop, nonetheless the for loop is a simple structure, and I'm not sure why it should fail here.

Edit: I appreciate a reviewer suggestion that this is a lazy evaluation issue and that using sapply() instead of a for loop might fix things. This doesn't match my (rudimentary) understanding of lazy evaluation. print() needs to evaluate both the loop variable pnow, and use newmod: it returns correct pnow values, but not newmod values.

Below is an ugly attempt at implementing the same thing using sapply() rather than the for loop. However, here the raw commands and the function both fail.

# Inner commands using sapply
dataf <- getData(mod)
pvec <- unique(dataf$p)
onepfun <- function(pnow) {
  selp <- dataf$p == pnow
  newmod <- update(mod, data = dataf[!selp,])
  print(paste(pnow, paste(fixef(newmod),collapse = ", ")))
  (dataf[selp,]$y - predict(newmod,newdata = dataf[selp,], 
    allow.new.levels = T))^2
}
bob <- sapply(pvec,onepfun)
dim(bob) <- NULL
mean(bob) # 0.3388175
# [1] "1 0.454297693044713, 0.914047431881036"
# [1] "2 0.454297693044713, 0.914047431881036"
# [1] "3 0.454297693044713, 0.914047431881036"
# [1] "4 0.454297693044713, 0.914047431881036"
# [1] "5 0.454297693044713, 0.914047431881036"


# Function attempt with sapply
lopofun2 <- function(mod) {
  dataf <- getData(mod)
  pvec <- unique(dataf$p)
  onepfun <- function(pnow) {
    selp <- dataf$p == pnow
    newmod <- update(mod, data = dataf[!selp,])
    print(paste(pnow, paste(fixef(newmod),collapse = ", ")))
    (dataf[selp,]$y - predict(newmod,newdata = dataf[selp,], 
      allow.new.levels = T))^2
  }
  bob <- sapply(pvec,onepfun)
  dim(bob) <- NULL
  mean(bob)
}
lopofun2(mod) # 0.3388175
# [1] "1 0.454297693044713, 0.914047431881036"
# [1] "2 0.454297693044713, 0.914047431881036"
# [1] "3 0.454297693044713, 0.914047431881036"
# [1] "4 0.454297693044713, 0.914047431881036"
# [1] "5 0.454297693044713, 0.914047431881036"

Edit 2:

Thank you @Roland @Luigi and @Konrad-Rudolph for your input.

Konrad's linked post is a bit out of my depth, but from the answers below it seems clear that update() is behaving in a way I did not expect. From a novice point of view, I see that dataf from the global environment is smuggled into the function via the mod argument. update() then uses this environment as a first port-of-call for updating the data argument of the model. In my example it finds both dataf and selp globally, and proceeds. It only evaluates in the function environment when the global environment fails: in my case this can happen when selp is removed from the global environment, or dataf or selp variables within the function are renamed.

Proceeding a little further from Roland's check, printing the environment of the formula of newmod confirms this.

lopofun6 <- function(mod) {
  dataf <- getData(mod)
  pvec <- unique(dataf$p)
  r2vec  <- numeric(0)
  for ( pnow in pvec ) { 
    selp <- dataf$p == pnow
    newmod <- update(mod, data = dataf[!selp,])
    print(environment(formula(newmod)))
    r2vec <- c( r2vec,
      (dataf[selp,]$y - predict(newmod,newdata = dataf[selp,],
      allow.new.levels = T))^2
    )
  }
  mean(r2vec)
}
lopofun6(mod) 
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>
# [1] 0.3388175

Thank you Roland for the safe workaround for update().


Solution

  • The last lines of code in lme4:::update.merMod are these:

    ff <- environment(formula(object))
    pf <- parent.frame()
    sf <- sys.frames()[[1]]
    tryCatch(eval(call, envir = ff), error = function(e) {
        tryCatch(eval(call, envir = sf), error = function(e) {
            eval(call, envir = pf)
        })
    })
    

    The function tries to handle scoping issues by trying to evaluate inside different environments.

    The issue is that the first try is successfully evaluated in the global environment:

    environment(formula(mod))
    #<environment: R_GlobalEnv>
    

    The safest solution is to control the scope of evaluation yourself:

    lopofun <- function(mod) {
      dataf <- getData(mod)
      pvec <- unique(dataf$p)
      r2vec  <- list()
      for ( pnow in pvec ) { 
        selp <- dataf$p == pnow
        
        #don't evaluate
        newmod <- update(mod, data = dataf[!selp,], evaluate = FALSE)
        
        #now evaluate and ensure look-up goes directly to 
        #the evaluation environment of lopofun
        newmod <- eval(newmod, envir = NULL)
        
        print(paste(pnow, fixef(newmod),collapse = ", "))
        r2vec[[pnow]] <- (dataf[selp,]$y - predict(newmod,newdata = dataf[selp,], allow.new.levels = T))^2
      }
      mean(do.call(c, r2vec))
    }
    
    
    lopofun(mod)
    #[1] "1 0.285049274214266, 1 0.95048020397442"
    #[1] "2 0.185721150305479, 2 0.901716963753631"
    #[1] "3 0.810224187608013, 3 0.615735520886265"
    #[1] "4 0.869392151461034, 4 0.968302207873863"
    #[1] "5 0.454297693044713, 5 0.914047431881036"
    #[1] 2.241611
    

    PS: I think @BenBolker should take another look at the approach. I think it would be better to fail with an error than to get values with unexpected scope.