r, functions, apply

Problem passing column name to sapply within a function


I need to calculate a lot of predicted probabilities for multiple logit models, and I'm trying to write a function to speed up the process. I'm having trouble making my function work correctly, however. The problem seems to be the "iv=x" portion of the code below. I'm not sure how to correctly pass the column name there.

pp <- function(iv, model, df) {
  lev <- levels(df[[iv]])
  l.prob <- sapply(lev, FUN=function(x){
  mean(predict(model, type = "response", 
               newdata = mutate(df, iv = x)), na.rm=TRUE)
  })
  l.prob
}


test <- pp(iv="myvar", model=model1, df=mydf)
test

Here is some example data showing how the function isn't working:

library(dplyr)  # mutate() comes from dplyr
set.seed(123123)
df=data.frame(y=sample(c(0,1), replace=TRUE, size=100), x1=as.factor(rep(c("value1", "value2"), 50)), x2=rnorm(100, mean=50, sd=10))


logit1 <- glm(y ~ x1+x2, data = df, family=binomial(link="logit"))
summary(logit1)


#what the predicted probabilities should be (0.4173400, 0.4625565)
lev <- levels(df$x1)
pp <- sapply(lev, FUN=function(x){
  mean(predict(logit1, type = "response", 
               newdata = mutate(df, x1 = x)), na.rm=TRUE)
})
pp

#now running function (produces probabilities 0.44 and 0.44)

pp <- function(iv, model, df) {
  lev <- levels(df[[iv]])
  l.prob <- sapply(lev, FUN=function(x){
    mean(predict(model, type = "response", 
                 newdata = mutate(df, iv = x)), na.rm=TRUE)
  })
  l.prob
}


test <- pp(iv="x1", model=logit1, df=df)
test

Solution

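  • mutate(df, iv = x) creates a new column literally named iv instead of overwriting the column whose name is stored in iv, so the model's predictor never changes and every level gets the same prediction. Consider assigning the column dynamically with [[ before predicting and dropping mutate altogether (if mutate is the only dplyr function you use, this also saves you a library call).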

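    pp <- function(iv, model, df) {
      lev <- levels(df[[iv]])
      l.prob <- sapply(lev, FUN=function(x){
            df[[iv]] <- x   # overwrite the column named by iv (local copy only)
            mean(predict(model, type = "response", newdata = df), na.rm=TRUE)
      })
      l.prob   # return the result visibly, as the original function did
    }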
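
    To check this kind of function against the example data from the question, something like the sketch below may help. It rebuilds the data and model, runs a [[ version, and compares it with a by-hand calculation that hard-codes x1. Wrapping the value in factor(x, levels = lev) is my own variation (it keeps the column a factor with all levels), and by_hand is just an illustrative name, not something from the question.

```r
# Example data and model from the question
set.seed(123123)
df <- data.frame(
  y  = sample(c(0, 1), replace = TRUE, size = 100),
  x1 = as.factor(rep(c("value1", "value2"), 50)),
  x2 = rnorm(100, mean = 50, sd = 10)
)
logit1 <- glm(y ~ x1 + x2, data = df, family = binomial(link = "logit"))

pp <- function(iv, model, df) {
  lev <- levels(df[[iv]])
  sapply(lev, FUN = function(x) {
    # Assumption: keep the column a factor with the full set of levels
    df[[iv]] <- factor(x, levels = lev)
    mean(predict(model, type = "response", newdata = df), na.rm = TRUE)
  })
}

res <- pp(iv = "x1", model = logit1, df = df)

# Reference calculation with the column name hard-coded
by_hand <- sapply(levels(df$x1), function(x) {
  nd <- df
  nd$x1 <- factor(x, levels = levels(df$x1))
  mean(predict(logit1, type = "response", newdata = nd))
})

print(res)
print(all.equal(res, by_hand))
```

    If the two results agree and the values differ by level, the column name is being passed through correctly.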
    

    Another base R method is to add a new column under a temporary name and then rename the columns using the dynamic parameter.

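      l.prob <- sapply(lev, FUN=function(x){
            # drop the original column first so the renamed temp column is the
            # only one called iv (duplicate names would leave the old values in use)
            others <- setdiff(colnames(df), iv)
            newdf <- setNames(transform(df[others], tmp = x), c(others, iv))
            mean(predict(model, type = "response", newdata = newdf), na.rm=TRUE)
      })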
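
    If you would rather keep mutate, one option is tidy evaluation: unquote the string with !! and assign with :=, so the column is named after the value of iv rather than the literal word iv. This is a sketch, not part of the answer above. It assumes dplyr 0.7 or later is installed, and pp_tidy, lvl and col are names I made up for illustration.

```r
library(dplyr)  # assumption: dplyr >= 0.7, which supports !! and :=

# Example data and model from the question
set.seed(123123)
df <- data.frame(
  y  = sample(c(0, 1), replace = TRUE, size = 100),
  x1 = as.factor(rep(c("value1", "value2"), 50)),
  x2 = rnorm(100, mean = 50, sd = 10)
)
logit1 <- glm(y ~ x1 + x2, data = df, family = binomial(link = "logit"))

# Why the original fails: iv = ... adds a column literally called "iv"
names(mutate(df, iv = "value1"))

# Unquoting a string on the left-hand side overwrites the intended column
col <- "x1"
names(mutate(df, !!col := "value1"))

pp_tidy <- function(iv, model, df) {
  lev <- levels(df[[iv]])
  sapply(lev, FUN = function(lvl) {
    # !!iv := lvl sets the column whose name is stored in iv
    newdata <- mutate(df, !!iv := lvl)
    mean(predict(model, type = "response", newdata = newdata), na.rm = TRUE)
  })
}

res_tidy <- pp_tidy(iv = "x1", model = logit1, df = df)
print(res_tidy)
```

    One caveat with this sketch: mutate looks up names in the data frame first, so a data frame that happens to have a column called lvl would shadow the loop variable.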