Search code examples
rmachine-learningr-caretglmnet

How can I train a glmnet model (Poisson family) with an offset term using the caret package in R?


I want to model insurance claim count using a Poisson glmnet. The data I have at hand contains the number of claims for each policy (which is the response variable), some features about the policy (gender, region, etc.) as well as the duration of the policy (in years). I want to include the log-duration as an offset term, as we usually do in actuarial science. With the cv.glmnet function of the glmnet package, it is straightforward:

library(tidyverse)
library(glmnet)

n <- 100

dat <- tibble(
 nb_claims = rpois(n, lambda = 0.5),
 duration = runif(n),
 x1 = runif(n),
 x2 = runif(n),
 x3 = runif(n)
)


fit <- cv.glmnet(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  family = "poisson",
  offset = dat %>% pull(duration) %>% log()
)

fit

However, my goal is to train this model using the train function of the caret package, because of the many advantages it gives. Indeed, validation, preprocessing as well as feature selection is much better with this package. It is straightforward to train a basic glmnet (without an offset term) with caret:

library(caret)

fit <- caret::train(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  method = "glmnet",
  family = "poisson"
)

fit

Naively, we could try to add the offset argument in the train function:

fit <- caret::train(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  method = "glmnet",
  family = "poisson",
  offset = dat %>% pull(duration) %>% log()
)

fit

Unfortunately, this code throws the error Error : No newoffset provided for prediction, yet offset used in fit of glmnet. This error occurs because the caret::train function doesn't take care to give a value for the newoffset argument in predict.glmnet function.

In this book, they show how to add an offset term to a GLM model by modifying the source code of the caret::train function. It works perfectly. However, the predict.glm function is quite different from the predict.glmnet function, because it does not have the newoffset argument. I tried to modify the source code of the caret::train function, but I am having some trouble because I do not know well enough how this function works.


Solution

  • A simple way to perform this is pass the offset column as part of x and in each fit and predict call pass as x columns of x which are not the offset. While as offset/newoffset pass the x column corresponding to the offset.

    In the following example the offest column of x needs to be named "offset" too. This can be changed relatively easy

    To create the function we will just use lots of parts from: https://github.com/topepo/caret/blob/master/models/files/glmnet.R

    glmnet is peculiar since it needs a loop, the rest is just rinse and reapeat from https://topepo.github.io/caret/using-your-own-model-in-train.html#illustrative-example-1-svms-with-laplacian-kernels

    family = "poisson" will be specified throughout, to change this adopt code from https://github.com/topepo/caret/blob/master/models/files/glmnet.R

    glmnet_offset <- list(type = "Regression",
                          library = c("glmnet", "Matrix"),
                          loop = function(grid) {
                            alph <- unique(grid$alpha)
                            loop <- data.frame(alpha = alph)
                            loop$lambda <- NA
                            submodels <- vector(mode = "list", length = length(alph))
                            for(i in seq(along = alph)) {
                              np <- grid[grid$alpha == alph[i],"lambda"]
                              loop$lambda[loop$alpha == alph[i]] <- np[which.max(np)]
                              submodels[[i]] <- data.frame(lambda = np[-which.max(np)])
                            }
                            list(loop = loop, submodels = submodels)
                          }) 
    
    
    glmnet_offset$parameters <- data.frame(parameter = c('alpha', 'lambda'),
                                           class = c("numeric", "numeric"),
                                           label = c('Mixing Percentage', 'Regularization Parameter'))
    
    
    glmnet_offset$grid <- function(x, y, len = NULL, search = "grid") {
      if(search == "grid") {
        init <- glmnet::glmnet(Matrix::as.matrix(x[,colnames(x) != "offset"]), y,
                               family = "poisson",
                               nlambda = len+2,
                               alpha = .5,
                               offset = x[,colnames(x) == "offset"])
        lambda <- unique(init$lambda)
        lambda <- lambda[-c(1, length(lambda))]
        lambda <- lambda[1:min(length(lambda), len)]
        out <- expand.grid(alpha = seq(0.1, 1, length = len),
                           lambda = lambda)
      } else {
        out <- data.frame(alpha = runif(len, min = 0, 1),
                          lambda = 2^runif(len, min = -10, 3))
      }
      out
    }
    

    So x[,colnames(x) != "offset"] is x while offset is x[,colnames(x) == "offset"]

    glmnet_offset$fit <- function(x, y, wts, param, last, ...) {
    
      theDots <- list(...)
    
    
      ## pass in any model weights
      if(!is.null(wts)) theDots$weights <- wts
    
      if(!(class(x)[1] %in% c("matrix", "sparseMatrix")))
        x <- Matrix::as.matrix(x)
      modelArgs <- c(list(x = x[,colnames(x) != "offset"],
                          y = y,
                          alpha = param$alpha,
                          family = "poisson",
                          offset = x[,colnames(x) == "offset"]),
                     theDots)
    
      out <- do.call(glmnet::glmnet, modelArgs)
      if(!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
      out
    }
    
    glmnet_offset$predict <- function(modelFit, newdata, submodels = NULL) {
      if(!is.matrix(newdata)) newdata <- Matrix::as.matrix(newdata)
        out <- predict(modelFit,
                       newdata[,colnames(newdata) != "offset"],
                       s = modelFit$lambdaOpt,
                       newoffset = newdata[,colnames(newdata) == "offset"],
                       type = "response") #important for measures to be appropriate
    
      if(is.matrix(out)) out <- out[,1]
      out
      if(!is.null(submodels)) {
          tmp <- as.list(as.data.frame(predict(modelFit,
                                              newdata[,colnames(newdata) != "offset"],
                                              s = submodels$lambda,
                                              newoffset = newdata[,colnames(newdata) == "offset"],
                                              type = "response"),
                                       stringsAsFactors = TRUE))
        out <- c(list(out), tmp)
    
      }
      out
    
    }
    

    For some reason which I don't understand yet it does not work without the prob slot

    glmnet_offset$prob <- glmnet_offset$predict
    
    
    glmnet_offset$tags = c("Generalized Linear Model", "Implicit Feature Selection",
                           "L1 Regularization", "L2 Regularization", "Linear Classifier",
                           "Linear Regression")
    
    
    glmnet_offset$sort = function(x) x[order(-x$lambda, x$alpha),]
    glmnet_offset$trim = function(x) {
      x$call <- NULL
      x$df <- NULL
      x$dev.ratio <- NULL
      x
    }
    
    library(tidyverse)
    library(caret)
    library(glmnet)
    
    n <- 100
    set.seed(123)
    dat <- tibble(
      nb_claims = rpois(n, lambda = 0.5),
      duration = runif(n),
      x1 = runif(n),
      x2 = runif(n),
      x3 = runif(n)
    )
    
    x = dat %>%
      dplyr::select(-nb_claims) %>%
      mutate(offset = log(duration)) %>%
      dplyr::select(-duration) %>%
      as.matrix
    
    fit <- caret::train(
      x = x,
      y = dat %>% pull(nb_claims),
      method = glmnet_offset,
    )
    
    fit
    100 samples
      4 predictor
    
    No pre-processing
    Resampling: Bootstrapped (25 reps) 
    Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
    Resampling results across tuning parameters:
    
      alpha  lambda        RMSE       Rsquared    MAE      
      0.10   0.0001640335  0.7152018  0.01805762  0.5814200
      0.10   0.0016403346  0.7152013  0.01805684  0.5814193
      0.10   0.0164033456  0.7130390  0.01798125  0.5803747
      0.55   0.0001640335  0.7151988  0.01804917  0.5814020
      0.55   0.0016403346  0.7150312  0.01802689  0.5812936
      0.55   0.0164033456  0.7095996  0.01764947  0.5783706
      1.00   0.0001640335  0.7152033  0.01804795  0.5813997
      1.00   0.0016403346  0.7146528  0.01798979  0.5810811
      1.00   0.0164033456  0.7063482  0.01732168  0.5763653
    
    RMSE was used to select the optimal model using the smallest value.
    The final values used for the model were alpha = 1 and lambda = 0.01640335.
    
    predict(fit$finalModel,  x[,1:3], newoffset = x[,4]) #works
    

    This will not work with preprocessing in caret since we pass offset as one of the features. However it will work with recipes since you can define columns on which preprocessing functions will be performed via selections. Se article for details: https://tidymodels.github.io/recipes/articles/Selecting_Variables.html

    I haven't had time to error check my code. If any problems occur or if there is a mistake somewhere please comment. Thanks.

    You can also post an issue in caret github asking this feature (offset/newoffset) to be added to the model