Caret package - cross-validating GAM with both smooth and linear predictors

I would like to cross validate a GAM model using caret. My GAM model has a binary outcome variable, an isotropic smooth of latitude and longitude coordinate pairs, and then linear predictors. Typical syntax when using mgcv is:

gam1 <- gam( y ~ s(lat , long) + x1 + x2, family = binomial(logit) )

I'm not quite sure how to specify this model using the train function in caret. This is my syntax more or less:

cv <- train(y ~ lat + long + x1 + x2, 
            data = data, 
            method = "gam", 
            family = "binomial", 
            trControl = trainControl(method = "LOOCV", number = 1, repeats = 1), 
            tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))

The problem is that I only want lat and long to be smoothed and x1 and x2 to be treated as linear.



  • It is interesting to see someone using mgcv through another package. After a bit of research, I have to disappoint you: using mgcv through caret is a bad idea, at least with the support caret currently provides.

    Let me just ask you a few fundamental questions. If you are using caret:

    1. How can you specify the number of knots, as well as the spline basis class, for a smooth function?
    2. How can you specify a 2D smooth function?
    3. How can you specify a tensor product spline with te or ti?
    4. How can you tweak the smoothing parameters?
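
    For contrast, each of these is a single argument when you call mgcv directly. A minimal sketch on data simulated with gamSim (the seed, the object names, and the particular k and sp values are arbitrary choices for illustration; strictly, k sets the basis dimension rather than the knot count):

    ```r
    library(mgcv)

    set.seed(1)
    dat0 <- gamSim(eg = 2, scale = 0.2, verbose = FALSE)$data[1:3]  # columns y, x, z

    # 1. basis class (cubic regression spline) and basis dimension for a smooth
    fit_k  <- gam(y ~ s(x, bs = "cr", k = 20) + s(z), data = dat0)
    # 2. a 2D isotropic smooth of two covariates
    fit_2d <- gam(y ~ s(x, z, k = 40), data = dat0)
    # 3. a tensor product smooth
    fit_te <- gam(y ~ te(x, z, k = c(6, 6)), data = dat0)
    # 4. smoothing parameters supplied by hand instead of being estimated
    fit_sp <- gam(y ~ s(x) + s(z), sp = c(0.1, 10), data = dat0)
    ```

    None of these arguments can be reached through the formula that caret builds, as the fitting routine shows.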

    If you want to know what caret::train is doing with method = "gam", check out its fitting routine:

    getModelInfo(model = "gam", regex = FALSE)$gam$fit
    function(x, y, wts, param, lev, last, classProbs, ...) { 
                dat <- if(is.data.frame(x)) x else as.data.frame(x)
                modForm <- caret:::smootherFormula(x)
                if(is.factor(y)) {
                  dat$.outcome <- ifelse(y == lev[1], 0, 1)
                  dist <- binomial()
                } else {
                  dat$.outcome <- y
                  dist <- gaussian()
                }
                modelArgs <- list(formula = modForm,
                                  data = dat,
                                  select = param$select, 
                                  method = as.character(param$method))
                ## Intercept family if passed in
                theDots <- list(...)
                if(!any(names(theDots) == "family")) modelArgs$family <- dist
                modelArgs <- c(modelArgs, theDots)                 
                out <- do.call(getFromNamespace("gam", "mgcv"), modelArgs)
                out
              }

    You see the modForm <- caret:::smootherFormula(x) line? That line is the key; the other lines are just routine construction of a model call. So let's check what GAM formula caret constructs:

    caret:::smootherFormula
    function (data, smoother = "s", cut = 10, df = 0, span = 0.5, 
        degree = 1, y = ".outcome") 
    {
        nzv <- nearZeroVar(data)
        if (length(nzv) > 0) 
            data <- data[, -nzv, drop = FALSE]
        numValues <- sort(apply(data, 2, function(x) length(unique(x))))
        prefix <- rep("", ncol(data))
        suffix <- rep("", ncol(data))
        prefix[numValues > cut] <- paste(smoother, "(", sep = "")
        if (smoother == "s") {
            suffix[numValues > cut] <- if (df == 0) 
                ")"
            else paste(", df=", df, ")", sep = "")
        }
        if (smoother == "lo") {
            suffix[numValues > cut] <- paste(", span=", span, ",degree=", 
                degree, ")", sep = "")
        }
        if (smoother == "rcs") {
            suffix[numValues > cut] <- ")"
        }
        rhs <- paste(prefix, names(numValues), suffix, sep = "")
        rhs <- paste(rhs, collapse = "+")
        form <- as.formula(paste(y, rhs, sep = "~"))
        form
    }

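    In short, it creates additive, univariate smooths: each predictor with more than cut = 10 unique values gets its own s(), and the rest enter linearly. This is the classic form from when GAMs were first proposed.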
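
    To make the rule concrete, here is a stripped-down mimic of that logic. The helper sketch_formula and the toy data are hypothetical, written only for illustration; it skips the near-zero-variance filter and the sorting step of the real function:

    ```r
    # Hypothetical helper: a simplified mimic of the rule above, not caret's own code.
    sketch_formula <- function(data, cut = 10, y = ".outcome") {
      # count distinct values per column
      n_unique <- sapply(data, function(col) length(unique(col)))
      # wrap a column in s() only if it has more than `cut` distinct values
      rhs <- ifelse(n_unique > cut, paste0("s(", names(data), ")"), names(data))
      as.formula(paste(y, paste(rhs, collapse = " + "), sep = " ~ "))
    }

    toy <- data.frame(lat = runif(50), long = runif(50),
                      x1 = rnorm(50), x2 = rep(0:1, 25))
    f <- sketch_formula(toy)
    print(f)
    # .outcome ~ s(lat) + s(long) + s(x1) + x2
    ```

    Note that the continuous x1 is smoothed whether you want it or not, and lat and long never end up in the same s() term.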

    As a result, you lose a significant amount of control over mgcv, as the questions listed above suggest.

    To verify this, let me construct a similar example to your case:

    library(mgcv)  # provides gamSim
    dat <- gamSim(eg = 2, scale = 0.2)$data[1:3]
    dat$a <- runif(400)
    dat$b <- runif(400)
    dat$y <- with(dat, y + 0.3 * a - 0.7 * b)
    head(dat)
    #            y         x         z          a         b
    #1 -0.30258559 0.8966972 0.1478457 0.07721866 0.3871130
    #2 -0.59518832 0.2655087 0.6588776 0.13853856 0.8718050
    #3 -0.06978648 0.3721239 0.1850700 0.04752457 0.9671970
    #4 -0.17002059 0.5728534 0.9543781 0.03391887 0.8669163
    #5  0.55452069 0.9082078 0.8978485 0.91608902 0.4377153
    #6 -0.17763650 0.2016819 0.9436971 0.84020039 0.1919378

    So we aim to fit a model: y ~ s(x, z) + a + b. The data y is Gaussian, but this does not matter; it does not affect how caret works with mgcv.

    library(caret)
    cv <- train(y ~ x + z + a + b, data = dat, method = "gam", family = "gaussian",
                trControl = trainControl(method = "LOOCV", number = 1, repeats = 1), 
                tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))

    You can extract the final model:

    fit <- cv$finalModel  # the 11th element of the train object, i.e. cv[[11]]

    So what formula is it using?

    fit$formula
    #.outcome ~ s(x) + s(z) + s(a) + s(b)

    See? Apart from being "additive, univariate", it also leaves every argument of mgcv::s at its default: bs = "tp", k = 10, and so on.
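
    If cross-validation is the goal, one way around this is to skip train and write the resampling loop by hand around mgcv::gam, so the formula stays fully under your control. A minimal sketch, assuming simulated data in place of yours (the data frame sim, the 5-fold split, and the Brier score are choices made for illustration; setting k_folds to nrow(sim) gives leave-one-out):

    ```r
    library(mgcv)

    set.seed(42)
    n <- 200
    # Simulated stand-in for the real data: binary y, coordinates, two linear predictors
    sim <- data.frame(lat = runif(n), long = runif(n), x1 = rnorm(n), x2 = rnorm(n))
    eta <- sin(2 * pi * sim$lat) * cos(2 * pi * sim$long) + 0.5 * sim$x1 - 0.5 * sim$x2
    sim$y <- rbinom(n, 1, plogis(eta))

    k_folds <- 5
    fold <- sample(rep(seq_len(k_folds), length.out = n))  # random fold labels
    pred <- numeric(n)

    for (i in seq_len(k_folds)) {
      train_set <- sim[fold != i, ]
      test_set  <- sim[fold == i, ]
      # 2D isotropic smooth of the coordinates, x1 and x2 as linear terms
      fit_i <- gam(y ~ s(lat, long) + x1 + x2, family = binomial,
                   data = train_set, method = "REML")
      # out-of-fold predicted probabilities
      pred[fold == i] <- predict(fit_i, newdata = test_set, type = "response")
    }

    brier    <- mean((pred - sim$y)^2)         # mean squared error of probabilities
    accuracy <- mean((pred > 0.5) == sim$y)    # classification accuracy at 0.5
    ```

    You give up caret's bookkeeping, but every mgcv argument (k, bs, te, sp, and so on) is available inside the loop.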