machine-learning  pca  cross-validation  r-caret

PCA within cross validation; however, only with a subset of variables


This question is very similar to "preprocess within cross-validation in caret"; however, in the project I'm working on, I only want to apply PCA to three of my 19 predictors. Below is the example from that question, using the PimaIndiansDiabetes data for convenience (this is not my project data, but the concept should be the same). I would like preProcess to act only on a subset of variables, i.e. PimaIndiansDiabetes[, c(4,5,6)]. Is there a way to do this?

library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

control <- trainControl(method="cv", 
                        number=5)
p <- preProcess(PimaIndiansDiabetes[, c(4,5,6)], #only do these columns!
                     method = c("center", "scale", "pca"))
p
grid=expand.grid(mtry=c(1,2,3))

model <- train(diabetes~., data=PimaIndiansDiabetes, method="rf", 
               preProcess= p, 
               trControl=control,
               tuneGrid=grid)

But I get this error:

Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr

The reason I'm trying to do this is so that I can reduce the three variables to a single principal component (PC1) and use it for prediction. In my project, all three variables are correlated above 90%, but I would still like to include them because other studies have used them as well. I'm also trying to avoid data leakage. Thanks!


Solution

  • As far as I know, this is not possible with caret's preProcess argument alone. It might be possible using recipes. I do not use recipes, but I do use mlr3, so I will show how to do it with that package:

    library(mlr3)
    library(mlr3pipelines)
    library(mlr3learners)
    library(paradox)
    library(mlr3tuning)
    library(mlbench)
    

    create a task from the data:

    data("PimaIndiansDiabetes")
    
    pima_tsk <- TaskClassif$new(id = "Pima",
                                backend = PimaIndiansDiabetes,
                                target = "diabetes")
    

    define a preprocessing selector named "slct1":

    pos1 <- po("select", id = "slct1")
    

    and define the selector function within it:

    pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))
    

    now define what should happen to the selected features: scale them, then run PCA and keep only the first principal component (param_vals = list(rank. = 1)):

    pos1 %>>%
      po("scale", id = "scale1") %>>%
      po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
    

    now define an invert selector:

    pos2 <- po("select", id = "slct2")
    
    pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)
    

    define the learner:

    rf_lrn <- po("learner", lrn("classif.ranger")) #ranger is a faster version of rf
    

    combine them:

    gunion(list(pr1, pos2)) %>>%
      po("featureunion") %>>%
      rf_lrn -> graph
    

    check if it looks ok:

    graph$plot(html = TRUE)
    

    convert graph to a learner:

    glrn <- GraphLearner$new(graph)
    

    define parameters you want tuned:

    ps <-  ParamSet$new(list(
      ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
      ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))
    

    define resampling:

    cv10 <- rsmp("cv", folds = 10)
    

    define tuning:

    instance <- TuningInstance$new(
      task = pima_tsk,
      learner = glrn,
      resampling = cv10,
      measures = msr("classif.ce"),
      param_set = ps,
      terminator = term("evals", n_evals = 20)
    )
    
    set.seed(1)
    tuner <- TunerRandomSearch$new()
    tuner$tune(instance)
    instance$result
    

    For additional details on how to tune the number of principal components to keep, see this answer: "R caret: How do I apply separate pca to different dataframes before training?"
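    The number of components can also be tuned inside the same pipeline. The sketch below rebuilds the graph from this answer and adds pca1.rank. to the search space. It assumes recent releases of mlr3tuning and paradox, where the helpers ps(), p_int(), ti(), trm() and tnr() are available; on older versions the constructor names differ. The variable names pca_cols and search_space are illustrative only.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)   # provides classif.ranger (requires the ranger package)
library(mlr3tuning)
library(paradox)
library(mlbench)

data("PimaIndiansDiabetes")
pima_tsk <- as_task_classif(PimaIndiansDiabetes, target = "diabetes", id = "Pima")

# Columns that go through scaling + PCA (columns 4:6 of the data)
pca_cols <- c("triceps", "insulin", "mass")

# Branch 1: selected columns -> scale -> PCA (rank. is left open for tuning)
pr1 <- po("select", id = "slct1", selector = selector_name(pca_cols)) %>>%
  po("scale", id = "scale1") %>>%
  po("pca", id = "pca1")

# Branch 2: all remaining features pass through untouched
pos2 <- po("select", id = "slct2",
           selector = selector_invert(selector_name(pca_cols)))

graph <- gunion(list(pr1, pos2)) %>>%
  po("featureunion") %>>%
  po("learner", lrn("classif.ranger"))

glrn <- as_learner(graph)

# Search space: number of PCs kept plus two ranger hyperparameters.
# With 5 pass-through features and at least 1 PC, mtry <= 6 is always valid.
search_space <- ps(
  pca1.rank.               = p_int(lower = 1, upper = 3),
  classif.ranger.mtry      = p_int(lower = 1, upper = 6),
  classif.ranger.num.trees = p_int(lower = 100, upper = 1000)
)

instance <- ti(
  task         = pima_tsk,
  learner      = glrn,
  resampling   = rsmp("cv", folds = 10),
  measures     = msr("classif.ce"),
  terminator   = trm("evals", n_evals = 20),
  search_space = search_space
)

set.seed(1)
tuner <- tnr("random_search")
tuner$optimize(instance)
instance$result
```

    Because scaling and PCA are part of the graph learner, they are refit on the training portion of every fold, which is what keeps the held-out data out of the preprocessing.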

    If you find this interesting, check out the mlr3book.

    Also

    cor(PimaIndiansDiabetes[, 4:6])
              triceps   insulin      mass
    triceps 1.0000000 0.4367826 0.3925732
    insulin 0.4367826 1.0000000 0.1978591
    mass    0.3925732 0.1978591 1.0000000
    

    does not show the above-90% correlations you mention in the question.
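
    The recipes route mentioned at the start of this answer could look roughly like the sketch below. It assumes the recipes and randomForest packages are installed, and it relies on caret's train() accepting a recipe as its first argument, in which case the recipe steps are re-estimated within each resample. The object name rec is illustrative.

```r
library(caret)
library(recipes)
library(mlbench)

data(PimaIndiansDiabetes)

# Recipe: centre/scale and run PCA on three predictors only,
# keeping a single component; all other predictors are left as they are.
rec <- recipe(diabetes ~ ., data = PimaIndiansDiabetes)
rec <- step_normalize(rec, triceps, insulin, mass)
rec <- step_pca(rec, triceps, insulin, mass, num_comp = 1)

control <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(1, 2, 3))

set.seed(1)
model <- train(rec,
               data = PimaIndiansDiabetes,
               method = "rf",
               trControl = control,
               tuneGrid = grid)
model
```

    To inspect what the model actually sees, bake(prep(rec), new_data = NULL) returns the transformed training data, in which the three original columns are replaced by a single PC1 column.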