
PCA threshold tuning in Caret


I am trying to build a classifier from some data using caret. One of the approaches I want to try is a simple LDA from data pre-processed with PCA. I found out how to use caret for this:

fitControl <- trainControl("repeatedcv", number=10, repeats = 10,
                                preProcOptions = list(thresh = 0.9))
ldaFit1 <- train(label ~ ., data = tab,
                method = "lda2",
                preProcess = c("center", "scale", "pca"),
                trControl = fitControl)

As expected, caret compares the accuracy of the LDA across different values of dimen:

Linear Discriminant Analysis

 158 samples
1955 predictors
   3 classes: '1', '2', '3'

Pre-processing: centered (1955), scaled (1955), principal component
 signal extraction (1955)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 142, 142, 143, 142, 143, 142, ...
Resampling results across tuning parameters:

  dimen  Accuracy   Kappa
  1      0.5498987  0.1151681
  2      0.5451340  0.1298590

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was dimen = 1.

What I would like to do is add the PCA threshold to the tuning parameters, but I cannot find a way to do this.

Is there a simple solution for this with caret? Or does one need to repeat the training step with different pre-processing options and select the best value in the end?
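
For reference, the manual approach mentioned above (repeating the training step once per threshold and picking the best result) could look roughly like the sketch below. It is only an illustration: iris stands in for the real data, Species plays the role of label, and the candidate thresholds and the reduced number of repeats are assumptions chosen to keep the example quick.

```r
library(caret)

# Hypothetical candidate values for the PCA variance threshold
thresholds <- c(0.8, 0.9, 0.95)

# Fit one caret model per threshold (iris is a stand-in data set)
fits <- lapply(thresholds, function(th) {
  ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                       preProcOptions = list(thresh = th))
  set.seed(42)  # reset the seed so every threshold sees the same folds
  train(Species ~ ., data = iris,
        method = "lda2",
        preProcess = c("center", "scale", "pca"),
        trControl = ctrl)
})

# Best cross-validated accuracy (over dimen) for each threshold
acc <- sapply(fits, function(f) max(f$results$Accuracy))
summary_tab <- data.frame(thresh = thresholds, Accuracy = acc)
best_thresh <- summary_tab$thresh[which.max(summary_tab$Accuracy)]
print(summary_tab)
```

This works, but the threshold is selected outside of caret's own tuning machinery, which is what the question is trying to avoid.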


Solution

  • Thanks to the links pointed out by missuse, I managed to integrate the PCA variance-explained threshold into the parameter tuning using the modelgrid package:

    library(caret)
    library(recipes)
    library(modelgrid)  # provides model_grid(), share_settings(), add_model()
    library(MASS)
    
    # Setting up a vector of thresholds to try out
    pca_varex <- c(0.8, 0.9, 0.95, 0.97, 0.98, 0.99, 0.995, 0.999)
    
    # Setting up recipe (`train` is the training data frame, `label` the class column)
    initial_recipe <- recipe(train, formula = label ~ .) %>%
                        step_center(all_predictors()) %>%
                        step_scale(all_predictors())
    
    # Define the modelgrid
    models <- model_grid() %>%
                share_settings(data = train,
                                trControl = caret::trainControl(method = "repeatedcv",
                                                            number = 10,
                                                            repeats = 10),
                                method = "lda2") 
    
    # Add models with different PCA thresholds
    for (i in pca_varex) {
        models <- models %>% add_model(model_name = sprintf("varex_%s", i),
                                        x = initial_recipe %>%
                                            step_pca(all_predictors(), threshold = i))
    }
    
    # Train
    models <- models %>% train(.)
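
Once the grid is trained, the fitted models still have to be compared to pick a threshold. A minimal self-contained sketch of that last step follows, assuming the trained model grid keeps its caret fits in a named list element called model_fits (as described in the modelgrid documentation). iris stands in for the real data, and the three thresholds and the reduced number of repeats are illustrative choices.

```r
library(caret)
library(recipes)
library(modelgrid)

# iris is a stand-in data set; Species plays the role of `label`
base_recipe <- recipe(Species ~ ., data = iris) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

mg <- model_grid() %>%
  share_settings(data = iris,
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10,
                                          repeats = 3),
                 method = "lda2")

# One model per (hypothetical) PCA threshold
for (th in c(0.8, 0.95, 0.99)) {
  mg <- mg %>% add_model(model_name = sprintf("varex_%s", th),
                         x = base_recipe %>%
                           step_pca(all_predictors(), threshold = th))
}

mg <- train(mg)

# Best cross-validated accuracy (over dimen) for each threshold
best_acc <- sapply(mg$model_fits, function(fit) max(fit$results$Accuracy))
print(best_acc)

# Name of the model with the highest accuracy
best_model <- names(best_acc)[which.max(best_acc)]
```

The same comparison should carry over to the full grid by replacing mg with models.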
    

    That said, judging from the modelgrid and recipes documentation, the tidymodels framework may be the most straightforward way to do this (https://www.tidymodels.org/).
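
A rough sketch of what the tidymodels route might look like is given below. It was not part of the original solution and rests on several assumptions: that the discrim package is installed (for discrim_linear() with the MASS engine), that iris is an acceptable stand-in for the real data, and that the threshold values and the number of repeats are illustrative. The idea is to mark the threshold argument of step_pca() with tune() so that it becomes an ordinary tuning parameter.

```r
library(tidymodels)
library(discrim)  # provides discrim_linear()

set.seed(42)
# Repeated 10-fold cross-validation (3 repeats to keep the sketch quick)
folds <- vfold_cv(iris, v = 10, repeats = 3)

# Center/scale, then PCA with the variance threshold left open for tuning
rec <- recipe(Species ~ ., data = iris) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = tune())

# LDA fitted through MASS::lda
lda_spec <- discrim_linear() %>%
  set_engine("MASS")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lda_spec)

# Hypothetical candidate thresholds; the column name must match the tuned argument
grid <- tibble(threshold = c(0.8, 0.9, 0.95, 0.99))

res <- tune_grid(wf,
                 resamples = folds,
                 grid = grid,
                 metrics = metric_set(accuracy))

# Mean accuracy per threshold, best first
show_best(res, metric = "accuracy")
```

Unlike the lda2 setup above, this sketch tunes only the PCA threshold and not the number of discriminant dimensions.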