Tags: r, machine-learning, r-caret, glm

Different model accuracy when rerunning preProcess(), predict() and train() in R (caret)


The data below is just an example; it is the operations on this (or any) data that I am confused about:

library(caret)
library(AppliedPredictiveModeling)  # provides the AlzheimerDisease data
set.seed(3433)
data(AlzheimerDisease)
complete <- data.frame(diagnosis, predictors)
in_train <- createDataPartition(complete$diagnosis, p = 0.75)[[1]]
training <- complete[in_train,]
testing <- complete[-in_train,]
predIL <- grep("^IL", names(training))
smalltrain <- training[, c(1, predIL)]

fit_noPCA <- train(diagnosis ~ ., method = "glm", data = smalltrain)
pre_proc_obj <- preProcess(smalltrain[,-1], method = "pca", thresh = 0.8)
smalltrainsPCs <- predict(pre_proc_obj, smalltrain[,-1])
fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm")
fit_noPCA$results$Accuracy
fit_PCA$results$Accuracy

When running this code, I get an accuracy of 0.689539 for fit_noPCA and 0.682951 for fit_PCA. But when I rerun the last portion of the code:

fit_noPCA <- train(diagnosis ~ ., method = "glm", data = smalltrain)
pre_proc_obj <- preProcess(smalltrain[,-1], method = "pca", thresh = 0.8)
smalltrainsPCs <- predict(pre_proc_obj, smalltrain[,-1])
fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm")
fit_noPCA$results$Accuracy
fit_PCA$results$Accuracy

Then each time I rerun these 6 lines I get different accuracy values. Why is this? Is it because I am not resetting the seed? Even if so, where does the inherent randomness in this process come from?


Solution

  • By default, train() evaluates the model with bootstrap resampling; you can see this in the model printout:

    library(caret)
    library(AppliedPredictiveModeling)
    
    > fit_noPCA
    Generalized Linear Model 
    
    251 samples
     12 predictor
      2 classes: 'Impaired', 'Control' 
    
    No pre-processing
    Resampling: Bootstrapped (25 reps) 
    Summary of sample sizes: 251, 251, 251, 251, 251, 251, ... 
    Resampling results:
    
      Accuracy   Kappa     
      0.6870006  0.04107016
    

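    The reported accuracy is the mean over the 25 bootstrap resamples. You can inspect the individual resample results yourself; a small sketch, assuming the fit_noPCA object from above:

    ```r
    # Per-resample performance: one Accuracy/Kappa row per bootstrap rep
    head(fit_noPCA$resample)

    # The headline accuracy is the average of these values
    mean(fit_noPCA$resample$Accuracy)
    fit_noPCA$results$Accuracy
    ```

    Because each bootstrap rep draws its samples randomly, these per-resample values (and hence their mean) change on every unseeded run.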
    So with every call to train, the bootstrapped samples will be different. To get the same result back, set the seed before running train:

    set.seed(111)
    fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm",
                     trControl = trainControl(method = "boot", number = 100))
    fit_PCA$results$Accuracy
    [1] 0.6983512
    
    set.seed(112)
    fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm",
                     trControl = trainControl(method = "boot", number = 100))
    fit_PCA$results$Accuracy
    [1] 0.6991537
    
    set.seed(111)
    fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm",
                     trControl = trainControl(method = "boot", number = 100))
    fit_PCA$results$Accuracy
    [1] 0.6983512
    
    

    Or use, for example, cross-validation (method = "cv"), where you can define the folds yourself via the index = argument of trainControl.
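    A minimal sketch of that approach, continuing the question's example; the object names folds, ctrl and fit_cv are mine. Creating the fold list once and reusing it means every subsequent train() call sees identical resampling splits, so the accuracy is reproducible without reseeding before each call:

    ```r
    library(caret)
    library(AppliedPredictiveModeling)

    set.seed(3433)
    data(AlzheimerDisease)
    complete <- data.frame(diagnosis, predictors)
    in_train <- createDataPartition(complete$diagnosis, p = 0.75)[[1]]
    training <- complete[in_train, ]
    smalltrain <- training[, c(1, grep("^IL", names(training)))]

    # Fix the folds once: index = expects a list of *training* row indices,
    # so use returnTrain = TRUE
    folds <- createFolds(smalltrain$diagnosis, k = 10, returnTrain = TRUE)
    ctrl <- trainControl(method = "cv", index = folds)

    fit_cv <- train(diagnosis ~ ., method = "glm", data = smalltrain,
                    trControl = ctrl)
    fit_cv$results$Accuracy
    ```

    Rerunning just the train() line now returns the same accuracy each time, because the splits are frozen in folds.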