Tags: r, machine-learning, svm, r-caret, kernlab

cross-validate predictions for caret and svm


There seem to be differences between the ROC/Sens/Spec produced when tuning the model and the actual predictions made by the model on the same dataset. I'm using caret, which uses kernlab's ksvm. I'm not experiencing this problem with glm.

data(iris)
library(caret)
iris <- subset(iris, Species == "versicolor" | Species == "setosa") # keep only two output classes
iris$noise <- runif(nrow(iris)) # add noise - otherwise the model is too "perfect"
iris$Species <- factor(iris$Species)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                           savePredictions = TRUE, classProbs = TRUE,
                           summaryFunction = twoClassSummary)

ir <- train(Species ~ Sepal.Length + noise, data = iris, method = "svmRadial",
            preProc = c("center", "scale"), trControl = fitControl, metric = "ROC")
confusionMatrix(predict(ir), iris$Species, positive = "setosa")
getTrainPerf(ir) # same as in the model summary

What is the source of this discrepancy? Which ones are the "real", post-cross-validation predictions?


Solution

  • The function getTrainPerf gives the mean performance for the best tuned parameters, averaged across the repeated cross-validation folds.

    Here is how getTrainPerf works:

    getTrainPerf(ir) 
    #  TrainROC TrainSens TrainSpec    method
    #1   0.9096     0.844     0.884 svmRadial
    

    which is achieved in the following way:

    ir$results
    #      sigma    C    ROC  Sens  Spec      ROCSD    SensSD    SpecSD
    #1 0.7856182 0.25 0.9064 0.860 0.888 0.09306044 0.1355262 0.1222911
    #2 0.7856182 0.50 0.9096 0.844 0.884 0.08882360 0.1473023 0.1218229
    #3 0.7856182 1.00 0.8968 0.836 0.884 0.09146071 0.1495026 0.1218229
    ir$bestTune
    #      sigma   C
    #2 0.7856182 0.5
    merge(ir$results, ir$bestTune)
    #      sigma   C    ROC  Sens  Spec     ROCSD    SensSD    SpecSD
    #1 0.7856182 0.5 0.9096 0.844 0.884 0.0888236 0.1473023 0.1218229
    

    which can also be obtained by averaging the performance results over the cross-validation folds (10 folds, 5 repeats, 10*5 = 50 values for each performance measure):

    colMeans(ir$resample[1:3])
    #     ROC   Sens   Spec 
    #  0.9096 0.8440 0.8840 
    

    Hence, getTrainPerf only summarizes the cross-validation performance measured on the folds held out for validation (not on the entire training dataset), for the best tuned parameters (sigma, C).

    But if you call predict on the fitted model and score it on your entire training dataset (as in the confusionMatrix(predict(ir), ...) call above), you get resubstitution predictions: the model is evaluated on the same data it was fit on, so those numbers will usually look better than the cross-validated ones. That is the source of the discrepancy; the "real", post-cross-validation predictions are the held-out ones summarized by getTrainPerf and stored in ir$pred (because savePredictions = TRUE was set), as sketched below.
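
    For completeness, a sketch of how to pull those held-out predictions out of the fitted object. It relies only on the savePredictions = TRUE setting already used above; the variable name oof is just illustrative.

    ## Held-out (out-of-fold) predictions, restricted to the best tuned sigma/C
    oof <- merge(ir$pred, ir$bestTune)
    nrow(oof)   # 100 observations x 5 repeats = 500 held-out predictions

    ## Cross-validated confusion matrix, aggregated over all folds and repeats
    confusionMatrix(oof$pred, oof$obs, positive = "setosa")

    ## For comparison: resubstitution predictions on the full training data,
    ## as in the question -- these are typically more optimistic
    confusionMatrix(predict(ir), iris$Species, positive = "setosa")

    (caret also provides a confusionMatrix method for train objects, so confusionMatrix(ir) reports the average resampled confusion matrix as percentages.)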