
SVM performance not consistent with AUC score


I have a dataset that contains information about patients. It includes several variables and their clinical status (0 if they are healthy, 1 if they are sick). I have tried to implement an SVM model to predict patient status based on these variables.

library(e1071)
library(ROCR)  # for prediction() and performance()

Index <- 
  order(Ytrain, decreasing = FALSE)

SVMfit_Var <- 
  svm(Xtrain[Index, ], Ytrain[Index],
      type = "C-classification", gamma = 0.005, probability = TRUE, cost = 0.001, epsilon = 0.1)


preds1 <- 
  predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <- 
  attr(preds1, "probabilities")[,1]

samples <- !is.na(Ytest)
pred <- prediction(preds1[samples], Ytest[samples])
AUC <- performance(pred, "auc")@y.values[[1]]


prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)

To test the performance of the model, I calculated the ROC AUC, and on the validation set I obtain AUC = 0.997. But when I look at the predicted classes, every patient has been assigned to the healthy class.

AUC = 0.997
> xtab
     prediction
Ytest  0  1
    0 72  0
    1 52  0
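
For reference, the accuracy implied by this table is just the share of healthy patients (a quick check, reusing xtab from above):

sum(diag(xtab)) / sum(xtab)  # accuracy: 72/124, about 0.58
sum(xtab[1, ]) / sum(xtab)   # proportion of healthy patients: the same 72/124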

Can anyone help me with this problem?


Solution

  • Did you look at the probabilities versus the fitted values? You can read about how probability works with SVM here.
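
    As a quick way to do that, line up a few predicted labels against the full probability matrix (a minimal sketch reusing SVMfit_Var and Xtest from your question; pr and probs are just new helper names, and the column names of the matrix tell you which class each probability column refers to):

    pr <- predict(SVMfit_Var, Xtest, probability = TRUE)
    probs <- attr(pr, "probabilities")
    
    colnames(probs)                      # which class each column belongs to
    head(data.frame(label = pr, probs))  # labels next to their probabilities
    # if every row shows (nearly) identical probabilities while the labels
    # are all one class, the fit is effectively degenerate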

    If you want to look at the performance, you can use the function Conf from the library DescTools, or the function confusionMatrix from the library caret. (They provide the same output.)

    library(DescTools)
    library(caret)
    
    # training performance with DescTools
    # (svm.model$fitted vs. the y values used for training)
    Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
    
    # training performance with caret
    # (if the y values aren't already factors, wrap them in as.factor())
    confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
    
    # testing performance with DescTools
    # (unlike table(Ytest, prediction) in your question, flip the order here:
    #  predicted values first, then actual values)
    Conf(table(prediction, Ytest))
    
    # and for caret
    confusionMatrix(prediction, as.factor(Ytest))
    

    Your question isn't reproducible, so I went through the same steps with the iris data. There, too, the probability was the same for every observation; I've included it so you can see this behaviour with another data set.

    library(e1071)
    library(ROCR)
    library(caret)
    library(dplyr)  # for %>% and filter()
    
    data("iris")
    
    # make it binary
    df1 <- iris %>% filter(Species != "setosa") %>% droplevels()
    # check the subset
    summary(df1)
    
    set.seed(395) # keep the sample repeatable
    tr <- sample(1:nrow(df1), size = 70, # 70%
                 replace = F)
    
    # create the model
    svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
                   type = "C-classification",
                   gamma = .005, probability = T,
                   cost = .001, epsilon = .1)
    
    # look at probabilities
    pb.fit <- predict(svm.fit, df1[-tr, -5], probability = T) 
                # this shows EVERY row has the same outcome probability distro
    pb.fit <- attr(pb.fit, "probabilities")[,1]
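    
    # sanity check: if every row really gets the same probability
    # distribution, the column-wise spread of the matrix is ~0
    apply(attr(predict(svm.fit, df1[-tr, -5], probability = T),
               "probabilities"), 2, sd)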
    
    # look at performance 
    performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
    # [1] 0.03555556  that's abysmal!! 
    
    # test the model
    p.fit = predict(svm.fit, df1[-tr, -5])
    confusionMatrix(p.fit, df1[-tr, ]$Species)
    # 93% accuracy with NIR at 50%... the AUC score was not useful
    
    # check the trained model performance
    confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
    # 87%, with NIR at 50%... that's really good
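
    One likely explanation for that abysmal AUC (and, by the same token, for the suspiciously high AUC in your question) is which column you pull from the probability matrix: [, 1] grabs whichever class e1071 happens to list first, while ROCR's prediction() assumes higher scores mean the positive class. If column 1 is actually the negative class, the ranking is inverted and you see 1 - AUC instead of AUC. A quick check, as a sketch on the objects above:

    # which class does column 1 of the probability matrix refer to?
    colnames(attr(predict(svm.fit, df1[-tr, -5], probability = T),
                  "probabilities"))
    
    # flipping the scores gives the complementary AUC
    performance(prediction(1 - pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
    # about 0.9644, i.e. 1 - 0.0356, in line with the 93% accuracy above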