
{caret} xgbTree: There were missing values in resampled performance measures


I'm attempting to run a 5-fold XGBoost model on this dataset. When I run the following code:

  train_control <- trainControl(method = "cv", 
                                search = "random", 
                                number = 5,
                                verboseIter = TRUE)

  # Train Models 
  xgb.mod <- train(Vote_perc ~ .,
                   data = forkfold, 
                   trControl = train_control, 
                   method = "xgbTree", 
                   family = binomial())

I receive a warning of:

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Furthermore, the "predict" function runs, but all predictions are the same number. I suspect it's an intercept-only model, but I'm not sure. Also, when I remove the

search="random"

argument, it runs properly. I want to run random searches so that I can isolate which hyperparameters might be most effective, but every time I try, I get that warning. What am I missing? Thank you!


Solution

  • Here is one approach you could take with your data:

    Load the data:

    forkfold  <- read.csv("forkfold.csv", row.names = 1)
    

    The problem here is that the outcome variable is 0 in 97% of the cases, while in the remaining 3% it is very close to zero.

    length(forkfold$Vote_perc)
    #output
    7069
    
    sum(forkfold$Vote_perc != 0)
    #output 
    212
    

    You described it as a classification problem, and I will treat it as such by converting the outcome to a binary one (the "one"/"zero" labels are used because caret requires valid R variable names for class levels when class probabilities are computed):

    forkfold$Vote_perc <- ifelse(forkfold$Vote_perc != 0,
                                 "one",
                                 "zero")
    

    Since the set is highly imbalanced, using Accuracy as the selection metric is out of the question. Here I will try to maximize Sensitivity + Specificity, as described here, by defining a custom evaluation function:

    fourStats <- function(data, lev = levels(data$obs), model = NULL) {
      # start from caret's twoClassSummary: ROC, Sens and Spec
      out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
      # Euclidean distance of the current (Spec, Sens) point from the
      # ideal point (1, 1); the smaller, the better
      coords <- matrix(c(1, 1, out["Spec"], out["Sens"]), 
                       ncol = 2, 
                       byrow = TRUE)
      colnames(coords) <- c("Spec", "Sens")
      rownames(coords) <- c("Best", "Current")
      c(out, Dist = dist(coords)[1])
    }
    

    I will specify this function in trainControl:

    train_control <- trainControl(method = "cv", 
                                  search = "random", 
                                  number = 5,
                                  verboseIter=TRUE,
                                  classProbs = TRUE,
                                  savePredictions = "final",
                                  summaryFunction = fourStats)
    
    set.seed(1)
    xgb.mod <- train(Vote_perc~.,
                     data = forkfold, 
                     trControl = train_control, 
                     method = "xgbTree", 
                     tuneLength = 50,
                     metric = "Dist",
                     maximize = FALSE,
                     scale_pos_weight = sum(forkfold$Vote_perc == "zero")/sum(forkfold$Vote_perc == "one"))
    

    I will use the Dist metric defined above in the fourStats summary function. Since this metric should be minimized, I set maximize = FALSE. I will use a random search over the tuning space, and 50 random sets of hyperparameter values will be tested (tuneLength = 50).
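
    To see what this summary function returns, here is a minimal sketch on a made-up toy resample (the values are purely illustrative; the column layout mimics what caret hands to a summaryFunction: observed and predicted classes plus one probability column per class level):

    toy <- data.frame(obs  = factor(c("one", "zero", "zero", "one"), levels = c("one", "zero")),
                      pred = factor(c("one", "zero", "one",  "zero"), levels = c("one", "zero")),
                      one  = c(0.80, 0.30, 0.60, 0.40),  # predicted probability of "one"
                      zero = c(0.20, 0.70, 0.40, 0.60))
    
    fourStats(toy)
    # ROC, Sens and Spec from twoClassSummary, plus
    # Dist = sqrt((1 - Sens)^2 + (1 - Spec)^2)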

    I also set the scale_pos_weight parameter of the xgboost function. From the help page ?xgboost:

    scale_pos_weight, [default=1] Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases) See Parameters Tuning for more discussion. Also see Higgs Kaggle competition demo for examples: R, py1, py2, py3

    I defined it as recommended: sum(negative cases) / sum(positive cases).
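
    For this data the ratio follows from the counts shown above (6857 zeros and 212 ones):

    sum(forkfold$Vote_perc == "zero") / sum(forkfold$Vote_perc == "one")
    #output
    32.34434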

    After the model trains, it will pick the set of hyperparameters that minimizes Dist.
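
    You can inspect the winning combination (and the metrics of all 50 tested combinations) on the fitted object:

    xgb.mod$bestTune   # the selected hyperparameter values
    xgb.mod$results    # Dist and the other summary metrics for every tested combination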

    To evaluate the confusion matrix on the held-out (out-of-fold) predictions:

    caret::confusionMatrix(xgb.mod$pred$pred, xgb.mod$pred$obs)
    
    Confusion Matrix and Statistics
    
              Reference
    Prediction  one zero
          one   195  430
          zero   17 6427
    
                   Accuracy : 0.9368          
                     95% CI : (0.9308, 0.9423)
        No Information Rate : 0.97            
        P-Value [Acc > NIR] : 1               
    
                      Kappa : 0.4409          
     Mcnemar's Test P-Value : <2e-16          
    
                Sensitivity : 0.91981         
                Specificity : 0.93729         
             Pos Pred Value : 0.31200         
             Neg Pred Value : 0.99736         
                 Prevalence : 0.02999         
             Detection Rate : 0.02759         
       Detection Prevalence : 0.08841         
          Balanced Accuracy : 0.92855         
    
           'Positive' Class : one 
    

    I'd say it's not that bad.

    You can do better if you tune the cutoff threshold of the predictions; how to do this during the tuning process is described here. You can also use the out-of-fold predictions for tuning the cutoff threshold. Here I will show how to use the pROC library for it:

    library(pROC)
    
    plot(roc(xgb.mod$pred$obs, xgb.mod$pred$one),
         print.thres = TRUE)
    

    [ROC curve of the out-of-fold predictions, with the best threshold printed on the plot]

    The threshold shown on the image maximizes Sens + Spec.
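
    If you would rather extract that threshold programmatically than read it off the plot, pROC::coords can return it (a small sketch; roc_obj is just a name I give the ROC object):

    roc_obj <- roc(xgb.mod$pred$obs, xgb.mod$pred$one)
    coords(roc_obj, "best", best.method = "youden")
    # returns the threshold together with the corresponding specificity and sensitivity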

    To evaluate the out-of-fold performance using this threshold:

    caret::confusionMatrix(factor(ifelse(xgb.mod$pred$one > 0.369, "one", "zero"),
                                  levels = levels(xgb.mod$pred$obs)),
                           xgb.mod$pred$obs)
    #output
    Confusion Matrix and Statistics
    
              Reference
    Prediction  one zero
          one   200  596
          zero   12 6261
    
                   Accuracy : 0.914           
                     95% CI : (0.9072, 0.9204)
        No Information Rate : 0.97            
        P-Value [Acc > NIR] : 1               
    
                      Kappa : 0.3668          
     Mcnemar's Test P-Value : <2e-16          
    
                Sensitivity : 0.94340         
                Specificity : 0.91308         
             Pos Pred Value : 0.25126         
             Neg Pred Value : 0.99809         
                 Prevalence : 0.02999         
             Detection Rate : 0.02829         
       Detection Prevalence : 0.11260         
          Balanced Accuracy : 0.92824         
    
           'Positive' Class : one 
    

    So out of the 212 non-zero entries, you detected 200.

    To perform better, you could try pre-processing the data, or use a better hyperparameter search routine such as the mlrMBO package, which is intended for use with mlr. Or perhaps change the learner (though I doubt you can top xgboost here).

    One more note: if a high Sensitivity is not paramount, using "Kappa" as the selection metric might provide a more satisfying model.
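
    A minimal sketch of what that would look like (the object names are mine; defaultSummary already reports Accuracy and Kappa, so no custom summary function is needed):

    train_control_kappa <- trainControl(method = "cv",
                                        search = "random",
                                        number = 5,
                                        summaryFunction = defaultSummary)
    
    set.seed(1)
    xgb.kappa <- train(Vote_perc ~ .,
                       data = forkfold,
                       trControl = train_control_kappa,
                       method = "xgbTree",
                       tuneLength = 50,
                       metric = "Kappa")   # Kappa is maximized by default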

    As a final note, let's check the performance of the model with the default scale_pos_weight = 1, using the already selected hyperparameters:

    set.seed(1)
    xgb.mod2 <- train(Vote_perc~.,
                      data = forkfold, 
                      trControl = train_control, 
                      method = "xgbTree", 
                      tuneGrid = data.frame(nrounds = 498,
                                            max_depth = 3,
                                            eta = 0.008833468,
                                            gamma = 4.131242,
                                            colsample_bytree = 0.4233169,
                                            min_child_weight = 3,
                                            subsample = 0.6212512),
                      metric = "Dist",
                      maximize = FALSE,
                      scale_pos_weight = 1)
    
    caret::confusionMatrix(xgb.mod2$pred$pred, xgb.mod2$pred$obs)
    #output
    Confusion Matrix and Statistics
    
              Reference
    Prediction  one zero
          one    94   21
          zero  118 6836
    
                   Accuracy : 0.9803          
                     95% CI : (0.9768, 0.9834)
        No Information Rate : 0.97            
        P-Value [Acc > NIR] : 3.870e-08       
    
                      Kappa : 0.5658          
     Mcnemar's Test P-Value : 3.868e-16       
    
                Sensitivity : 0.44340         
                Specificity : 0.99694         
             Pos Pred Value : 0.81739         
             Neg Pred Value : 0.98303         
                 Prevalence : 0.02999         
             Detection Rate : 0.01330         
       Detection Prevalence : 0.01627         
          Balanced Accuracy : 0.72017         
    
           'Positive' Class : one  
    

    So much worse at the default threshold of 0.5.

    And the optimal threshold value:

    plot(roc(xgb.mod2$pred$obs, xgb.mod2$pred$one),
         print.thres = TRUE)
    

    [ROC curve for xgb.mod2, with the best threshold printed on the plot]

    It is 0.037, compared to the 0.369 obtained when we set scale_pos_weight as recommended. However, with the optimal threshold, both approaches yield identical predictions.
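
    If you want to check that on the out-of-fold predictions, a quick sketch (0.369 and 0.037 are the cutoffs read off the two plots; the rows of $pred are re-ordered by rowIndex so the two models line up):

    p1 <- xgb.mod$pred[order(xgb.mod$pred$rowIndex), ]
    p2 <- xgb.mod2$pred[order(xgb.mod2$pred$rowIndex), ]
    
    table(ifelse(p1$one > 0.369, "one", "zero"),
          ifelse(p2$one > 0.037, "one", "zero"))
    # cross-tabulates the two sets of thresholded predictions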