Tags: r, r-caret, xgboost, auc, ensemble-learning

Ensemble model predicting AUC 1


I'm trying to combine 3 models into an ensemble model:

  1. Model 1 - XGBoost
  2. Model 2 - RandomForest
  3. Model 3 - Logistic regression

Note: all the models here are trained with the caret package's train() function.

> Bayes_model

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 75305, 75305, 75306, 75305, 75306, 75307, ... 
Resampling results:

  ROC        Sens  Spec
  0.5831236  1     0   

> linear_cv_model

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 75306, 75305, 75305, 75306, 75306, 75305, ... 
Resampling results:

  ROC        Sens  Spec
  0.5776342  1     0   

> rf_model_best

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 75305, 75305, 75306, 75305, 75306, 75307, ... 
Resampling results:

  ROC        Sens  Spec
  0.5551996  1     0   

Individually, the three models have quite poor AUCs in the 0.55 to 0.60 range, but they are not strongly correlated, so I hoped to ensemble them. Here is the basic code in R:

Bayes_pred = predict(Bayes_model,train,type="prob")[,2]
linear_pred = predict(linear_cv_model,train,type="prob")[,2]
rf_pred = predict(rf_model_best,train,type="prob")[,2]
stacked = cbind(Bayes_pred,linear_pred,rf_pred,train[,"target"])

So this results in a data frame with four columns: the three model predictions and the target. I thought the idea was now to train another meta-model on these three predictors, but when I do so I get an AUC of 1 no matter what combination of XGBoost hyperparameters I try, so I know something is wrong.

Is this setup conceptually incorrect?

meta_model = train(target~ ., data = stacked,
               method = "xgbTree",
               metric = "ROC",
               trControl = trainControl(method = "cv",number = 10,classProbs = TRUE,
                                        summaryFunction = twoClassSummary
                                        ),
               na.action=na.pass,
               tuneGrid = grid
               )

Results:

> meta_model

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 75306, 75306, 75307, 75305, 75306, 75305, ... 
Resampling results:

  ROC  Sens  Spec
  1    1     1   

Even with the CV folds, a perfect AUC seems like a clear sign of a data error. When I try logistic regression as the meta-model, I also get perfect separation. It just doesn't make sense.
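For reference, the logistic-regression check looks roughly like this minimal sketch (assuming stacked is a data frame whose target column is a factor named target, matching the meta-model formula; glm() is base R):

# Plain logistic regression on the stacked in-sample predictions.
# When the classes are perfectly separable, glm() warns that
# "fitted probabilities numerically 0 or 1 occurred".
sep_check = glm(target ~ Bayes_pred + linear_pred + rf_pred,
                data = as.data.frame(stacked),
                family = binomial)
summary(sep_check)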

> summary(stacked)
   Bayes_pred       linear_pred         rf_pred        Target
 Min.   :0.01867   Min.   :0.02679   Min.   :0.00000   No :74869  
 1st Qu.:0.08492   1st Qu.:0.08624   1st Qu.:0.01587   Yes: 8804  
 Median :0.10297   Median :0.10339   Median :0.04762              
 Mean   :0.10520   Mean   :0.10522   Mean   :0.11076              
 3rd Qu.:0.12312   3rd Qu.:0.12230   3rd Qu.:0.07937              
 Max.   :0.50483   Max.   :0.25703   Max.   :0.88889 

I know this isn't reproducible code, but I think the issue isn't data-set dependent. As shown above, I have three predictions that are not identical and certainly don't have great AUCs individually. Combined, I would expect some improvement, but not perfect separation.


EDIT: Using the very helpful advice from T. Scharf, here is how to grab the out-of-fold predictions to use in the meta-model. The predictions are stored in the train object under "pred", but they are not in the original row order, so you need to reorder them before stacking.

Using dplyr's arrange() function, this is how I got the out-of-fold predictions for the Bayes model:

Bayes_pred = arrange(as.data.frame(Bayes_model$pred)[,c("Yes","rowIndex")],rowIndex)[,1]

In my case, "Bayes_model" is the caret train object and "Yes" is the target class that I am modeling.
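For completeness, the same extraction can be wrapped in a helper and applied to all three models before stacking. This is only a sketch (oof_probs() is a hypothetical name), and it assumes each model was trained with savePredictions = "final" (or a single tuning combination), so that $pred holds exactly one out-of-fold prediction per training row:

library(dplyr)

# Hypothetical helper: pull the out-of-fold "Yes" probabilities from a
# caret train object and restore the original row order.
oof_probs = function(model, class = "Yes") {
  preds = arrange(as.data.frame(model$pred), rowIndex)
  preds[[class]]
}

Bayes_pred  = oof_probs(Bayes_model)
linear_pred = oof_probs(linear_cv_model)
rf_pred     = oof_probs(rf_model_best)

# Stack the out-of-fold predictions with the target for the meta-model.
stacked = data.frame(Bayes_pred, linear_pred, rf_pred,
                     target = train[, "target"])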


Solution

  • Here's what is happening.

    When you do this:

    Bayes_pred = predict(Bayes_model,train,type="prob")[,2]
    linear_pred = predict(linear_cv_model,train,type="prob")[,2]
    rf_pred = predict(rf_model_best,train,type="prob")[,2]
    

    THIS IS THE PROBLEM

    You need out-of-fold predictions (or predictions on held-out data) as inputs to train the meta-model.

    You are currently using the models you have trained AND the data you trained them on. This yields overly optimistic predictions, which you are now feeding to the meta-model to train on.

    A good rule of thumb is to NEVER call predict on data with a model that has already seen that data; nothing good can happen.

    Here's what you need to do:

    When you train your initial 3 models, use method = "cv" and savePredictions = TRUE (or "final") in trainControl(). This retains the out-of-fold predictions, which are what you should use to train the meta-model; see the sketch at the end of this answer.

    To convince yourself that your input data to the meta-model is wildly optimistic, calculate an individual AUC for the 3 columns of this object:

    stacked = cbind(Bayes_pred,linear_pred,rf_pred,train[,"target"])

    versus the target. They will be really high, which is why your meta-model is so good: it's using overly good inputs.

    hope this helps, meta modeling is hard...
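
    To make the fix concrete, here is a minimal sketch (not the code from the original setup) of both steps: training a base model so the out-of-fold probabilities are kept, and then comparing in-sample AUC against out-of-fold AUC for that model. The object names train, target, and the class label "Yes" come from the question; pROC is used for the AUC check, and random forest is just an example method.

    library(caret)
    library(pROC)

    # 1) Keep the out-of-fold predictions for the chosen tuning parameters.
    ctrl = trainControl(method = "cv",
                        number = 10,
                        classProbs = TRUE,
                        summaryFunction = twoClassSummary,
                        savePredictions = "final")

    rf_model_best = train(target ~ ., data = train,
                          method = "rf",
                          metric = "ROC",
                          trControl = ctrl)

    # 2) Compare in-sample AUC vs. out-of-fold AUC for the same model.
    in_sample = predict(rf_model_best, train, type = "prob")[, "Yes"]
    oof = rf_model_best$pred[order(rf_model_best$pred$rowIndex), "Yes"]

    auc(train$target, in_sample)  # inflated: the model has already seen this data
    auc(train$target, oof)        # honest: roughly matches the reported CV ROC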