I'm trying to combine 3 models into an ensemble model:
Note: All the code here is using the caret package's train() function.
> Bayes_model
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 75305, 75305, 75306, 75305, 75306, 75307, ...
Resampling results:
ROC Sens Spec
0.5831236 1 0
> linear_cv_model
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 75306, 75305, 75305, 75306, 75306, 75305, ...
Resampling results:
ROC Sens Spec
0.5776342 1 0
> rf_model_best
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 75305, 75305, 75306, 75305, 75306, 75307, ...
Resampling results:
ROC Sens Spec
0.5551996 1 0
Individually, the three models have a very poor AUC in the 0.55 to 0.60 range, but they are not highly correlated, so I hoped to ensemble them. Here is the basic code in R:
Bayes_pred = predict(Bayes_model,train,type="prob")[,2]
linear_pred = predict(linear_cv_model,train,type="prob")[,2]
rf_pred = predict(rf_model_best,train,type="prob")[,2]
stacked = cbind(Bayes_pred,linear_pred,rf_pred,train[,"target"])
So this results in a data frame with four columns: the three model predictions and the target. I thought the idea was to now train another meta-model on these three predictors, but when I do so I get an AUC of 1 no matter what combination of XGBoost hyperparameters I try, so I know something is wrong.
Is this setup conceptually incorrect?
meta_model = train(target ~ ., data = stacked,
                   method = "xgbTree",
                   metric = "ROC",
                   trControl = trainControl(method = "cv",
                                            number = 10,
                                            classProbs = TRUE,
                                            summaryFunction = twoClassSummary),
                   na.action = na.pass,
                   tuneGrid = grid)
Results:
> meta_model
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 75306, 75306, 75307, 75305, 75306, 75305, ...
Resampling results:
ROC Sens Spec
1 1 1
I feel like, given the CV folds, a perfect AUC is definitely indicative of a data error. When I try logistic regression as the meta-model I also get perfect separation. It just doesn't make sense.
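For reference, the logistic regression check was essentially this (a sketch, assuming stacked is a data frame whose outcome column is the No/Yes factor Target shown in the summary below):

sep_check = glm(Target ~ Bayes_pred + linear_pred + rf_pred,
                data = stacked, family = binomial)
summary(sep_check)   # glm warns about fitted probabilities of 0 or 1 when the classes separate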
> summary(stacked)
Bayes_pred linear_pred rf_pred Target
Min. :0.01867 Min. :0.02679 Min. :0.00000 No :74869
1st Qu.:0.08492 1st Qu.:0.08624 1st Qu.:0.01587 Yes: 8804
Median :0.10297 Median :0.10339 Median :0.04762
Mean :0.10520 Mean :0.10522 Mean :0.11076
3rd Qu.:0.12312 3rd Qu.:0.12230 3rd Qu.:0.07937
Max. :0.50483 Max. :0.25703 Max. :0.88889
I know this isn't reproducible code, but I think the issue isn't dataset-dependent. As shown above, I have three predictions that are not the same and certainly don't have great AUC values individually. Combined, I should see some improvement, but not perfect separation.
EDIT: Using the very helpful advice from T. Scharf, here is how I grabbed the out-of-fold predictions to use in the meta-model. The predictions are stored in the model object under "pred", but they are not in the original row order, so you need to reorder them before stacking.
Using dplyr's arrange() function, this is how I got the predictions for the Bayes model:
Bayes_pred = arrange(as.data.frame(Bayes_model$pred)[,c("Yes","rowIndex")],rowIndex)[,1]
In my case, "Bayes_model" is the caret train object and "Yes" is the target class that I am modeling.
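Applying the same idea to all three models gives the full stacked training set for the meta-model. Here is a rough sketch (the oof_probs() helper is only for illustration; it assumes each model kept out-of-fold predictions for a single final tuning combination, so $pred has one row per training observation):

library(dplyr)

# helper: pull the out-of-fold class probabilities and restore the original row order
oof_probs = function(model, class = "Yes") {
  arrange(as.data.frame(model$pred)[, c(class, "rowIndex")], rowIndex)[[class]]
}

stacked = data.frame(Bayes_pred  = oof_probs(Bayes_model),
                     linear_pred = oof_probs(linear_cv_model),
                     rf_pred     = oof_probs(rf_model_best),
                     Target      = train[["target"]])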
Here's what is happening.
When you do this:
Bayes_pred = predict(Bayes_model,train,type="prob")[,2]
linear_pred = predict(linear_cv_model,train,type="prob")[,2]
rf_pred = predict(rf_model_best,train,type="prob")[,2]
THIS IS THE PROBLEM
You need out-of-fold predictions or test-set predictions as inputs to train the meta-model.
You are currently using the models you have trained, AND the data you trained them on. This will yield overly optimistic predictions, which you are now feeding to the meta-model to train on.
A good rule of thumb is to NEVER call predict on data with a model that has already seen that data; nothing good can happen.
Here's what you need to do:
When you train your initial 3 models, use method = "cv" and savePredictions = TRUE in trainControl().
This will retain the out-of-fold predictions, which you can use to train the meta-model.
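Something like this for each base model (a sketch only; the method and tuning grid are placeholders for whatever you actually used):

library(caret)

ctrl = trainControl(method = "cv",
                    number = 10,
                    classProbs = TRUE,
                    summaryFunction = twoClassSummary,
                    savePredictions = TRUE)   # or "final" to keep only the best tune's hold-out predictions

Bayes_model = train(target ~ ., data = train,
                    method = "nb",            # placeholder: use your actual base-model method
                    metric = "ROC",
                    trControl = ctrl)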
To convince yourself that your input data to the meta-model is wildly optimistic, calculate an individual AUC
for the 3 columns of this object:
stacked = cbind(Bayes_pred,linear_pred,rf_pred,train[,"target"])
versus the target: they will be really high, which is why your meta-model is so good. It's using overly good inputs.
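For example, with the pROC package (assuming stacked is a data frame with the three prediction columns and the factor Target from your summary):

library(pROC)

# in-sample AUC of each base model's predictions against the target
sapply(c("Bayes_pred", "linear_pred", "rf_pred"),
       function(col) as.numeric(auc(roc(stacked$Target, stacked[[col]]))))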
Hope this helps; meta-modeling is hard...