Tags: cross-validation, xgboost, train-test-split

Does cross-validation + early stopping show the actual performance for a small sample?


I'm running xgboost on a simulation where my sample size is 125. I measure the 5-fold cross-validation error, i.e., in each fold my training sample has 100 observations and my testing sample has 25. Assume all other parameters are fixed except "n_estimators", the number of boosting rounds.

I have two options:

  • run 5-fold CV for different values of n_estimators without early stopping; in this case I can choose the best n_estimators from the CV results;

  • further split the training sample into training (80) and validation (20), train the model on the 80 training observations, and monitor early stopping on the 20 validation observations; in this case I can set a huge n_estimators and let it stop automatically (both options are sketched in the code after this list).
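
A minimal sketch of both options follows. It assumes a binary classification target, synthetic stand-in data with 125 observations, and xgboost >= 1.6 (its scikit-learn wrapper); the parameter values are illustrative, not taken from the original question.

```python
# A minimal sketch of both options; the data, target type (binary
# classification) and parameter values are illustrative assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(125, 10))          # 125 observations, 10 features
y = (X[:, 0] + rng.normal(size=125) > 0).astype(int)

# Option 1: 5-fold CV over a grid of n_estimators, no early stopping.
for n_estimators in [50, 100, 200, 400]:
    model = XGBClassifier(n_estimators=n_estimators, max_depth=2,
                          learning_rate=0.1, eval_metric="logloss")
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"n_estimators={n_estimators}: mean CV accuracy = {scores.mean():.3f}")

# Option 2: within one CV fold, hold out 25 obs for testing, split the
# remaining 100 into 80 train / 20 validation, and let early stopping
# choose the number of boosting rounds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=25, random_state=0, stratify=y)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=20, random_state=0, stratify=y_train)
model = XGBClassifier(n_estimators=2000, max_depth=2, learning_rate=0.1,
                      eval_metric="logloss", early_stopping_rounds=20)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("rounds chosen by early stopping:", model.best_iteration)
print("fold test accuracy:", model.score(X_test, y_test))
```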

The questions are

  • In option 1, if I have a separate testing sample, can I apply the 5 cross-validation models to the testing data and compute the average/majority vote? Or do I need to retrain the model with the best parameters on all 125 observations and predict on the testing set?

  • In option 2, are 80 training observations enough to train the model, and are 20 validation observations enough to monitor its performance? (In option 1 the sample size is also small, but a little better.)

  • Which option is better for comparing the xgboost model with other models?

Summary: what is the best way to choose a model for a small sample size?


Solution

  • Using a very small amount of data as validation data carries a very high risk of overfitting, so it is not recommended. Option 1 is better than option 2, and averaging the predictions of the cross-validation models (see the sketch below) is a better choice than retraining the model with the best parameters.

    However, with a sample this small, the best approach is to prefer a simple model over complicated ones.
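
To illustrate the averaging suggestion, here is a minimal sketch under the same assumed synthetic classification setup as above: the five fold models are kept, and their predicted probabilities on a separate testing sample are averaged (a soft majority vote).

```python
# A minimal sketch of averaging the five fold models on a separate
# testing sample; the data and parameters are illustrative assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(125, 10))
y = (X[:, 0] + rng.normal(size=125) > 0).astype(int)
X_new = rng.normal(size=(50, 10))        # separate testing sample
y_new = (X_new[:, 0] + rng.normal(size=50) > 0).astype(int)

# Fit one model per CV fold and keep all five.
fold_models = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in cv.split(X, y):
    m = XGBClassifier(n_estimators=100, max_depth=2, learning_rate=0.1,
                      eval_metric="logloss")
    m.fit(X[train_idx], y[train_idx])
    fold_models.append(m)

# Average the fold models' predicted probabilities on the new data and
# threshold at 0.5 (a soft majority vote).
avg_proba = np.mean([m.predict_proba(X_new)[:, 1] for m in fold_models], axis=0)
ensemble_pred = (avg_proba > 0.5).astype(int)
print("ensemble accuracy on the separate testing sample:",
      (ensemble_pred == y_new).mean())
```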