I'm running xgboost on some simulated data, where my sample size is 125. I am measuring the 5-fold cross-validation error, i.e., in each round my training sample size is 100 and my testing sample size is 25. Assume all other parameters are fixed except "n_estimators", i.e., the number of boosting rounds.
I have two options (a rough code sketch of both follows the list):
run the 5-fold CV for different values of n_estimators without early stopping; in this case, I can choose the best n_estimators from the CV results;
further split the training sample into training (80) and validation (20), train the model on the 80 training observations, and monitor early stopping on the 20 validation observations; in this case I can set a huge n_estimators and let it stop automatically.
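For concreteness, here is roughly what I mean by the two options, with made-up data standing in for my simulation and placeholder hyperparameters (the depth, learning rate, n_estimators grid, and early-stopping patience are just examples, and passing early_stopping_rounds to the constructor assumes a recent xgboost version):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from xgboost import XGBClassifier

# Made-up data standing in for the 125-observation simulation.
rng = np.random.default_rng(0)
X = rng.normal(size=(125, 10))
y = rng.integers(0, 2, size=125)

# Option 1: 5-fold CV over a grid of n_estimators, no early stopping.
candidates = [50, 100, 200, 400]           # placeholder grid
kf = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for n in candidates:
    fold_scores = []
    for train_idx, test_idx in kf.split(X):
        model = XGBClassifier(n_estimators=n, max_depth=2, learning_rate=0.1)
        model.fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[test_idx], y[test_idx]))
    cv_scores[n] = np.mean(fold_scores)
best_n = max(cv_scores, key=cv_scores.get)  # best n_estimators from the CV results

# Option 2: split the 100 training observations into 80 train / 20 validation,
# set a huge n_estimators, and let early stopping pick the round count.
X_tr, X_val, y_tr, y_val = train_test_split(X[:100], y[:100],
                                            test_size=0.2, random_state=0)
model = XGBClassifier(n_estimators=5000, max_depth=2, learning_rate=0.1,
                      early_stopping_rounds=20)  # constructor arg in recent xgboost
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(best_n, model.best_iteration)
```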
The questions are:
In option 1, if I have another separate testing sample, can I apply the 5 cross-validation models to the testing data and compute the average / majority vote? Or do I need to retrain the model with the best parameters on all 125 observations and make predictions on the testing set?
In option 2, are the 80 training observations enough to train the model, and are the 20 validation observations enough to monitor the performance? (In option 1 we also have a small sample size, but a little more.)
Which option is better for comparing the xgboost model with other models?
Summary: what is the best way to choose a model for a small sample size?
Using a very small amount of data as validation data carries a very high risk of overfitting, so it is not recommended. Option 1 is better than option 2, and averaging the predictions of the cross-validation models is a better choice than retraining the model with the best parameters on all 125 observations.
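For the averaging, the idea is to keep all five fold models and combine their predictions on the separate testing sample, either by averaging predicted probabilities or by majority vote. A minimal sketch, again with made-up data and placeholder hyperparameters (best_n stands for whatever value the option-1 CV selected):

```python
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

# Made-up data: X, y play the role of the 125 observations,
# X_test the separate testing sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(125, 10))
y = rng.integers(0, 2, size=125)
X_test = rng.normal(size=(30, 10))

best_n = 100  # placeholder: the n_estimators chosen from the option-1 CV

# Refit one model per CV fold and keep all five.
fold_models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = XGBClassifier(n_estimators=best_n, max_depth=2, learning_rate=0.1)
    m.fit(X[train_idx], y[train_idx])
    fold_models.append(m)

# Average the predicted probabilities across the five models ...
avg_proba = np.mean([m.predict_proba(X_test)[:, 1] for m in fold_models], axis=0)
pred_avg = (avg_proba > 0.5).astype(int)

# ... or take a majority vote over the hard predictions.
votes = np.stack([m.predict(X_test) for m in fold_models])
pred_vote = (votes.mean(axis=0) > 0.5).astype(int)
```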
However, with a sample this small, the best approach is to choose a simple model over a complicated one.