I am fairly new to ML in R and am trying to build a 10-fold cross-validated xgboost model. I have come across the docs here:
https://www.rdocumentation.org/packages/xgboost/versions/1.1.1.1/topics/xgb.cv
However, the example uses only training data, and there is nowhere to specify a test set or test split. Does this make sense to anyone, and can anyone who has used the package help me understand exactly how this cross-validated model is tested?
Thanks
The xgboost package allows you to choose whether to use the inbuilt cross-validation method or to handle the train/test splitting yourself.
Of course you can do both and see the difference!
If you scroll down the page you linked for the xgb.cv method to the "Details" section, you will see a brief description of how you can extract information from the completed model.
With 10-fold cross-validation, xgb.cv internally splits your data into ten folds: in each round one fold (10%) is held out for testing and the remaining nine folds (90%) are used for training, so every observation is used for testing exactly once. In effect the algorithm builds and evaluates ten different models and presents you with the aggregated results. You can then adjust the various hyperparameters to improve your model, either manually or through, say, a grid search.
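As a rough sketch of how that call might look (the toy data, parameter values and metric here are my own placeholders, not from the docs):

library(xgboost)

# toy data as placeholders: a numeric feature matrix and a 0/1 label vector
set.seed(42)
train_x <- matrix(rnorm(500 * 10), nrow = 500)
train_y <- rbinom(500, 1, 0.5)
dtrain  <- xgb.DMatrix(data = train_x, label = train_y)

cv <- xgb.cv(
  params  = list(objective = "binary:logistic", eta = 0.1, max_depth = 4),
  data    = dtrain,
  nrounds = 100,
  nfold   = 10,        # each fold is roughly 10% test / 90% train
  metrics = "error",
  early_stopping_rounds = 10,
  verbose = FALSE
)

# mean train/test metric per boosting round, averaged over the 10 folds
head(cv$evaluation_log)

The "test" columns of evaluation_log are the cross-validated performance; there is no separate test set to pass in, because the folds provide it.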
If you want to do your own data split rather than use the inbuilt cross-validation method, then use the "vanilla" form of the algorithm:
model <- xgboost(data = ......etc) # in R
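Filled out a little (the argument values are only illustrative, and this reuses the toy train_x / train_y from the sketch above):

# plain, non-CV fit on whatever training data you have prepared yourself
model <- xgboost(
  data      = xgb.DMatrix(train_x, label = train_y),
  nrounds   = 100,
  objective = "binary:logistic",
  verbose   = 0
)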
An advantage, I think, of the xgb.cv formulation is that it gives you access to many more hyperparameters to tweak.
The plain xgboost(....) model with your own train/test split, rather than the inbuilt CV version, may be better or even essential in some cases, for example where your data have a time-sensitive structure. Say you were interested in sales data over the past 10 years: it may be better to take the first nine years of data for training and use the last year as your test set.
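A sketch of that kind of chronological split (sales_df, its year column, feature_cols and the sales target are all hypothetical names):

# hold out the most recent year; train on the nine years before it
train_df <- sales_df[sales_df$year <  max(sales_df$year), ]
test_df  <- sales_df[sales_df$year == max(sales_df$year), ]

dtrain <- xgb.DMatrix(as.matrix(train_df[, feature_cols]), label = train_df$sales)
dtest  <- xgb.DMatrix(as.matrix(test_df[, feature_cols]),  label = test_df$sales)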
What I did was to start with the "vanilla" formulation and build a model with default parameters. This became my baseline model for comparison purposes. Successively more complex models could then be built and their performance compared to this baseline.
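Reusing the dtrain / dtest objects from the split above, that comparison might look something like this (parameter values are only illustrative):

# baseline: near-default settings
baseline <- xgboost(data = dtrain, nrounds = 50,
                    objective = "reg:squarederror", verbose = 0)
baseline_rmse <- sqrt(mean((predict(baseline, dtest) - getinfo(dtest, "label"))^2))

# a more complex candidate to judge against the baseline
tuned <- xgboost(data = dtrain, nrounds = 200,
                 objective = "reg:squarederror",
                 eta = 0.05, max_depth = 6, subsample = 0.8, verbose = 0)
tuned_rmse <- sqrt(mean((predict(tuned, dtest) - getinfo(dtest, "label"))^2))

c(baseline = baseline_rmse, tuned = tuned_rmse)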