Tags: machine-learning, data-science, cross-validation

Cross-validation and Improvement


I was wondering how the cross-validation process can improve a model. I am totally new to this field and keen to learn. I understand the principle of cross-validation but don't understand how it improves a model. Let's say the data is divided into 4 folds: if I train my model on the first three folds and test on the last one, the model will train fine. But when I repeat this step by training the model on the last three folds and testing on the first one, most of the training data has already been "seen" by the model, hasn't it? The model won't improve with data it has already seen, right? Or is the final model a "mean" of the models built with the different training sets?

Thank you in advance for your time!


Solution

  • Cross-validation doesn't actually improve the model; it helps you score its performance more accurately.

    Let's say at the beginning of your training you divide your data into an 80% train set and a 20% test set. You then train on that 80%, test on the 20%, and obtain a performance metric.

    The problem is that when you separated the data at the beginning, you did so (hopefully) randomly, or otherwise somewhat arbitrarily. As a result, the performance you measured depends in part on the pseudo-random number generator you used, or on your own judgement.

    So instead you divide your data into, for example, 5 random equal sets (folds). You take fold 1, put it aside, train on folds 2-5, test on fold 1, and record the performance metric. Then you put aside fold 2, train a fresh (untrained) model on folds 1 and 3-5, test on fold 2, record the metric, and so on.

    After all 5 folds you will have 5 performance metrics. Their average (of the most appropriate kind) is a better representation of your model's performance, because it "averages out" the random effects of the data splitting (see the sketch after this answer).

    I think it is explained well in this blog with some code in Python.
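Since the blog link isn't reproduced here, below is a minimal sketch of the same idea, assuming scikit-learn is available; the dataset, model, and accuracy metric are arbitrary choices for illustration, not anything from the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Toy data: 500 samples, 10 features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1) A single 80/20 split: the score depends on which rows the RNG
#    happened to put in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
single_score = (
    LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
)
print(f"single 80/20 split accuracy: {single_score:.3f}")

# 2) 5-fold cross-validation: a fresh model is trained 5 times, each time
#    holding out a different fold for testing; the 5 test scores are then
#    averaged to get a less split-dependent estimate of performance.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(f"per-fold accuracies: {np.round(scores, 3)}")
print(f"mean CV accuracy:    {scores.mean():.3f}")
```

Note that cross-validation only produces the averaged score; if you then want a model to deploy, you typically retrain once on all the data.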