Tags: machine-learning, data-science, cross-validation

Cross-validation and Improvement


I was wondering how the cross-validation process can improve a model. I am totally new to this field and keen to learn. I understand the principle of cross-validation but don't understand how it improves a model. Let's say the data is divided into 4 folds: if I train my model on the first three folds and test on the last one, the model will train fine. But when I repeat this step by training the model on the last three folds and testing on the first one, most of the training data has already been "seen" by the model, hasn't it? The model won't improve with data it has already seen, right? Or is the final model a "mean" of the models built with the different training sets?

Thank you in advance for your time!


Solution

  • Cross-validation doesn't actually improve the model; it helps you score its performance more accurately.

    Let's say at the beginning of your training you divide your data into an 80% train set and a 20% test set. You then train on that 80%, test on the 20%, and obtain a performance metric.

    The problem is that when you separated the data at the beginning, you did so (hopefully) randomly, or otherwise somewhat arbitrarily. As a result, the performance you measured depends in part on the pseudo-random number generator you used, or on your own judgement.

    So instead you divide your data into, for example, 5 random equal sets (folds). You take fold 1, put it aside, train on folds 2-5, test on fold 1, and record the performance metric. Then you put aside fold 2, train a fresh (untrained) model on folds 1 and 3-5, test on fold 2, record the metric, and so on.

    After all 5 folds you will have 5 performance metrics. Their average (of the most appropriate kind) is a better representation of your model's performance, because it "averages out" the random effects of the data splitting (see the sketch after this answer).

    I think it is explained well in this blog with some code in Python.
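Since the blog link isn't reproduced here, below is a minimal sketch of the same idea, assuming scikit-learn is available; the dataset, model, and accuracy metric are arbitrary choices for illustration, not anything from the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Toy data: 500 samples, 10 features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1) A single 80/20 split: the score depends on which rows the RNG
#    happened to put in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
single_score = (
    LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
)
print(f"single 80/20 split accuracy: {single_score:.3f}")

# 2) 5-fold cross-validation: a fresh model is trained 5 times, each time
#    holding out a different fold for testing; the 5 test scores are then
#    averaged to get a less split-dependent estimate of performance.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(f"per-fold accuracies: {np.round(scores, 3)}")
print(f"mean CV accuracy:    {scores.mean():.3f}")
```

Note that cross-validation only produces the averaged score; if you then want a model to deploy, you typically retrain once on all the data.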