machine-learning, cross-validation

Cross Validation on Train, Validation & Test Set


In the scenario of having three sets

  • A train set of e.g. 80% (for model training)
  • A validation set of e.g. 10% (for model validation and hyperparameter tuning)
  • A test set of e.g. 10% (for final model testing)

let's say I perform k-fold cross validation (CV) on the example dataset of [1,2,3,4,5,6,7,8,9,10]. Let's also say

  • 10 is the test set in this example
  • the remaining [1,2,3,4,5,6,7,8,9] will be used for training and validation

leave-one-out CV would then look something like this:

# Fold 1
[2, 3, 4, 5, 6, 7, 8, 9] # train
[1]                      # validation
# Fold 2
[1, 3, 4, 5, 6, 7, 8, 9] # train
[2]                      # validation
# Fold 3
[1, 2, 4, 5, 6, 7, 8, 9] # train
[3]                      # validation
# Fold 4
[1, 2, 3, 5, 6, 7, 8, 9] # train
[4]                      # validation
# Fold 5
[1, 2, 3, 4, 6, 7, 8, 9] # train
[5]                      # validation
# Fold 6
[1, 2, 3, 4, 5, 7, 8, 9] # train
[6]                      # validation
# Fold 7
[1, 2, 3, 4, 5, 6, 8, 9] # train
[7]                      # validation
# Fold 8
[1, 2, 3, 4, 5, 6, 7, 9] # train
[8]                      # validation
# Fold 9
[1, 2, 3, 4, 5, 6, 7, 8] # train
[9]                      # validation
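
For reference, the folds above can be generated programmatically. A minimal sketch, assuming scikit-learn (its LeaveOneOut splitter) and NumPy:

import numpy as np
from sklearn.model_selection import LeaveOneOut

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
train_val, test_set = data[:-1], data[-1:]   # [1..9] for CV, [10] held out for testing

# LeaveOneOut yields one fold per sample of the train/validation pool
for fold, (train_idx, val_idx) in enumerate(LeaveOneOut().split(train_val), start=1):
    print(f"# Fold {fold}")
    print(train_val[train_idx].tolist(), "# train")
    print(train_val[val_idx].tolist(), "# validation")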

Great, now the model has been built and validated using each data point of the combined train and validation set once.

Next, I would test my model on the test set (10) and get some performance estimate.

What I was wondering now is: why do we not also perform CV over the test set and average the results to see the impact of different test sets? In other words, why don't we repeat the above process 10 times, so that each data point also ends up in the test set once?

This would obviously be computationally extremely expensive, but I was thinking about it because it seems difficult to choose an appropriate test set. For example, my model from above might have performed very differently if I had chosen 1 as the test set and trained and validated on the remaining points.
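
What this amounts to is a nested scheme: an outer leave-one-out loop over test points wrapped around the inner validation loop shown above. A rough skeleton of that idea, assuming scikit-learn, with the actual model fitting and scoring left as placeholders:

import numpy as np
from sklearn.model_selection import LeaveOneOut

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
outer = LeaveOneOut()          # every point takes one turn as the test set
test_scores = []

for train_val_idx, test_idx in outer.split(data):
    train_val, test = data[train_val_idx], data[test_idx]
    # inner validation folds, exactly as in the listing above
    for train_idx, val_idx in LeaveOneOut().split(train_val):
        pass  # placeholder: train on train_val[train_idx], validate on train_val[val_idx]
    # placeholder: refit the selected model on all of train_val, score it on `test`,
    # and append that score to test_scores
    print("test set:", test.tolist())

# averaging test_scores would give the "performance across different test sets"
# the question asks about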

I wondered about this in scenarios where I have groups in my data. For example

  • [1,2,3,4] comes from group A,
  • [5,6,7,8] comes from group B and
  • [9,10] comes from group C.

In this case, when choosing 10 as the test set, the model could perform very differently than when choosing 1, right? Or am I missing something here?


Solution

  • All your train-validation-test splits should be randomly sampled and sufficiently big. Hence, if your data comes from different groups, you should have roughly the same distribution of groups across the train, validation and test pools (see the first sketch below). If your test performance varies based on the sampling seed, you're definitely doing something wrong.

    As to why not to use the test set for cross-validation: this would result in overfitting. Usually you run your cross-validation many times with different hyperparameters and use the CV score to select the best model (see the second sketch below). If you don't have a separate test set to evaluate your model at the end of model selection, you will never know whether you overfitted to the training pool during the model selection iterations.
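
On the first point (keeping the group distribution similar across the pools), a minimal sketch of stratified splitting, assuming scikit-learn and made-up group labels, since the 10-point toy set is too small to stratify:

from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Assumed synthetic group labels: 100 samples with the same 4:4:2 ratio of A:B:C
groups = ["A"] * 40 + ["B"] * 40 + ["C"] * 20
X = np.arange(len(groups))     # stand-in for the actual features

# 80/10/10 split, stratified on the group label so all three pools share the same mix
X_rest, X_test, g_rest, g_test = train_test_split(
    X, groups, test_size=0.1, stratify=groups, random_state=0)
X_train, X_val, g_train, g_val = train_test_split(
    X_rest, g_rest, test_size=1 / 9, stratify=g_rest, random_state=0)

print(Counter(g_train), Counter(g_val), Counter(g_test))
# each pool should show roughly the 4:4:2 A:B:C ratio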
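
On the second point (keeping the test set out of cross-validation), a minimal sketch of the usual pattern, assuming scikit-learn and a toy classification dataset: hyperparameters are selected with CV on the train+validation pool only, and the held-out test set is scored exactly once at the end.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy data; any supervised dataset would do here
X, y = make_classification(n_samples=200, random_state=0)

# The test set is carved off once and not touched during model selection
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Cross-validation (here a small grid search) runs only on the train+validation pool
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_pool, y_pool)

# The held-out test set is used exactly once, after model selection is finished
print("best CV score:   ", search.best_score_)
print("final test score:", search.score(X_test, y_test))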