classification training-data train-test-split

Distribution of training, validation, and test sets?

I want to ask how images should be distributed across the training, validation, and test sets. Let's assume I want to build a binary ResNet classifier with two classes, 'cat' and 'dog'.

Assume the images in each class are named:

cat: a, b, c, d, e

dog: f, g, h, i, j

Assume that I want 2 images from each class in the test set and 1 image from each class in the validation set.

Which of these two schemes is correct?

scheme 1

test set

cat: a, b

dog: i, j

train set

cat: c, d

dog: f, g

val set

cat: e

dog: h

scheme 2

test set

cat: a, b

dog: i, j

train set

cat: c, d, e

dog: f, g, h

val set

cat: e

dog: h

What confuses me is this: is the validation set also part of the training set, as in scheme 2, or is it separate from the training set, as in scheme 1? Thanks for the help.

Solution

Training, Validation, Testing Sets - These three sets have to be completely disjoint. No image may appear in more than one of them during a given training run, so the validation set is separate from the training set, as in your scheme 1.
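As a quick illustration of the disjointness requirement, here is a minimal sketch that checks both schemes from the question for images shared between sets. The helper name `find_overlaps` and the dictionary layout are made up for this example; they are not part of any library.

```python
from itertools import combinations

# Image names are taken from the question; each scheme maps split name -> images.
scheme_1 = {
    "train": {"c", "d", "f", "g"},
    "val": {"e", "h"},
    "test": {"a", "b", "i", "j"},
}
scheme_2 = {
    "train": {"c", "d", "e", "f", "g", "h"},
    "val": {"e", "h"},
    "test": {"a", "b", "i", "j"},
}


def find_overlaps(splits):
    """Return {(split_a, split_b): shared_images} for every pair of splits that overlaps.

    Hypothetical helper for illustration only.
    """
    overlaps = {}
    for (name_a, set_a), (name_b, set_b) in combinations(splits.items(), 2):
        shared = set_a & set_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


print("scheme 1 overlaps:", find_overlaps(scheme_1))
print("scheme 2 overlaps:", find_overlaps(scheme_2))
```

Scheme 1 reports no overlaps, while scheme 2 reports that images e and h sit in both the training and validation sets, which is exactly the leak to avoid.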

Training Set is used to train the model, i.e., learn the weights.

Validation Set is used to tune the hyperparameters and to compare candidate models, based on performance on images the model was not trained on. Only after a satisfactory model has been reached is the test set brought into the picture.

Test Set is like a big surprise, the real showcase. It is not looked at until the model is final, so it shows how the learned model truly performs on unseen images.

How to separate them? Split the images randomly, and keep the class proportions the same in each set (a stratified split). In your example that means every set contains the same number of cats and dogs.
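One possible way to do such a random, per-class split is sketched below using only the Python standard library. The function name `split_per_class`, its parameters, and the fixed seed are assumptions made for this example, not an established API.

```python
import random


def split_per_class(images_by_class, n_test, n_val, seed=0):
    """Randomly split each class into disjoint test, val, and train lists.

    Hypothetical helper: takes n_test and n_val images from every class,
    and whatever is left becomes the training data for that class.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": {}, "val": {}, "test": {}}
    for label, images in images_by_class.items():
        if len(images) <= n_test + n_val:
            raise ValueError(f"class {label!r} has too few images to leave any for training")
        shuffled = list(images)  # copy, so the caller's list is not modified
        rng.shuffle(shuffled)
        splits["test"][label] = shuffled[:n_test]
        splits["val"][label] = shuffled[n_test:n_test + n_val]
        splits["train"][label] = shuffled[n_test + n_val:]
    return splits


# The ten images from the question.
images_by_class = {
    "cat": ["a", "b", "c", "d", "e"],
    "dog": ["f", "g", "h", "i", "j"],
}
splits = split_per_class(images_by_class, n_test=2, n_val=1)
for name, per_class in splits.items():
    print(name, per_class)
```

Because each class is shuffled and sliced separately, every set gets the same number of cats and dogs, and no image can land in two sets.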

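For training you may also look into cross-validation. The non-test data is split into k folds, and each fold takes a turn as the validation set while the model is trained on the remaining folds. This reduces the bias that can come from relying on one particular train/validation split. The test set still stays separate throughout.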
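The fold rotation can be sketched roughly as follows. This is a plain, non-stratified k-fold written for illustration; `k_fold_indices` is a hypothetical helper, and the pool of six images assumes that a, b, i, and j were already set aside as the test set.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold.

    Hypothetical helper: every index is used for validation exactly once,
    and within a fold the train and validation indices never overlap.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


# Non-test images only (assumption: a, b, i, j are held out as the test set).
pool = ["c", "d", "e", "f", "g", "h"]
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(len(pool), k=3)):
    train_images = [pool[i] for i in train_idx]
    val_images = [pool[i] for i in val_idx]
    print(f"fold {fold}: train={train_images} val={val_images}")
```

An image that is used for validation in one fold is used for training in the others, but never both within the same fold. With so few images per class, a stratified variant that balances cats and dogs in each fold would be preferable in practice.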