python machine-learning keras

Is it ok to have the training history very similar to the validation history?


I trained a model for 50 epochs, splitting the dataset in the following proportions:

  • X_train, Y_train = 70%
  • X_validation, Y_validation = 20%
  • X_test, Y_test = 10%

All the splits are done using scikit-learn's train_test_split function (shuffle=True is its default):

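import numpy as np
from sklearn.model_selection import train_test_split

X = np.load(....)
Y = np.load(....)

# Split off the validation set: 20% of all samples
N_validation = int(len(X) * 0.2)
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=N_validation)

# Split the remaining training data once more to get the test set.
# Note: this takes 10% of the remaining 80%, i.e. about 8% of all samples.
N_test = int(len(X_train) * 0.1)
X_train, X_test, Y_train, Y_test = train_test_split(X_train, Y_train, test_size=N_test)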
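The sizes produced by a two-step split like the one above can be checked with plain arithmetic. The sketch below is only illustrative: the helper name split_sizes and the sample count of 1000 are assumptions, not part of the original code. It mirrors the same int(...) rounding and shows that taking 10% of the remaining training data yields roughly a 72/20/8 split rather than 70/20/10.

```python
def split_sizes(n_samples, val_frac=0.2, test_frac=0.1):
    """Return (n_train, n_val, n_test) for a two-step split.

    The validation set is sized from the full dataset, the test set
    from what is left after the validation set has been removed.
    """
    n_val = int(n_samples * val_frac)
    n_remaining = n_samples - n_val
    n_test = int(n_remaining * test_frac)
    n_train = n_remaining - n_test
    return n_train, n_val, n_test


# Hypothetical dataset of 1000 samples
n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)  # 720 200 80
```

To get an exact 70/20/10 split, the test size would have to be computed from the full dataset length instead of from the reduced training set.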

Here is the history plot.

As you can see from the history, the validation accuracy/loss is very similar to the training accuracy/loss. Sometimes the validation loss is even lower than the training loss. Regarding this last point, I read here that it could be caused by a high dropout value. This could be the case, since I have a dropout layer with rate=0.3. What I don't understand is whether this is a problem or not.

Testing the model on the test set, I get an accuracy of 91%.


Solution

  • In short, this is not a problem; it is actually a good sign. The goal of a machine learning pipeline is good accuracy on the test set, and there are two typical situations in which test accuracy falls short.

    Underfitting is when the model is not complex enough to capture the relationship between input and output, so it does poorly on both the training and validation sets, with high loss and low accuracy.

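    Overfitting is when the model performs well on the training set but badly on the validation/test set, so the validation loss ends up clearly higher than the training loss. What you mentioned, 'validation loss is even lower than the training loss', is the opposite situation and is not a sign of overfitting. It is commonly seen when dropout is used, because dropout is active during training but disabled during evaluation. Overfitting is often addressed by reducing model complexity or by using regularization methods such as dropout.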

    More information on underfitting/overfitting and ways to resolve them can easily be found in blogs. It is a good sign that your model performs well on both the training and validation sets.

    However, one concern is that training and test samples may get mixed up if the data is shuffled and split more than once (for example, in separate runs) without a constant random seed, since samples used for training in one run could then land in the test set of another. If that's not the case, don't worry!
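One way to rule out that concern is to fix the seed and check that the splits do not overlap. The following is a minimal sketch, not the asker's actual pipeline: the helper split_indices, the seed value 42, and the sample count of 1000 are assumptions made for illustration. It splits row indices instead of the data itself and, unlike the code in the question, sizes the test set from the full dataset to get 70/20/10.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # hypothetical value; any fixed integer works


def split_indices(n_samples, seed=SEED):
    """Split row indices into train/validation/test (70/20/10)."""
    indices = np.arange(n_samples)
    n_val = int(n_samples * 0.2)
    n_test = int(n_samples * 0.1)
    # A fixed random_state makes each shuffle repeatable across runs
    train_idx, val_idx = train_test_split(
        indices, test_size=n_val, random_state=seed)
    train_idx, test_idx = train_test_split(
        train_idx, test_size=n_test, random_state=seed)
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 200 100

# No sample may appear in more than one split
assert np.intersect1d(train_idx, val_idx).size == 0
assert np.intersect1d(train_idx, test_idx).size == 0
assert np.intersect1d(val_idx, test_idx).size == 0
```

The index arrays can then be used to select rows, e.g. X[train_idx] and Y[train_idx], and saving them alongside the model is one way to make sure the same test set is used in a later session.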