Difference between doing cross-validation and validation_data/validation_split in Keras


First, I split the dataset into train and test, for example:

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=999)

I then use GridSearchCV with cross-validation to find the best performing model:

validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=cv)

And by doing this, I have:

A model is trained using k-1 of the folds as training data; the resulting model is validated on the remaining part of the data (scikit-learn.org)
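To make the quoted sentence concrete, here is a minimal sketch of how k-fold cross-validation partitions sample indices. It is plain Python written for illustration, not scikit-learn's implementation; the helper name `k_fold_indices` is hypothetical, and it assumes contiguous, unshuffled folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, holdout_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        holdout = list(range(start, stop))
        # Everything outside the current fold is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, holdout
        start = stop


folds = list(k_fold_indices(n_samples=10, k=5))
for train, holdout in folds:
    print("train:", train, "hold-out:", holdout)
```

Each sample lands in the hold-out fold exactly once, which is why the hold-out set differs at every cross-validation step.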

But then, when reading about the Keras fit function, I found that the documentation introduces two more terms:

validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.

validation_data: tuple (x_val, y_val) or tuple (x_val, y_val, val_sample_weights) on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. validation_data will override validation_split.
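The behaviour described for validation_split (taking the last samples, before shuffling) can be sketched in plain Python. This only mimics the documented behaviour and is not Keras code; the helper name `split_like_validation_split` is hypothetical.

```python
def split_like_validation_split(x, y, validation_split):
    """Hold out the last fraction of (x, y), taken before any shuffling."""
    if not 0.0 <= validation_split < 1.0:
        raise ValueError("validation_split must be in [0, 1)")
    # Index where the training part ends and the validation part begins.
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


x = list(range(10))
y = [v % 2 for v in x]
(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.2)
print("train:", train_x, "validation:", val_x)
```

Because the held-out part is always the tail of the data, it stays the same across epochs, and with validation_split=0 it is empty. One consequence is that if the data is sorted by label, the tail may not be representative of the whole dataset.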

From what I understand, validation_split (which is overridden by validation_data) defines a fixed validation set, whereas the hold-out set in cross-validation changes at each cross-validation step.

  • Question 1: Is it necessary to use validation_split or validation_data, since I already do cross-validation?
  • Question 2: If it is not necessary, should I set validation_split and validation_data to 0 and None, respectively?

    grid_result = validator.fit(train_images, train_labels, validation_data=None, validation_split=0)
    
  • Question 3: If I do so, what will happen during the training, would Keras just simply ignore the validation step?

  • Question 4: Does the validation_split portion belong to the k-1 folds or to the hold-out fold, or will it be treated as a "test set" (as in cross-validation) that is never used to train the model?


Solution

  • Validation is performed to check that the model is not overfitting the training data and that it will generalize to new data. Since the parameter grid search already performs validation through cross-validation, there is no need for the Keras model to perform its own validation step during training. Therefore, to answer your questions:

    is it necessary to use validation_split or validation_data since I already do cross validation?

    No, as I mentioned above.

    if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?

    No, you don't need to set them explicitly, since by default no validation is done in Keras (i.e. the fit() method defaults to validation_split=0.0 and validation_data=None).

    If I do so, what will happen during the training, would Keras just simply ignore the validation step?

    Yes, Keras won't perform validation when training the model. Note, however, that the grid search procedure still performs validation, as mentioned above, to better estimate the performance of the model with a specific set of parameters.
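    To illustrate how the two kinds of validation relate, here is a minimal sketch in plain Python that calls neither Keras nor scikit-learn. It assumes the Keras model is wrapped so that the grid search calls fit() on the k-1 training folds and forwards validation_split to it; under that assumption, the Keras validation samples are carved out of the training folds, and the hold-out fold is used only for scoring. The function name `cv_with_inner_validation` and the dictionary keys are hypothetical.

```python
def cv_with_inner_validation(n_samples, k, validation_split):
    """Return, per fold, the indices used for fitting, per-epoch validation, and scoring."""
    fold_size = n_samples // k  # assumes n_samples is divisible by k, for brevity
    plan = []
    for i in range(k):
        holdout = list(range(i * fold_size, (i + 1) * fold_size))
        train_folds = [j for j in range(n_samples) if j not in holdout]
        # validation_split applies to whatever fit() receives: here, the k-1 training folds.
        split_at = int(len(train_folds) * (1.0 - validation_split))
        plan.append({
            "fit": train_folds[:split_at],               # used to update the weights
            "keras_validation": train_folds[split_at:],  # monitored after each epoch
            "cv_holdout": holdout,                       # scored by the grid search
        })
    return plan


plan = cv_with_inner_validation(n_samples=10, k=5, validation_split=0.25)
for step in plan:
    print(step)
```

    With validation_split=0.0 (the default), "keras_validation" is empty and all k-1 training folds are used for fitting, which is the setup recommended above.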