I understand why a tool like GridSearchCV refits. It explores a range of hyperparameter values and, after comparing scores, refits an estimator on the whole dataset using the best parameters found.
But while that makes sense, my question is about cross_validate, where only a single set of hyperparameters is used. My understanding is that its purpose is to see how well the model generalises over different folds of train/test splits. Why is a refit used here?
I understand why n fits happen on the n folds of data. But according to the documentation, a refit also happens, as mentioned in the description of the error_score parameter:
error_score : ‘raise’ or numeric
Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
So on top of the n fits, there is an extra fit, and I don't understand why this happens. There is no predict method here, so even if cross_validate somehow differentiated between the fold models and selected a 'best' one (despite them all having the exact same hyperparameters), there would be no point in refitting.
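For what it's worth, the only thing I ever see it return is the results dict, one entry per fold; a throwaway example on synthetic data (with Ridge standing in for any estimator) shows what I mean:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
# cross_validate only returns per-fold results; there is no best_estimator_
res = cross_validate(Ridge(), X, y, cv=5, return_estimator=True)
print(res['test_score'])      # one score per fold
print(len(res['estimator']))  # the 5 per-fold estimators, nothing refit on all the data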
To demonstrate this, I created an MLPRegressor model that I knew, combined with my dataset, would suffer from exploding gradients:
from sklearn.neural_network import MLPRegressor

# deep ReLU network, SGD with a slowly decaying learning rate
DL = MLPRegressor(
    hidden_layer_sizes=(200, 200, 200), activation='relu', max_iter=16,
    solver='sgd', learning_rate='invscaling', power_t=0.9)
DL.fit(df_training[predictor_cols], df_training[target_col])
The model fits without error (proving there are no NaN or inf values in my dataset) but does give the warning:
RuntimeWarning: overflow encountered in matmul
This is evidence of the exploding gradients, and the output of any prediction is therefore NaN.
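A quick check along these lines (using the fitted DL and the same dataframe columns as above) confirms that:

import numpy as np

# every prediction comes out as NaN once the weights have overflowed
preds = DL.predict(df_training[predictor_cols])
print(np.isnan(preds).all())  # True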
From my understanding of the cross_validate documentation, if I pass the following (with error_score=1):
from sklearn.model_selection import cross_validate

DL = MLPRegressor(
    hidden_layer_sizes=(200, 200, 200), activation='relu', max_iter=16,
    solver='sgd', learning_rate='invscaling', power_t=0.9)
DL_CV = cross_validate(DL, df_training[predictor_cols], y=df_training[target_col],
                       cv=None, n_jobs=1, pre_dispatch=5, return_train_score=False,
                       return_estimator=True, error_score=1)
I should get a FitFailedWarning but no error. However, the run doesn't finish; instead the following error is raised:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
This therefore leads me to conclude that the error is due to the refit, but then I don't know what the purpose of the refit is ...
cross_validate does not refit, as you can verify from the source code. The documentation is incorrect, presumably having been copied from that of GridSearchCV. You should open an Issue or make a pull request; if you'd rather not, I can.
I don't know the source of your final error though; perhaps it gets raised while scoring a successfully fitted model, rather than during fitting? If the fit itself only raises a warning, then error_score won't catch anything, since it only applies to errors raised during fitting.
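Very roughly, the per-fold logic looks something like the sketch below (a simplified illustration, not the actual scikit-learn source); note that the try/except governed by error_score wraps only the fit, so a ValueError raised by the scorer on NaN predictions would still propagate:

import warnings
from sklearn.base import clone
from sklearn.exceptions import FitFailedWarning

def fit_and_score_sketch(estimator, X, y, train_idx, test_idx, scorer, error_score=1):
    # fresh, unfitted copy of the estimator for this fold -- no refit anywhere
    est = clone(estimator)
    try:
        est.fit(X[train_idx], y[train_idx])
    except Exception as exc:
        # error_score only covers exceptions raised during fitting
        if error_score == 'raise':
            raise
        warnings.warn(f"Estimator fit failed: {exc}", FitFailedWarning)
        return error_score
    # scoring happens outside the try/except, so an error here (e.g. the scorer
    # choking on NaN predictions) is raised regardless of error_score
    return scorer(est, X[test_idx], y[test_idx])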