Tags: python, scikit-learn, cross-validation

Cross-validation in scikit-learn: mean absolute error of (X_test, y_test)


Usually we split the original feature and target data (X, y) into (X_train, y_train) and (X_test, y_test).

By using the method:

mae_A = cross_val_score(clf, X_train_scaled, y_train, scoring="neg_mean_absolute_error", cv=kfold)

I get the cross validation Mean Absolute Error (MAE) for the (X_train, y_train), right?

How can I get the MAE on (X_test, y_test) from the models obtained in the previous cross-validation on (X_train, y_train)?
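For context, here is a minimal, self-contained version of the setup described above (the synthetic data and the RandomForestRegressor are illustrative assumptions; the original `clf` and data are not shown in the question):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the original (X, y)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale using statistics computed from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

clf = RandomForestRegressor(random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# "neg_mean_absolute_error" returns negated MAE; flip the sign to report MAE
mae_A = -cross_val_score(clf, X_train_scaled, y_train,
                         scoring="neg_mean_absolute_error", cv=kfold)
print(mae_A.mean())
```

Note that `mae_A` holds one score per fold; its mean is the cross-validated MAE on the training data only.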


Solution

  • This is the correct approach. As a rule, you should train your model using training data only, so the test set must remain unseen throughout the cross-validation process, including any hyperparameter tuning; otherwise you would bias the results by leaking knowledge from the test sample into the model.

    I get the cross validation Mean Absolute Error (MAE) for the (X_train, y_train), right?

    Yes, the error reported by cross_val_score comes only from the training data. The idea is that once you are satisfied with the cross-validation results, you fit the final model on the whole training set and make predictions on the test set. You can then compute any metric from sklearn.metrics. For instance, to obtain the MAE:

    from sklearn.metrics import mean_absolute_error as mae

    clf.fit(X_train_scaled, y_train)   # final fit on the whole training set
    y_pred = clf.predict(X_test)       # scale X_test the same way if clf was trained on scaled features
    MAE = mae(y_test, y_pred)
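Putting the two steps together, here is a runnable end-to-end sketch (the synthetic data and Ridge estimator are illustrative assumptions). One detail worth stressing: the test features must be transformed with the scaler fitted on the training set, never refitted on the test set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data in place of the original (X, y)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler().fit(X_train)      # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuse the training statistics

clf = Ridge()
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Step 1: cross-validated MAE on the training set (sign flipped)
cv_mae = -cross_val_score(clf, X_train_scaled, y_train,
                          scoring="neg_mean_absolute_error", cv=kfold)

# Step 2: final fit on the whole training set, then MAE on the held-out test set
clf.fit(X_train_scaled, y_train)
test_mae = mae(y_test, clf.predict(X_test_scaled))
print(cv_mae.mean(), test_mae)
```

The cross-validated MAE guides model selection; the test-set MAE, computed once at the end, is the unbiased estimate of generalization error.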