Tags: python, machine-learning, scikit-learn, cross-validation

scikit-learn scores are different when using cross_val_predict vs cross_val_score


I expected both methods to return rather similar errors; can someone point me to my mistake, please?

Calculating RMSE...

from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score

rf = RandomForestRegressor(random_state=555, n_estimators=100, max_depth=8)
rf_preds = cross_val_predict(rf, train_, targets, cv=7, n_jobs=7)
print("RMSE Score using cv preds: {:0.5f}".format(metrics.mean_squared_error(targets, rf_preds, squared=False)))

scores = cross_val_score(rf, train_, targets, cv=7, scoring='neg_root_mean_squared_error', n_jobs=7)
print("RMSE Score using cv_score: {:0.5f}".format(scores.mean() * -1))

RMSE Score using cv preds: 0.01658
RMSE Score using cv_score: 0.01073

Solution

    There are two issues here, both of which are mentioned in the documentation of cross_val_predict:

    Results can differ from cross_validate and cross_val_score unless all test sets have equal size and the metric decomposes over samples.

    The first issue is that the folds (and hence the test sets) must be identical in both cases, which is not guaranteed in your example. To ensure this, we use KFold to define our CV folds explicitly and then pass these same folds to both functions. Here is an example with dummy data:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    
    X, y = make_regression(n_samples=2000, n_features=4, n_informative=2,
                          random_state=42, shuffle=False)
    
    rf = RandomForestRegressor(max_depth=2, random_state=0)
    kf = KFold(n_splits=5)  # the same splitter is passed to both functions below
    
    # pooled out-of-fold predictions, scored once over all samples
    rf_preds = cross_val_predict(rf, X, y, cv=kf, n_jobs=5)
    print("RMSE Score using cv preds: {:0.5f}".format(mean_squared_error(y, rf_preds, squared=False)))
    
    # RMSE computed per fold, then averaged over the folds
    scores = cross_val_score(rf, X, y, cv=kf, scoring='neg_root_mean_squared_error', n_jobs=5)
    print("RMSE Score using cv_score: {:0.5f}".format(scores.mean() * -1))
    

    The result of the above code snippet (fully reproducible, since we have explicitly set all the necessary random seeds) is:

    RMSE Score using cv preds: 15.16839
    RMSE Score using cv_score: 15.16031
    

    So, we can see that the two scores are indeed similar, but still not identical.

    Why is that? The answer lies in the rather cryptic second part of the quoted sentence above: the RMSE score does not decompose over samples (unlike metrics that are plain per-sample averages, such as the MSE, which do).

    In simple words, while cross_val_predict computes the RMSE strictly according to its definition, i.e. (pseudocode):

    RMSE = square_root([(y[1] - y_pred[1])^2 + (y[2] - y_pred[2])^2 + ... + (y[n] - y_pred[n])^2]/n)
    

    where n is the number of samples. The cross_val_score method does not do exactly that; instead, it computes the RMSE for each of the k CV folds and then averages these k values, i.e. (pseudocode again):

    RMSE = (RMSE[1] + RMSE[2] + ... + RMSE[k])/k
    

    And exactly because the RMSE is not decomposable over the samples, these two values, although close, are not identical.
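
    For intuition, here is a tiny numeric sketch (with made-up per-sample errors, not taken from the run above) contrasting the two aggregations:

    import numpy as np
    
    # made-up absolute errors for two equal-sized folds (illustration only)
    fold1_err = np.array([0.0, 2.0])
    fold2_err = np.array([2.0, 2.0])
    
    # cross_val_score style: RMSE per fold, then averaged over folds
    rmse_per_fold = [np.sqrt(np.mean(fold1_err ** 2)),
                     np.sqrt(np.mean(fold2_err ** 2))]
    print(np.mean(rmse_per_fold))             # (1.41421 + 2.0) / 2 = 1.70711
    
    # cross_val_predict style: one RMSE over the pooled errors
    pooled_err = np.concatenate([fold1_err, fold2_err])
    print(np.sqrt(np.mean(pooled_err ** 2)))  # sqrt(12 / 4) = 1.73205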

    We can actually demonstrate that this is indeed the case by performing the CV procedure manually and emulating the RMSE calculation done by cross_val_score, as described above:

    import numpy as np
    
    rmse_cv_score = []
    
    # same folds as before; compute the RMSE of each fold separately
    for train_index, val_index in kf.split(X):
        rf.fit(X[train_index], y[train_index])
        pred = rf.predict(X[val_index])
        err = mean_squared_error(y[val_index], pred, squared=False)
        rmse_cv_score.append(err)
    
    # ...and average the per-fold values, as cross_val_score does
    print("RMSE Score using manual cv_score: {:0.5f}".format(np.mean(rmse_cv_score)))
    

    The result being:

    RMSE Score using manual cv_score: 15.16031
    

    i.e. identical to the one returned by cross_val_score above.
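
    Conversely, if we pool the per-fold predictions from the same manual loop and score them once, we should recover the cross_val_predict figure; a quick sketch reusing the objects defined above:

    # pool the out-of-fold predictions into a single array, then score once
    pooled_preds = np.empty_like(y)
    for train_index, val_index in kf.split(X):
        rf.fit(X[train_index], y[train_index])
        pooled_preds[val_index] = rf.predict(X[val_index])
    
    print("RMSE using pooled manual preds: {:0.5f}".format(
        mean_squared_error(y, pooled_preds, squared=False)))
    # should print 15.16839, matching cross_val_predict above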

    So, to be very precise, the correct RMSE (i.e. the one calculated exactly according to its definition) is the one returned by cross_val_predict; cross_val_score returns an approximation of it. In practice the difference is often not significant, so cross_val_score can also be used when it is more convenient.
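
    As a final note tying back to the documentation quote: with a metric that does decompose over samples, such as the plain MSE, and equal-sized test sets (here 2000 samples split into 5 folds of 400), the two aggregations should coincide up to floating-point rounding. A minimal sketch, reusing rf_preds, kf, X and y from above:

    # MSE is a per-sample average, so with equal-sized folds the mean of the
    # per-fold MSEs equals the MSE of the pooled predictions
    mse_pooled = mean_squared_error(y, rf_preds)
    mse_folds = -cross_val_score(rf, X, y, cv=kf,
                                 scoring='neg_mean_squared_error', n_jobs=5).mean()
    print("{:0.5f} vs {:0.5f}".format(mse_pooled, mse_folds))  # expected to match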