python machine-learning scikit-learn metrics cross-validation

How a metric computed with cross_val_score can differ from the same metric computed starting from cross_val_predict?

How a metric computed with cross_val_score can differ from the same metric computed starting from cross_val_predict (used to obtain predictions to be then given to a metric function)?

Here is an example:

from sklearn import cross_validation
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB


iris = datasets.load_iris()

gnb_clf = GaussianNB()
#  compute mean accuracy with cross_val_predict
predicted = cross_validation.cross_val_predict(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvp = metrics.accuracy_score(iris.target, predicted)
#  compute mean accuracy with cross_val_score
score_cvs = cross_validation.cross_val_score(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvs = score_cvs.mean()

print('Accuracy cvp: %0.8f\nAccuracy cvs: %0.8f' % (accuracy_cvp, accuracy_cvs))

In this case, we obtain the same result:

Accuracy cvp: 0.95333333
Accuracy cvs: 0.95333333

Nevertheless, this seems not to be always the case, as on the official documentation it is written (regarding a result computed using cross_val_predict):

Note that the result of this computation may be slightly different from those obtained using cross_val_score as the elements are grouped in different ways.

Solution

Imagine following labels and splitting

[010|101|10]

So you have 8 data points, 4 per class and you split it to 3 folds, leading to 2 folds with 3 elements and one with 2. Now let us assume that during cross validation you get following preds

[010|100|00]

thus, your scores are [100%, 67%, 50%], and cross val score (as an average) is around 72%. Now what about accuracy over predictions? You clearly have 6/8 things right, thus 75%. As you can see the scores are different, even thoug they both rely on cross validation. Here, the difference arises because the splits are not exactly the same size, thus this last "50%" is actually lowering total score because it is an avergae over just 2 samples (and the rest are based on 3).

There might be other similar phenomena, in general - it should boil down to the way averaging is computed. Thus - cross val score is an average over averages, which does not have to be an average over cross validation predictions.