Tags: python, scikit-learn, cross-validation, confusion-matrix, k-fold

sklearn.model_selection.cross_val_score has different results from a manual calculation done on a confusion matrix


TL;DR: When I calculate precision, recall, and F1 with cross_val_score(), I get different results than when I calculate them from the confusion matrix. Why are the precision, recall, and F1 scores different?

I'm learning SVM in machine learning and I wanted to compare the result returned by cross_val_score with the result I get from manually calculating the metrics from the confusion matrix. However, I get different results.

To start, I have written the code below using cross_val_score.

from sklearn import svm, metrics
from sklearn.model_selection import KFold, cross_val_score

clf = svm.SVC()
kfold = KFold(n_splits = 10)

accuracy = metrics.make_scorer(metrics.accuracy_score)
precision = metrics.make_scorer(metrics.precision_score, average = 'macro')
recall = metrics.make_scorer(metrics.recall_score, average = 'macro')
f1 = metrics.make_scorer(metrics.f1_score, average = 'macro')

accuracy_score = cross_val_score(clf, X, y, scoring = accuracy, cv = kfold)
precision_score = cross_val_score(clf, X, y, scoring = precision, cv = kfold)
recall_score = cross_val_score(clf, X, y, scoring = recall, cv = kfold)
f1_score = cross_val_score(clf, X, y, scoring = f1, cv = kfold)

print("accuracy score:", accuracy_score.mean())
print("precision score:", precision_score.mean())
print("recall score:",recall_score.mean())
print("f1 score:", f1_score.mean())

The result for each metric is shown below:

accuracy score: 0.97
precision score: 0.96
recall score: 0.97
f1 score: 0.96

In addition, I created a confusion matrix so that I could manually calculate the accuracy, precision, recall, and F1 score from the values in the matrix. I built the confusion matrix myself because I am using k-fold cross-validation: to do that, I have to collect the actual classes and predicted classes for each iteration of the cross-validation, so I wrote this function:

import copy as cp
from typing import Tuple

import numpy as np
from sklearn.model_selection import KFold

def cross_val_predict(model, kfold : KFold, X : np.ndarray, y : np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    
    model_ = cp.deepcopy(model)
    
    # gets the number of classes in the column/attribute
    no_of_classes = len(np.unique(y))
    
    # initializing empty numpy arrays to be returned
    actual_classes = np.empty([0], dtype = int)
    predicted_classes = np.empty([0], dtype = int)

    for train_index, test_index in kfold.split(X):

        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        # append the actual classes for this iteration
        actual_classes = np.append(actual_classes, y_test)
        
        # fit the model
        model_.fit(X_train, y_train)
        
        # predict
        predicted_classes = np.append(predicted_classes, model_.predict(X_test))
        
    return actual_classes, predicted_classes
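
For reference, scikit-learn also ships a built-in sklearn.model_selection.cross_val_predict that returns out-of-fold predictions re-ordered to match y; for a KFold without shuffling it should produce the same pooled predictions as the helper above. A minimal sketch (aliased so it does not clash with the function defined here):

from sklearn.model_selection import cross_val_predict as sk_cross_val_predict

# out-of-fold predictions, aligned with the rows of X / y
predicted = sk_cross_val_predict(clf, X, y, cv = kfold)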

Afterwards, I created my confusion matrix after calling the above function.

import matplotlib.pyplot as plt

actual_classes, predicted_classes = cross_val_predict(clf, kfold, X, y)
cm = metrics.confusion_matrix(y_true = actual_classes, y_pred = predicted_classes)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = [2, 4])
cm_display.plot()
plt.show()

Now, my confusion matrix looks like the below:

where the columns are the predicted labels and the rows are the true labels:

             predicted 2   predicted 4
    true 2       431            13
    true 4         9           230

If I manually calculate the accuracy, precision, recall, and F1 score from that matrix, I get the following:

confusion matrix accuracy: 0.97
confusion matrix precision: 0.95
confusion matrix recall: 0.96
confusion matrix f1 score: 0.95
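
For reference, here is a minimal sketch of how the same kind of metrics can be computed programmatically from the pooled predictions (assuming actual_classes and predicted_classes returned by the function above) instead of being derived from the matrix by hand:

print("cm accuracy:", metrics.accuracy_score(actual_classes, predicted_classes))
print("cm precision:", metrics.precision_score(actual_classes, predicted_classes, average = 'macro'))
print("cm recall:", metrics.recall_score(actual_classes, predicted_classes, average = 'macro'))
print("cm f1 score:", metrics.f1_score(actual_classes, predicted_classes, average = 'macro'))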

My question is: why do I get different results when I manually calculate the metrics from the confusion matrix versus when I call cross_val_score with the corresponding scorers, i.e., [accuracy, precision, recall, fscore]?

I hope you guys can help me understand why. Thank you very much for your responses!


Solution

  • With cross_val_score you take the mean of the metrics calculated on each fold, whereas when you do it manually you concatenate the predictions of all folds before calculating the scores. Because of that, the F1 score and the precision change, while the accuracy and recall are not affected.

    Accuracy

    If n is the number of samples, k the number of folds, n_i the size of fold i, and c_i the number of correct predictions in fold i, then the two quantities being compared are:

    $$\underbrace{\frac{1}{k}\sum_{i=1}^{k}\frac{c_i}{n_i}}_{\text{mean of fold accuracies}} \qquad \text{vs.} \qquad \underbrace{\frac{1}{n}\sum_{i=1}^{k}c_i}_{\text{pooled accuracy}}$$

    From this equation, you can see that when every fold has the same size (n_i = n/k), averaging the per-fold accuracies is exactly equivalent to computing the global accuracy. That is no longer true when folds have different sizes. However, KFold makes fold sizes differ by at most one sample, and one sample is usually negligible compared to the dataset size, so the difference between the mean over folds and the accuracy computed over the pooled predictions is negligible.
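
    A quick numeric check of the equal-size case (toy numbers, not taken from the question's data):

    # correct predictions per fold; every fold contains 50 samples
    correct = [45, 48, 42]
    fold_size = 50

    averaged = sum(c / fold_size for c in correct) / len(correct)  # mean of fold accuracies
    pooled = sum(correct) / (fold_size * len(correct))             # accuracy over pooled predictions
    print(averaged, pooled)  # both print 0.9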

    Precision

    Now let's consider precision, which is the ratio of true positives to positive predictions (the same reasoning also applies to the F1 score, which is calculated from both precision and recall, and to ROC AUC).

    Let's say you have 3 folds:

    • Fold 1: 4 positive predictions, 1 true positive: precision=1/4

    • Fold 2: 2 positive predictions, 2 true positive: precision=1

    • Fold 3: 3 positive predictions, 1 true positive: precision=1/3

    Now if you take the average you get a precision of 19/36 ≈ 0.53. However, if you sum the numbers of positive predictions and true positives across folds first, you get 4/9 ≈ 0.44, which is quite different.

    The difference comes from the fact that the denominator, i.e. your number of positive predictions, is not constant over folds. Extreme values have more influence when averaging.
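
    A minimal sketch of this toy example in Python:

    # per-fold counts from the toy example above
    true_positives = [1, 2, 1]
    positive_predictions = [4, 2, 3]

    # mean of the per-fold precisions (what cross_val_score reports)
    averaged = sum(tp / pp for tp, pp in zip(true_positives, positive_predictions)) / 3

    # precision computed over the pooled predictions (what the confusion matrix gives)
    pooled = sum(true_positives) / sum(positive_predictions)

    print(averaged)  # 19/36 ≈ 0.528
    print(pooled)    # 4/9  ≈ 0.444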

    Recall

    Recall is the ratio of true positives to positive samples. If your k-fold is stratified, each fold contains (almost) the same number of positive samples, so the denominator is (nearly) constant across folds and the averaged value should be the same as your concatenated metric.
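
    If you want the averaged recall to behave this way, you can use a stratified splitter. A minimal sketch, reusing clf and the recall scorer from the question:

    from sklearn.model_selection import StratifiedKFold, cross_val_score

    skfold = StratifiedKFold(n_splits = 10)
    recall_scores = cross_val_score(clf, X, y, scoring = recall, cv = skfold)
    print(recall_scores.mean())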

    Which to use?

    This is a question I have not yet found a clear answer to. Most frameworks and definitions average the per-fold scores, but I have not found any comparison with computing the metrics over all the pooled predictions, so here are some personal observations:

    • Computing the metric over each fold seems to be the most commonly used method. The advantage is that you get not only the mean but also the std/quartiles of the metric (see the sketch after this list), which lets you properly assess the variability of the learning process, provided you have enough data.
    • For a low sample size, the evaluation folds can be very small (only a few samples). In that case the per-fold metrics are less stable (more prone to extreme values), and I find the average to still be sensitive to those extreme values. There I would recommend concatenating the predictions/targets and computing the score once, but again this is just a personal observation and opinion.
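
    As an illustration of the first point, the per-fold scores returned by cross_val_score make it easy to report spread as well as the mean. A minimal sketch, reusing clf, kfold, and the f1 scorer from the question:

    import numpy as np

    f1_scores = cross_val_score(clf, X, y, scoring = f1, cv = kfold)
    print("mean:", f1_scores.mean(), "std:", f1_scores.std())
    print("quartiles:", np.percentile(f1_scores, [25, 50, 75]))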