TL;DR: When I compute precision, recall, and F1 through cross-validation with cross_val_score(), I get different results than when I compute them from the confusion matrix. Why are the precision, recall, and F1 scores different?
I'm learning SVM in machine learning, and I wanted to compare the results returned by cross_val_score with the metrics I calculate manually from the confusion matrix. However, I get different results.
To start, I wrote the code below using cross_val_score:
from sklearn import svm, metrics
from sklearn.model_selection import KFold, cross_val_score

# X and y are the feature matrix and the class labels (2 and 4), loaded beforehand
clf = svm.SVC()
kfold = KFold(n_splits = 10)
accuracy = metrics.make_scorer(metrics.accuracy_score)
precision = metrics.make_scorer(metrics.precision_score, average = 'macro')
recall = metrics.make_scorer(metrics.recall_score, average = 'macro')
f1 = metrics.make_scorer(metrics.f1_score, average = 'macro')
accuracy_score = cross_val_score(clf, X, y, scoring = accuracy, cv = kfold)
precision_score = cross_val_score(clf, X, y, scoring = precision, cv = kfold)
recall_score = cross_val_score(clf, X, y, scoring = recall, cv = kfold)
f1_score = cross_val_score(clf, X, y, scoring = f1, cv = kfold)
print("accuracy score:", accuracy_score.mean())
print("precision score:", precision_score.mean())
print("recall score:", recall_score.mean())
print("f1 score:", f1_score.mean())
The result for each metric is shown below:
accuracy score: 0.97
precision score: 0.96
recall score: 0.97
f1 score: 0.96
In addition, I created a confusion matrix so that I can manually calculate the accuracy, precision, recall, and F1 score from its values. Because I am using k-fold cross-validation, I had to build the confusion matrix myself: I need the actual classes and the predicted classes from every iteration of the cross-validation, so I wrote this function:
import copy as cp
import numpy as np
from typing import Tuple
from sklearn.model_selection import KFold

def cross_val_predict(model, kfold : KFold, X : np.array, y : np.array) -> Tuple[np.array, np.array]:
    model_ = cp.deepcopy(model)
    # number of classes in the target column (kept for reference, not used below)
    no_of_classes = len(np.unique(y))
    # initialize empty numpy arrays to be returned
    actual_classes = np.empty([0], dtype = int)
    predicted_classes = np.empty([0], dtype = int)
    for train_index, test_index in kfold.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # append the actual classes for this fold
        actual_classes = np.append(actual_classes, y_test)
        # fit the model on the training folds
        model_.fit(X_train, y_train)
        # predict on the held-out fold and append
        predicted_classes = np.append(predicted_classes, model_.predict(X_test))
    return actual_classes, predicted_classes
Afterwards, I called the function above and built the confusion matrix from its output:
actual_classes, predicted_classes = cross_val_predict(clf, kfold, X, y)
cm = metrics.confusion_matrix(y_true = actual_classes, y_pred = predicted_classes)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = [2,4])
cm_display.plot()
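(As an aside, scikit-learn also provides a built-in sklearn.model_selection.cross_val_predict that returns the out-of-fold predictions already aligned with y, so the true labels are simply y itself; a minimal sketch, imported under an alias because it shares its name with the function defined above:)

from sklearn.model_selection import cross_val_predict as sk_cross_val_predict

# out-of-fold predictions, returned in the same order as X / y
predicted = sk_cross_val_predict(clf, X, y, cv = kfold)
cm = metrics.confusion_matrix(y_true = y, y_pred = predicted)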
Now, my confusion matrix looks like the one below (rows are the true labels, columns are the predicted labels):

            predicted 2    predicted 4
true 2          431             13
true 4            9            230
If I manually calculate the accuracy, precision, recall, and F1 score from that matrix, I get the following:
confusion matrix accuracy: 0.97
confusion matrix precision: 0.95
confusion matrix recall: 0.96
confusion matrix f1 score: 0.95
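For reference, the same pooled metrics can also be computed directly from the concatenated arrays, using the same average = 'macro' as in the scorers above (a sketch based on the actual_classes and predicted_classes returned earlier):

# pooled (concatenated) metrics, computed the same way as the scorers above
pooled_accuracy = metrics.accuracy_score(actual_classes, predicted_classes)
pooled_precision = metrics.precision_score(actual_classes, predicted_classes, average = 'macro')
pooled_recall = metrics.recall_score(actual_classes, predicted_classes, average = 'macro')
pooled_f1 = metrics.f1_score(actual_classes, predicted_classes, average = 'macro')
print("pooled accuracy:", pooled_accuracy)
print("pooled precision:", pooled_precision)
print("pooled recall:", pooled_recall)
print("pooled f1:", pooled_f1)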
My question is: why do I get different results when I manually calculate the metrics from the confusion matrix than when I call cross_val_score with the corresponding scorers, i.e., [accuracy, precision, recall, fscore]?
I hope you can help me understand why. Thank you very much for your responses!
With cross_val_score you take the mean of the metric computed on each fold separately, whereas in the manual approach you concatenate the predictions of all folds before computing the scores. Because of that, precision and the F1 score change, while accuracy and recall are essentially unaffected.
If n is the number of samples, k the number of folds, and n_i and accuracy_i the size and accuracy of fold i, then you can write:

    pooled accuracy = (1/n) * sum_{i=1..k} n_i * accuracy_i,    with sum_{i=1..k} n_i = n

while cross_val_score(...).mean() reports (1/k) * sum_{i=1..k} accuracy_i.
From this equation, you can see that when every fold has the same size, averaging the per-fold accuracies is equivalent to computing the pooled accuracy. That is no longer exactly true when the folds have different sizes, but since fold sizes differ by at most one sample, and one sample is usually negligible compared to the dataset size, the difference between the mean over folds and the accuracy computed over the whole set of concatenated predictions is negligible.
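To make that concrete, here is a small sketch with made-up fold accuracies (the fold sizes are what a 10-fold split of the 683 samples in the confusion matrix would give): the plain mean of the fold accuracies and the size-weighted (pooled) accuracy differ only slightly.

import numpy as np

fold_sizes = np.array([69, 69, 69, 68, 68, 68, 68, 68, 68, 68])  # 10-fold split of n = 683
fold_accuracies = np.array([0.99, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.97, 0.98, 0.97])  # hypothetical values

mean_over_folds = fold_accuracies.mean()                          # what cross_val_score(...).mean() reports
pooled = (fold_sizes * fold_accuracies).sum() / fold_sizes.sum()  # what the confusion matrix yields
print(mean_over_folds, pooled)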
Now let's consider precision, which is the ratio of true positives to positive predictions, TP / (TP + FP). (The same reasoning applies to the F1 score, which is built from precision and recall, and to other fold-dependent metrics such as ROC AUC.)
Let's say you have 3 folds:
Fold 1: 4 positive predictions, 1 true positive: precision=1/4
Fold 2: 2 positive predictions, 2 true positive: precision=1
Fold 3: 3 positive predictions, 1 true positive: precision=1/3
Now if you take the average you get a precision of 19/36 ≈ 0.53. However, if you sum the numbers of positive predictions and true positives over the folds, you get 4/9 ≈ 0.44, which is quite different.
The difference comes from the fact that the denominator, i.e. your number of positive predictions, is not constant over folds. Extreme values have more influence when averaging.
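The same three-fold example in code, just to make the arithmetic explicit:

import numpy as np

true_positives = np.array([1, 2, 1])         # per fold
positive_predictions = np.array([4, 2, 3])   # per fold

mean_of_fold_precisions = (true_positives / positive_predictions).mean()  # (1/4 + 1 + 1/3) / 3 ≈ 0.53
pooled_precision = true_positives.sum() / positive_predictions.sum()      # 4 / 9 ≈ 0.44
print(mean_of_fold_precisions, pooled_precision)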
Recall is the ratio of true positives to positive samples, TP / (TP + FN). If you stratified your k-fold, the number of positive samples is (almost) the same in every fold, so the averaged value should be the same as the metric computed on the concatenated predictions.
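(For completeness, stratification in scikit-learn means using StratifiedKFold instead of KFold; a minimal sketch, assuming the clf, X, y and recall scorer from the question. Note that StratifiedKFold.split needs y as well, so the manual loop would become kfold.split(X, y).)

from sklearn.model_selection import StratifiedKFold, cross_val_score

kfold = StratifiedKFold(n_splits = 10)
recall_scores = cross_val_score(clf, X, y, scoring = recall, cv = kfold)  # same recall scorer as in the question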
This is a question I have not yet found a clear answer to. Most frameworks and definitions use the average of the per-fold scores, but I have not found any comparison with computing the metrics over all predictions at once, so here are some personal observations: