I have set up a very simple SVC to classify the MNIST digits. For some reason, the classifier fairly consistently gets the digit 5 wrong, but for all the other digits it doesn't miss a single one. Does anyone have any idea whether I'm setting this up wrong, or whether it's just really bad at predicting the number 5?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
data = datasets.load_digits()
images = data.images
targets = data.target
# Split into train and test sets
images_train, images_test, imlabels_train, imlabels_test = train_test_split(images, targets, test_size=.2, shuffle=False)
# Re-shape data so that it's 2D
images_train = np.reshape(images_train, (np.shape(images_train)[0], 64))
images_test = np.reshape(images_test, (np.shape(images_test)[0], 64))
svm_classifier = SVC(gamma='auto').fit(images_train, imlabels_train)
number_correct_svc = 0
preds = []
for label_index in range(len(imlabels_test)):
    pred = svm_classifier.predict(images_test[label_index].reshape(1,-1))
    if pred[0] == imlabels_test[label_index]:
        number_correct_svc += 1
    preds.append(pred[0])
print("Support Vector Classifier...")
print(f"\tPercent correct for all test data: {100*number_correct_svc/len(imlabels_test)}%")
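# Note: sklearn's signature is confusion_matrix(y_true, y_pred); with preds passed first,
# rows are predicted labels and columns are true labels.
confusion_matrix(preds,imlabels_test)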
Here is the resulting confusion matrix:
array([[22, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 15, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 15, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 21, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 21, 0, 0, 0, 0, 0],
[13, 21, 20, 16, 16, 37, 23, 20, 31, 16],
[ 0, 0, 0, 0, 0, 0, 14, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 16, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 21]], dtype=int64)
I've been reading the sklearn page for SVC but can't tell what I'm doing wrong.
I tried using SVC(gamma='scale') and the result seems much more reasonable. It would still be nice to know why 'auto' doesn't work. With 'scale':
array([[34, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 36, 0, 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, 35, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 27, 0, 0, 0, 0, 0, 1],
[ 1, 0, 0, 0, 34, 0, 0, 0, 0, 0],
[ 0, 0, 0, 2, 0, 37, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 37, 0, 0, 0],
[ 0, 0, 0, 2, 0, 0, 0, 35, 0, 1],
[ 0, 0, 0, 6, 1, 0, 0, 1, 31, 1],
[ 0, 0, 0, 0, 2, 0, 0, 0, 1, 33]], dtype=int64)
The second question is much easier to deal with. In the RBF kernel, gamma controls how wiggly the decision boundary is. What do we mean by "wiggly"? The higher the value of gamma, the more local the influence of each training point, so the decision boundary bends more tightly around individual training points.
(Figure: decision boundary of the SVM.)
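To make "wiggly" a bit more concrete, here is a minimal sketch of the standard RBF kernel, exp(-gamma * ||x - x'||^2). The helper name rbf_kernel_value and the distances and gamma values are made up for illustration; they are not measured from the data in the question.

```python
import numpy as np

def rbf_kernel_value(sq_dist, gamma):
    """RBF similarity exp(-gamma * ||x - x'||^2) for a given squared distance."""
    return np.exp(-gamma * sq_dist)

# Illustrative squared distances between two samples (hypothetical values).
sq_dists = np.array([1.0, 100.0, 1000.0])

# One small and one large gamma, chosen only to show the contrast.
for gamma in (0.0005, 0.015):
    print(f"gamma={gamma}: {rbf_kernel_value(sq_dists, gamma)}")
```

With the larger gamma, the similarity drops to almost zero as soon as two points are moderately far apart, so each training point only influences its immediate neighborhood.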
If gamma='scale' (the default) is passed, SVC uses 1 / (n_features * X.var()) as the value of gamma; if gamma='auto', it uses 1 / n_features.
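The two settings therefore differ by a factor of X.var(). The pixel values in this digits dataset range from 0 to 16 and are not scaled, so X.var() is well above 1 and gamma='auto' gives a much larger gamma than gamma='scale'. With a gamma that large, the RBF kernel is so narrow that the model essentially memorizes the training points: a test image looks dissimilar to every support vector, and the predictions collapse onto a single class (5 in your case). The smaller gamma from 'scale' gives a smoother decision boundary that generalizes, which is why the second confusion matrix looks much more reasonable.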
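You can check this yourself with a short sketch that computes both gamma values by hand and compares test accuracy. It assumes the same unshuffled 80/20 split as in the question; the variable names (gamma_auto, gamma_scale, scores) are mine.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# return_X_y=True gives the images already flattened to shape (n_samples, 64).
X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Reproduce the two documented formulas on the training data.
n_features = X_train.shape[1]
gamma_auto = 1.0 / n_features
gamma_scale = 1.0 / (n_features * X_train.var())
print(f"gamma='auto'  -> {gamma_auto:.6f}")
print(f"gamma='scale' -> {gamma_scale:.6f}")

# Fit one classifier per setting and compare accuracy on the held-out data.
scores = {}
for setting in ("auto", "scale"):
    clf = SVC(gamma=setting).fit(X_train, y_train)
    scores[setting] = clf.score(X_test, y_test)
    print(f"gamma={setting!r}: test accuracy = {scores[setting]:.3f}")
```

If you would rather keep gamma='auto', scaling the features first (for example dividing the pixel values by 16, or using StandardScaler) should bring the two settings much closer together, since the variance term is what separates them.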