python scikit-learn svm multiclass-classification

Something wrong when implementing SVM One-vs-All in Python


I was trying to verify that I had correctly understood how SVM-OVA (One-versus-All) works by comparing scikit-learn's OneVsRestClassifier with my own implementation.

In the following code, I train num_classes binary classifiers in the training phase, then run all of them on the test set and select the class whose classifier returns the highest probability.

import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import scale

# Read dataset 
df = pd.read_csv('In/winequality-white.csv',  delimiter=';')
X = df.loc[:, df.columns != 'quality']
Y = df.loc[:, df.columns == 'quality']
my_classes = np.unique(Y)
num_classes = len(my_classes)

# Train-test split
np.random.seed(42)
msk = np.random.rand(len(df)) <= 0.8
train = df[msk]
test = df[~msk]

# From dataset to features and labels
X_train = train.loc[:, train.columns != 'quality']
Y_train = train.loc[:, train.columns == 'quality']
X_test = test.loc[:, test.columns != 'quality']
Y_test = test.loc[:, test.columns == 'quality']

# Models
clf =  [None] * num_classes
for k in np.arange(0,num_classes):
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced', probability=True)
    clf[k] = my_model.fit(X_train, Y_train==my_classes[k])

# Prediction
prob_table = np.zeros((len(Y_test), num_classes))
for k in np.arange(0,num_classes):
    p = clf[k].predict_proba(X_test)
    prob_table[:,k] = p[:,list(clf[k].classes_).index(True)]
Y_pred = prob_table.argmax(axis=1)

print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n") 

Test accuracy is 0.21, whereas OneVsRestClassifier returns 0.59. For completeness, I also report the other code (the pre-processing steps are the same as before):

....
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced'))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n")

Is there something wrong in my own implementation of SVM-OVA?


Solution

  • Is there something wrong in my own implementation of SVM-OVA?

    You have the unique classes array([3, 4, 5, 6, 7, 8, 9]); however, the line Y_pred = prob_table.argmax(axis=1) assumes they are 0-indexed.
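
    To see the mismatch concretely, here is a minimal sketch (the probability row is made up for illustration): argmax returns a column index in 0..6, which still has to be mapped back to the actual labels 3..9:

    import numpy as np

    my_classes = np.array([3, 4, 5, 6, 7, 8, 9])
    # One made-up row of per-class probabilities; column 2 is the largest
    prob_table = np.array([[0.10, 0.20, 0.60, 0.05, 0.03, 0.01, 0.01]])

    idx = prob_table.argmax(axis=1)
    print(idx)              # [2] -> a column index, not a quality label
    print(my_classes[idx])  # [5] -> the actual class label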

    Try refactoring your code so it is less prone to assumptions like that:

    import pandas as pd
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('winequality-white.csv', delimiter=';')
    y = df["quality"]
    my_classes = np.unique(y)
    X = df.drop("quality", axis=1)
    
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=42)
    
    # Models
    clfs = []

    for k in my_classes:
        my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced',
                       probability=True, random_state=42)
        clfs.append(my_model.fit(X_train, Y_train == k))
    
    # Prediction
    prob_table = np.zeros((len(X_test), len(my_classes)))

    for i, clf in enumerate(clfs):
        # classes_ is [False, True] for a boolean target, so column 1 is P(y == k)
        probs = clf.predict_proba(X_test)[:, 1]
        prob_table[:, i] = probs

    # Map argmax column indices back to the actual class labels
    Y_pred = my_classes[prob_table.argmax(1)]
    print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)
    
    from sklearn.multiclass import OneVsRestClassifier
    clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf',
                                  class_weight='balanced', random_state=42))
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)
    print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)
    

    Test accuracy =  61.795918367346935
    Test accuracy =  58.93877551020408
    

    Note the difference: OVR based on probabilities is more fine-grained and yields better results here than OVR based on hard labels.
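
    As a toy illustration (the numbers below are made up): hard 0/1 votes can tie, whereas continuous probabilities single out the most confident class:

    import numpy as np

    # Made-up outputs of four binary classifiers for a single sample
    votes = np.array([[0, 1, 1, 0]])              # two classifiers claim the sample
    probs = np.array([[0.10, 0.55, 0.62, 0.20]])  # continuous scores break the tie

    print(votes.argmax(1))  # [1] -> the first maximum wins, arbitrarily
    print(probs.argmax(1))  # [2] -> the genuinely most confident class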

    For further experiments, you may wish to wrap the classifier in a reusable class:

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class OVRBinomial(BaseEstimator, ClassifierMixin):

        def __init__(self, cls):
            self.cls = cls

        def fit(self, X, y, **kwargs):
            # Train one binary classifier per class; kwargs are passed to
            # the wrapped estimator's constructor
            self.classes_ = np.unique(y)
            self.clfs_ = []
            for c in self.classes_:
                clf = self.cls(**kwargs)
                clf.fit(X, y == c)
                self.clfs_.append(clf)
            return self

        def predict(self, X, **kwargs):
            # Collect P(y == c) from each binary classifier...
            probs = np.zeros((len(X), len(self.classes_)))
            for i, c in enumerate(self.classes_):
                prob = self.clfs_[i].predict_proba(X, **kwargs)[:, 1]
                probs[:, i] = prob
            # ...and map the argmax back to the original class labels
            idx_max = np.argmax(probs, 1)
            return self.classes_[idx_max]
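
    A quick usage sketch, reusing the X_train/Y_train/X_test/Y_test split from above; note that probability=True is required because predict() relies on predict_proba:

    ovr = OVRBinomial(SVC)
    ovr.fit(X_train, Y_train, gamma='auto', C=1000, kernel='rbf',
            class_weight='balanced', probability=True, random_state=42)
    Y_pred = ovr.predict(X_test)
    print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)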