python scikit-learn svm multiclass-classification

Something wrong when implementing SVM One-vs-All in Python


I was trying to verify that I had correctly understood how SVM-OVA (One-versus-All) works by comparing scikit-learn's OneVsRestClassifier with my own implementation.

In the following code, I train num_classes binary classifiers in the training phase, then run all of them on the test set and select the class whose classifier returns the highest probability.

import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import scale

# Read dataset 
df = pd.read_csv('In/winequality-white.csv',  delimiter=';')
X = df.loc[:, df.columns != 'quality']
Y = df.loc[:, df.columns == 'quality']
my_classes = np.unique(Y)
num_classes = len(my_classes)

# Train-test split
np.random.seed(42)
msk = np.random.rand(len(df)) <= 0.8
train = df[msk]
test = df[~msk]

# From dataset to features and labels
X_train = train.loc[:, train.columns != 'quality']
Y_train = train.loc[:, train.columns == 'quality']
X_test = test.loc[:, test.columns != 'quality']
Y_test = test.loc[:, test.columns == 'quality']

# Models
clf =  [None] * num_classes
for k in np.arange(0,num_classes):
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced', probability=True)
    clf[k] = my_model.fit(X_train, Y_train==my_classes[k])

# Prediction
prob_table = np.zeros((len(Y_test), num_classes))
for k in np.arange(0,num_classes):
    p = clf[k].predict_proba(X_test)
    prob_table[:,k] = p[:,list(clf[k].classes_).index(True)]
Y_pred = prob_table.argmax(axis=1)

print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n") 

Test accuracy is 0.21, whereas OneVsRestClassifier returns 0.59. For completeness, I also report the other code (the pre-processing steps are the same as before):

....
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced'))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n")

Is there something wrong in my own implementation of SVM-OVA?


Solution

  • Is there something wrong in my own implementation of SVM-OVA?

    You have the unique classes array([3, 4, 5, 6, 7, 8, 9]); however, the line Y_pred = prob_table.argmax(axis=1) assumes they are 0-indexed.
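
    To see the mismatch concretely, here is a minimal sketch (the probability row is made up for illustration): argmax returns a column index in 0..6, which still has to be mapped back to the actual labels 3..9:

    import numpy as np

    my_classes = np.array([3, 4, 5, 6, 7, 8, 9])
    # One made-up row of per-class probabilities; column 2 is the largest
    prob_table = np.array([[0.10, 0.20, 0.60, 0.05, 0.03, 0.01, 0.01]])

    idx = prob_table.argmax(axis=1)
    print(idx)              # [2] -> a column index, not a quality label
    print(my_classes[idx])  # [5] -> the actual class label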

    Try refactoring your code so it is less prone to assumptions like that:

    import pandas as pd
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('winequality-white.csv', delimiter=';')
    y = df["quality"]
    my_classes = np.unique(y)
    X = df.drop("quality", axis=1)
    
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=42)
    
    # Models
    clfs = []

    for k in my_classes:
        my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced',
                       probability=True, random_state=42)
        clfs.append(my_model.fit(X_train, Y_train == k))
    
    # Prediction
    prob_table = np.zeros((len(X_test), len(my_classes)))

    for i, clf in enumerate(clfs):
        # classes_ is [False, True] for a boolean target, so column 1 is P(y == k)
        probs = clf.predict_proba(X_test)[:, 1]
        prob_table[:, i] = probs

    # Map argmax column indices back to the actual class labels
    Y_pred = my_classes[prob_table.argmax(1)]
    print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)
    
    from sklearn.multiclass import OneVsRestClassifier
    clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf',
                                  class_weight='balanced', random_state=42))
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)
    print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)
    

    Test accuracy =  61.795918367346935
    Test accuracy =  58.93877551020408
    

    Note the difference: OVR based on probabilities is more fine-grained and yields better results here than OVR based on hard labels.
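
    As a toy illustration (the numbers below are made up): hard 0/1 votes can tie, whereas continuous probabilities single out the most confident class:

    import numpy as np

    # Made-up outputs of four binary classifiers for a single sample
    votes = np.array([[0, 1, 1, 0]])              # two classifiers claim the sample
    probs = np.array([[0.10, 0.55, 0.62, 0.20]])  # continuous scores break the tie

    print(votes.argmax(1))  # [1] -> the first maximum wins, arbitrarily
    print(probs.argmax(1))  # [2] -> the genuinely most confident class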

    For further experiments, you may wish to wrap the classifier in a reusable class:

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class OVRBinomial(BaseEstimator, ClassifierMixin):

        def __init__(self, cls):
            self.cls = cls

        def fit(self, X, y, **kwargs):
            # Train one binary classifier per class; kwargs are passed to
            # the wrapped estimator's constructor
            self.classes_ = np.unique(y)
            self.clfs_ = []
            for c in self.classes_:
                clf = self.cls(**kwargs)
                clf.fit(X, y == c)
                self.clfs_.append(clf)
            return self

        def predict(self, X, **kwargs):
            # Collect P(y == c) from each binary classifier...
            probs = np.zeros((len(X), len(self.classes_)))
            for i, c in enumerate(self.classes_):
                prob = self.clfs_[i].predict_proba(X, **kwargs)[:, 1]
                probs[:, i] = prob
            # ...and map the argmax back to the original class labels
            idx_max = np.argmax(probs, 1)
            return self.classes_[idx_max]
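
    A quick usage sketch, reusing the X_train/Y_train/X_test/Y_test split from above; note that probability=True is required because predict() relies on predict_proba:

    ovr = OVRBinomial(SVC)
    ovr.fit(X_train, Y_train, gamma='auto', C=1000, kernel='rbf',
            class_weight='balanced', probability=True, random_state=42)
    Y_pred = ovr.predict(X_test)
    print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)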