Tags: python-3.x, machine-learning, scikit-learn, svm, ensemble-learning

SVM model averaging in sklearn


I would like to average the scores of two different SVMs trained on different samples but with the same classes.

# The data share the same labels: x_1[1] has label y_1[1] and x_2[1] has label y_2[1],
# where y_2[1] == y_1[1]
Dataset_1 = (x_1, y)
Dataset_2 = (x_2, y)
test_data = (test_sample, test_labels)

We have 50 classes, and they are the same for dataset_1 and dataset_2:

set(y_1) == set(y_2)

What I have tried:

from sklearn.svm import SVC

clf_1 = SVC(kernel='linear', random_state=42).fit(x_1, y)
clf_2 = SVC(kernel='linear', random_state=42).fit(x_2, y)

How can I average the clf_1 and clf_2 scores before calling:

predict(test_sample)

?

That is what I would like to do.


Solution

  • Not sure I understand your question; to simply average the scores as in a typical ensemble, you should first get prediction probabilities from each model separately, and then just take their average:

    # per-class probabilities from each model (requires probability=True, see below)
    pred1 = clf_1.predict_proba(test_sample)
    pred2 = clf_2.predict_proba(test_sample)
    # element-wise average of the two probability arrays
    pred = (pred1 + pred2) / 2
    

    In order to get prediction probabilities instead of hard classes, you should initialize the SVC using the additional argument probability=True.
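    For example, refitting the classifiers from the question (a sketch; the variable names are taken from the question):

    clf_1 = SVC(kernel='linear', probability=True, random_state=42).fit(x_1, y)
    clf_2 = SVC(kernel='linear', probability=True, random_state=42).fit(x_2, y)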

    Each row of pred will be an array of length 50 (one entry per class), with each element representing the probability that the sample belongs to the respective class.
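    As a quick sanity check (assuming test_sample is a 2-D array of test vectors, so len() gives the number of samples):

    import numpy as np
    # one row per test sample, one column per class; rows still sum to 1 after averaging
    assert pred.shape == (len(test_sample), 50)
    assert np.allclose(pred.sum(axis=1), 1.0)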

    After averaging, simply take the argmax of pred - just be sure that the order of the returned probabilities is OK; according to the docs:

    The columns correspond to the classes in sorted order, as they appear in the attribute classes_

    As I am not exactly sure what this means, run some checks with predictions on your training set, to be sure that the order is correct.
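    As a minimal sketch of such a check, you can verify that both fitted models expose their classes in the same column order, and then map each argmax back to the actual label via the classes_ attribute:

    import numpy as np

    # the averaged columns are only comparable if both models order the classes identically
    assert (clf_1.classes_ == clf_2.classes_).all()

    # convert each row's argmax (a column index) back to the corresponding class label
    final_pred = clf_1.classes_[np.argmax(pred, axis=1)]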