python machine-learning scikit-learn svm

Predict training data in sklearn


I use scikit-learn's SVM like so:

from sklearn import svm

clf = svm.SVC()
clf.fit(td_X, td_y)

When I use the classifier to predict the class of a member of the training set, could it ever be wrong, even in scikit-learn's implementation (e.g. could clf.predict([td_X[a]])[0] == td_y[a] ever be False)?


Solution

  • Yes, definitely. Run this code, for example:

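    from sklearn import svm
    import numpy as np

    clf = svm.SVC()  # default settings: RBF kernel, C=1.0
    np.random.seed(seed=42)
    # 100 two-dimensional Gaussian points with random binary labels
    x = np.random.normal(loc=0.0, scale=1.0, size=[100, 2])
    y = np.random.randint(2, size=100)
    clf.fit(x, y)
    print(clf.score(x, y))  # accuracy on the training data itself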

    The score is 0.61 (the exact value may vary slightly between scikit-learn versions), so nearly 40% of the training data is misclassified. Part of the reason is that, even though the default kernel is 'rbf' (which in theory can classify any training set perfectly, as long as no two identical training points have different labels), the model is also regularized to reduce overfitting. The default regularization parameter is C=1.0, and a smaller C means stronger regularization.

    If you run the same code as above but replace clf = svm.SVC() with clf = svm.SVC(C=200000), you'll get an accuracy of 0.94, because a larger C penalizes misclassified training points more heavily.
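To see how training accuracy responds to C, the experiment above can be repeated over a few values. This is a minimal sketch, not part of the original answer: the intermediate value C=100.0 and the `scores` dictionary are illustrative choices. The data is generated the same way as above, and the exact scores may depend on your scikit-learn version.

```python
from sklearn import svm
import numpy as np

# Same data as in the answer: random points with random binary labels.
rng = np.random.RandomState(42)
x = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
y = rng.randint(2, size=100)

# Fit one RBF-kernel SVC per C value and record its training accuracy.
# The C values are illustrative; larger C means weaker regularization.
scores = {}
for C in (1.0, 100.0, 200000.0):
    clf = svm.SVC(C=C)
    clf.fit(x, y)
    scores[C] = clf.score(x, y)
    print(f"C={C}: training accuracy = {scores[C]:.2f}")
```

Because the labels here are pure noise, a high training accuracy at large C reflects memorization of the training set rather than anything that would generalize to new data.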