Tags: machine-learning, scikit-learn, svm, supervised-learning

How should I train a machine learning algorithm on data with a large class imbalance? (SVM)


I am trying to train an SVM on click and conversion data from people who see banners. The main problem is that clicks make up only about 0.2% of all the data, so there is a severe class imbalance. When I use a plain SVM, in the testing phase it always predicts the "view" class and never "click" or "conversion". On average it gives 99.8% correct answers (because of the imbalance), but its predictions are 0% correct if you check the "click" or "conversion" classes. How can I tune the SVM algorithm (or pick another one) to take the imbalance into account?


Solution

  • The most basic approach here is to use the so-called "class weighting scheme": in the classical SVM formulation there is a C parameter used to control the misclassification penalty. It can be split into parameters C1 and C2, used for class 1 and class 2 respectively. The most common choice of C1 and C2 for a given C is to put

    C1 = C / n1
    C2 = C / n2
    

    where n1 and n2 are the sizes of class 1 and class 2 respectively. This way you "punish" the SVM for misclassifying the less frequent class much harder than for misclassifying the more common one.
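
    As a minimal sketch of that heuristic (assuming scikit-learn; the data here is random and purely illustrative), the per-class penalties can be passed as a class_weight dictionary, which multiplies C for each class:

    import numpy as np
    from sklearn import svm

    # illustrative imbalanced labels: 1000 negatives, 100 positives
    y = np.array([0] * 1000 + [1] * 100)
    X = np.random.RandomState(0).randn(len(y), 2)

    # weights inversely proportional to class sizes,
    # normalized as n_samples / (n_classes * n_in_class)
    n1, n2 = np.sum(y == 0), np.sum(y == 1)
    weights = {0: len(y) / (2.0 * n1), 1: len(y) / (2.0 * n2)}

    # class_weight scales C per class: the effective C_i is C * weights[i]
    clf = svm.SVC(kernel='linear', C=1.0, class_weight=weights)
    clf.fit(X, y)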

    Many existing libraries (like libSVM) support this mechanism through class weight parameters.
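
    With the libSVM command-line tools, for example, the -w option rescales C per class (-wi sets the C of class i to weight*C); a sketch, where the training-file name is illustrative and the weight 500 mirrors the 0.2% click rate:

    svm-train -w0 1 -w1 500 clicks.train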

    Example using Python and sklearn:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import svm
    
    # create two separable clusters: 1000 majority-class points
    # and 100 minority-class points
    rng = np.random.RandomState(0)
    n_samples_1 = 1000
    n_samples_2 = 100
    X = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
    y = [0] * n_samples_1 + [1] * n_samples_2
    
    # fit the model and get the separating hyperplane
    clf = svm.SVC(kernel='linear', C=1.0)
    clf.fit(X, y)
    
    # hyperplane w0*x0 + w1*x1 + b = 0, rewritten as x1 = -(w0/w1)*x0 - b/w1
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(-5, 5)
    yy = a * xx - clf.intercept_[0] / w[1]
    
    
    # get the separating hyperplane using weighted classes
    wclf = svm.SVC(kernel='linear', class_weight={1: 10})
    wclf.fit(X, y)
    
    ww = wclf.coef_[0]
    wa = -ww[0] / ww[1]
    wyy = wa * xx - wclf.intercept_[0] / ww[1]
    
    # plot the two separating hyperplanes and the samples
    plt.plot(xx, yy, 'k-', label='no weights')
    plt.plot(xx, wyy, 'k--', label='with weights')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.legend()

    plt.axis('tight')
    plt.show()
    

    In particular, in sklearn you can simply turn on automatic weighting by setting class_weight='balanced' (called 'auto' in older versions), which weights classes inversely proportionally to their frequencies.
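
    As a short sketch reusing X and y from the example above (classification_report here is just one convenient way to surface per-class recall, which raw accuracy hides):

    from sklearn.metrics import classification_report

    # 'balanced' sets each class weight to n_samples / (n_classes * n_in_class)
    bclf = svm.SVC(kernel='linear', class_weight='balanced')
    bclf.fit(X, y)

    # per-class precision/recall shows how the minority class is handled
    print(classification_report(y, bclf.predict(X)))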

    A visualization produced by the code above can be found in the sklearn documentation.