machine-learning, scikit-learn, svm, libsvm, sklearn-pandas

Confused about sklearn’s implementation of OSVM


I have recently started experimenting with OneClassSVM (using sklearn) for unsupervised learning, and I followed this example.

I apologize for the silly questions, but I'm a bit confused about two things:

  1. Should I train my SVM on both the regular examples and the outliers, or on the regular examples only?

  2. Which of the labels predicted by the OSVM represents outliers: 1 or -1?

Once again I apologize for these questions, but for some reason I cannot find this documented anywhere.


Solution

  • As the example you reference is about novelty detection, the docs say:

    novelty detection:

    The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.

    Meaning: you should train on regular examples only (a runnable sketch tying this together follows at the end of this answer).

    The approach is based on:

    Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471.

    Extract:

    Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.

    We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.

    The above docs also say:

    Inliers are labeled 1, while outliers are labeled -1.

    This can also be seen in the example code you followed, extracted here:

    # Generate some regular novel observations
    X = 0.3 * np.random.randn(20, 2)
    X_test = np.r_[X + 2, X - 2]
    ...
    # all test points are regular, i.e. inliers (generated above)
    y_pred_test = clf.predict(X_test)
    ...
    # -1 marks an outlier, so every -1 here is an error: X_test contains only inliers
    n_error_test = y_pred_test[y_pred_test == -1].size
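
    Putting both answers together: below is a minimal, self-contained sketch. The data shapes and parameter values (gamma=0.1, nu=0.1) are illustrative choices of mine, not taken from your code. Note that nu plays the role of the paper's "a priori specified value between 0 and 1": an upper bound on the fraction of training points treated as outliers.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(42)

    # 1. Fit on regular (inlier) observations only -- novelty detection.
    X_train = 0.3 * rng.randn(100, 2) + 2              # inliers clustered around (2, 2)
    clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
    clf.fit(X_train)

    # 2. Predict on new observations: 1 = inlier, -1 = outlier.
    X_regular = 0.3 * rng.randn(10, 2) + 2             # new points from the same cluster
    X_outliers = rng.uniform(low=-4, high=0, size=(10, 2))  # clearly off-cluster
    print(clf.predict(X_regular))    # mostly  1 -> inliers
    print(clf.predict(X_outliers))   # -1 -> outliers

    # 3. decision_function is the estimated f from the paper: positive on S
    #    (inliers), negative on its complement (outliers).
    print(clf.decision_function(X_outliers) < 0)       # expect all True

    predict is effectively the sign of decision_function, which is why the two label values are exactly 1 and -1.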