python machine-learning svm libsvm data-representation

multiclass representation of LIBSVM


My goal is to build a multiclass classifier for a set of files, each of which will be labeled with at least two classes (or labels). These files are parliamentary initiatives, so each one will be indexed in a thesaurus under at least one pair of values.

I'm using the Python version of 'libsvm', because stopword removal, tokenization and stemming seemed easier to do in Python, thanks to tools like Snowball and NLTK.

This version can't do multi-class classification directly. However, it is possible to build a multiclass classifier by generating a total of k * (k-1) / 2 models (where 'k' is the number of classes).

The representation for LIBSVM is:

<class/target>[ <attribute number>:<attribute value>]*   

Then, for a file with 5 classes, should I generate the previous line 5 times, changing only the class?

For example:

1 1:3 2:4 6:5….
2 1:3 2:4 6:5….
3 1:3 2:4 6:5….
4 1:3 2:4 6:5….
5 1:3 2:4 6:5….

Thanks and regards.


Solution

  • You are confusing two scenarios:

    • multiclass scenario - where there are more than 2 classes in general but each object is assigned exactly one
    • multilabel scenario - where there are multiple labels assigned to each object

    SVM cannot do either of the above in its basic formulation/implementation, although both problems can easily be decomposed into binary ones.

    The first is often approached using one vs all or one vs one, both implemented in scikit-learn, which gives you a Python binding to libsvm.
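    As a rough illustration of the one vs one decomposition: one binary model is trained per unordered pair of classes, which gives k * (k-1) / 2 models for k classes. The sketch below only enumerates those pairs with the standard library; the helper name one_vs_one_pairs is made up for this example and is not part of libsvm or scikit-learn.

```python
from itertools import combinations


def one_vs_one_pairs(classes):
    """Return every unordered pair of classes (hypothetical helper).

    In a one vs one scheme, one binary model would be trained per pair.
    """
    return list(combinations(sorted(classes), 2))


classes = [1, 2, 3, 4, 5]
pairs = one_vs_one_pairs(classes)
k = len(classes)

# Number of pairs found, next to the closed-form count k * (k-1) / 2.
print(len(pairs), k * (k - 1) // 2)
print(pairs)
```

    One vs all would instead train k models, one per class against all the others.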

    Your scenario looks more like multilabel. In that case a basic SVM can only be used by splitting your problem into K independent ones: create K distinct training sets, each answering the question "Does the given file have label i?", and train K distinct SVMs, each of which gives you one bit of your answer. This assumes that the labels are assigned independently, which is a simplification, but any other approach would require a structural SVM, like the one available in svmstruct.
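    A minimal sketch of that splitting step, assuming each file is already available as a set of labels plus a sparse feature dictionary. The toy data and the helper names to_libsvm_line and binary_training_sets are invented for illustration; only the line format comes from the question.

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...' with ascending indices."""
    pairs = " ".join(f"{idx}:{val}" for idx, val in sorted(features.items()))
    return f"{label} {pairs}"


def binary_training_sets(documents, all_labels):
    """Build one list of LIBSVM lines per label.

    For label i, a document gets target 1 if it carries label i, else -1.
    """
    sets = {}
    for label in all_labels:
        sets[label] = [
            to_libsvm_line(1 if label in doc_labels else -1, features)
            for doc_labels, features in documents
        ]
    return sets


# Hypothetical toy data: (set of labels, {attribute number: attribute value}).
documents = [
    ({1, 3}, {1: 3, 2: 4, 6: 5}),
    ({2}, {1: 1, 4: 2}),
    ({1, 2}, {2: 7, 5: 1}),
]

training_sets = binary_training_sets(documents, all_labels=[1, 2, 3])
for label, lines in training_sets.items():
    print(f"training set for label {label}:")
    for line in lines:
        print("  " + line)
```

    Each of the K lists would then be written to its own file and used to train its own binary SVM. Every document appears once in every list, with only the target changing.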

    You cannot create a single libsvm training file for multilabel classification. The documentation you cite refers to multiclass, which is not your case, and simply requires using K different label names, not replicating rows.
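    For contrast, here is a sketch of what a multiclass training file looks like: each example appears exactly once, with a single class as its target. The data and the helper name to_libsvm_line are again made up for illustration.

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...' with ascending indices."""
    pairs = " ".join(f"{idx}:{val}" for idx, val in sorted(features.items()))
    return f"{label} {pairs}"


# Hypothetical toy data: (single class, {attribute number: attribute value}).
examples = [
    (1, {1: 3, 2: 4, 6: 5}),
    (2, {1: 1, 4: 2}),
    (5, {2: 7, 5: 1}),
]

# One line per example; rows are never replicated across classes.
lines = [to_libsvm_line(label, features) for label, features in examples]
print("\n".join(lines))
```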