Tags: java, svm, libsvm

How to train data correctly using libsvm?


I want to use an SVM (support vector machine) in my program, but I cannot get correct results.

I want to know how training data must be prepared for an SVM.

What I am doing:

Suppose we have 5 documents (the numbers are just an example), 3 of which belong to the first category and the other 2 to the second category. I merge the documents within each category (i.e. the 3 documents in the first category are merged into one document), and then I build a training array like this:

double[][] train = new double[cat1.getDocument().getAttributes().size() + cat2.getDocument().getAttributes().size()][];

and I will fill the array like this:

int i = 0;
    Iterator<String> iteraitor = cat1.getDocument().getAttributes().keySet().iterator();
    Iterator<String> iteraitor2 = cat2.getDocument().getAttributes().keySet().iterator();
    while (i < train.length) {
        if (i < cat2.getDocument().getAttributes().size()) {
            while (iteraitor2.hasNext()) {
                String key = (String) iteraitor2.next();
                Long value = cat2.getDocument().getAttributes().get(key);
                double[] vals = { 0, value };
                train[i] = vals;
                i++;
                System.out.println(vals[0] + "," + vals[1]);
            }
        } else {
            while (iteraitor.hasNext()) {
                String key = (String) iteraitor.next();
                Long value = cat1.getDocument().getAttributes().get(key);
                double[] vals = { 1, value };
                train[i] = vals;
                i++;
                System.out.println(vals[0] + "," + vals[1]);
            }
            i++;
        }
    }

Then I continue like this to get the model:

svm_problem prob = new svm_problem();
    int dataCount = train.length;
    prob.y = new double[dataCount];
    prob.l = dataCount;
    prob.x = new svm_node[dataCount][];

    for (int k = 0; k < dataCount; k++) {
        double[] features = train[k];
        prob.x[k] = new svm_node[features.length - 1];
        for (int j = 1; j < features.length; j++) {
            svm_node node = new svm_node();
            node.index = j;
            node.value = features[j];
            prob.x[k][j - 1] = node;
        }
        prob.y[k] = features[0];
    }
    svm_parameter param = new svm_parameter();
    param.probability = 1;
    param.gamma = 0.5;
    param.nu = 0.5;
    param.C = 1;
    param.svm_type = svm_parameter.C_SVC;
    param.kernel_type = svm_parameter.LINEAR;
    param.cache_size = 20000;
    param.eps = 0.001;
    svm_model model =  svm.svm_train(prob, param);

Is this approach correct? If not, please help me fix it.


These two answers are correct: answer one, answer two.


Solution

  • Even without examining the code, one can find conceptual errors:

    Suppose we have 5 documents, 3 of which belong to the first category and the other 2 to the second category. I merge the documents within each category (i.e. the 3 documents in the first category are merged into one document), and then I build a training array like this

    So:

    • Training on 5 documents won't give any reasonable results with any machine learning model. These are statistical models, and there are no reasonable statistics to be drawn from 5 points in R^n, where n ~ 10,000.
    • You should not merge anything. Such an approach can work for Naive Bayes, which does not really treat documents as a "whole" but rather as probabilistic dependencies between features and classes. For an SVM, each document should be a separate point in the R^n space, where n can be the number of distinct words (for a bag-of-words / set-of-words representation); see the sketch below.
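
For comparison, here is a minimal sketch of that representation: every document becomes its own sparse bag-of-words vector built from a shared vocabulary, and only then is the whole set passed to svm_train. The toy corpus, the labels and the helper name toSparseVector are made up for illustration; the libsvm classes (svm, svm_problem, svm_node, svm_parameter) are the same ones already used in the question.

    import libsvm.*;

    import java.util.*;

    public class SvmTrainSketch {

        // Turn one document into a sparse libsvm vector using a shared vocabulary.
        static svm_node[] toSparseVector(String doc, Map<String, Integer> vocab) {
            Map<Integer, Double> counts = new TreeMap<>();   // keeps feature indices sorted
            for (String word : doc.toLowerCase().split("\\s+")) {
                Integer idx = vocab.get(word);
                if (idx != null) {
                    counts.merge(idx, 1.0, Double::sum);     // term frequency
                }
            }
            svm_node[] x = new svm_node[counts.size()];
            int j = 0;
            for (Map.Entry<Integer, Double> e : counts.entrySet()) {
                svm_node node = new svm_node();
                node.index = e.getKey();                     // libsvm feature indices start at 1
                node.value = e.getValue();
                x[j++] = node;
            }
            return x;
        }

        public static void main(String[] args) {
            // Toy corpus: 3 documents of category 1, 2 documents of category 0.
            String[] docs = {
                "support vector machine training",
                "svm kernel trick",
                "machine learning with svm",
                "football world cup",
                "football match results"
            };
            double[] labels = { 1, 1, 1, 0, 0 };

            // Shared vocabulary: word -> feature index (1-based).
            Map<String, Integer> vocab = new LinkedHashMap<>();
            for (String doc : docs) {
                for (String word : doc.toLowerCase().split("\\s+")) {
                    vocab.putIfAbsent(word, vocab.size() + 1);
                }
            }

            // One label and one sparse vector per document -- no merging.
            svm_problem prob = new svm_problem();
            prob.l = docs.length;
            prob.y = labels;
            prob.x = new svm_node[docs.length][];
            for (int i = 0; i < docs.length; i++) {
                prob.x[i] = toSparseVector(docs[i], vocab);
            }

            svm_parameter param = new svm_parameter();
            param.svm_type = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.LINEAR;
            param.C = 1;
            param.eps = 0.001;
            param.cache_size = 100;

            svm_model model = svm.svm_train(prob, param);

            // Classify a new document with the same vocabulary.
            double predicted = svm.svm_predict(model, toSparseVector("svm training", vocab));
            System.out.println("Predicted label: " + predicted);
        }
    }

Note that prob.l is now the number of documents, not the number of distinct words, which is the key structural difference from the merging approach in the question (and, as said above, 5 documents is still far too few to expect sensible results).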