Search code examples
classificationmalletsvmlight

What is the correct svmlight input format in Mallet?


I am using Mallet with the SVMLight input format to do classification usingNaiveBayes classifier. But I get a NumberFormatException. I'm wondering how I can use strings features when using SVMLight. As I read in the guideline 1, the features can also be strings.

Can anyone help me what is wrong with my code or input?

Here is my code:

public void trainMalletNaiveBayes() throws Exception {

        ArrayList<Pipe> pipes = new ArrayList<Pipe>();
        pipes.add(new SvmLight2FeatureVectorAndLabel());
        pipes.add(new PrintInputAndTarget());

        SerialPipes pipe = new SerialPipes(pipes);

        //prepare training instances
        InstanceList trainingInstanceList = new InstanceList(pipe);

        trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/featureFiles_svm.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));

        //prepare test instances
        InstanceList testingInstanceList = new InstanceList(pipe);
        testingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/test_set.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));

        ClassifierTrainer trainer = new NaiveBayesTrainer();
        Classifier classifier = trainer.train(trainingInstanceList);

And here is the first three lines of my input file:

No f1:NP f2:NN f3:1 f4:1 f5:0 f6:0 f7:0 f8:0.0 f9:1 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:NN f17:NOTHING
No f1:NP f2:NN f3:8 f4:4 f5:0 f6:0 f7:1 f8:4.127134385045092 f9:8 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:DT f17:NOTHING
Yes f1:NP f2:NN f3:4 f4:3 f5:0 f6:0 f7:0 f8:0.0 f9:4 f10:true f11:false f12:false f13:false f14:false f15:NP f16:DT f17:NN

The first column is the label of the instance and there rest of the data includes the features and their values. For example, NN shows the POS of the head word of a phrase.

In the meantime, I get the exception for the NN (NumberFormatException: For input string: "NN") . I'm wondering why it doesn't have any problem with the NP which comes before that, but stops at the NN.


Solution

  • All features need to have numeric values. For booleans you can use true=1 and false=0. You would also have to modify f1:NP to f1_NP=1.

    The reason it's not dying on the NP is that the SvmLight2FeatureVectorAndLabel class is expecting to parse an entire line (label and data), but the code is reading the file with a CsvIterator that is splitting off the first element as a label.

    The classify.tui.SvmLight2Vectors class uses this code for an iterator:

    new SelectiveFileLineIterator (fileReader, "^\\s*#.+")