java classification training-data opennlp

Testing an OpenNLP classifier model

I'm currently training a classifier model. Yesterday I found out that the results will be more accurate if you also test the trained classifier model. I searched the internet for how to test a model (testing openNLP model), but I can't get it to work. I think the reason is that I'm using OpenNLP version 1.8.3 instead of 1.5. Could anyone explain how to properly test my model in this version of OpenNLP?

Thanks in advance.

Below is how I'm training my model:

public static DoccatModel trainClassifier() throws IOException
    {
        // read the training data
        final int iterations = 100;
        InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/trainingssetTest.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

        // define the training parameters
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, iterations+"");
        params.put(TrainingParameters.CUTOFF_PARAM, 0+"");
        params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

        // create a model from training data
        DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());

        return model;
    }

Solution

I can think of two ways to test your model. Either way, you will need annotated documents (and by annotated I really mean expert-classified).

The first way involves using the opennlp DoccatEvaluator command-line tool. The syntax would be something akin to

opennlp DoccatEvaluator -model model -data sampleData

The format of your sampleData should be

OUTCOME <document text....>

Documents are separated by the newline character, i.e. one document per line.
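To make the format concrete, here is a minimal sketch in plain Java (no OpenNLP calls) of how one such line splits into its outcome and its document text. The names LabeledLine and parse are made up for illustration and are not part of OpenNLP.

```java
import java.util.Optional;

// Hypothetical helper, not part of OpenNLP: one line of test data in the
// "OUTCOME <document text>" format.
final class LabeledLine {
    final String outcome;
    final String text;

    private LabeledLine(String outcome, String text) {
        this.outcome = outcome;
        this.text = text;
    }

    // Splits a line at the first run of whitespace: the first token is the
    // outcome, the rest is the document text. Returns an empty Optional for
    // blank lines and for lines that carry an outcome but no text.
    static Optional<LabeledLine> parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2 || parts[1].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new LabeledLine(parts[0], parts[1]));
    }
}
```

A line whose first token is the category and whose remainder is the document parses into those two parts; anything else is skipped rather than guessed at.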

The second way involves creating a DocumentCategorizer, something like the following (model is the DoccatModel from your question):

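// needs java.util.Arrays plus the opennlp.tools.doccat and opennlp.tools.tokenize classes
DocumentCategorizer categorizer = new DocumentCategorizerME(model);

// could also use: Tokenizer tokenizer = new TokenizerME(tokenizerModel)
Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE; // a static field, not a method

int right = 0;
int wrong = 0;

// lineStream is like in your question, but reading the annotated test data
for (String sample = lineStream.read(); sample != null; sample = lineStream.read()) {
    String[] tokens = tokenizer.tokenize(sample);
    if (tokens.length < 2) {
        continue; // skip blank lines and lines without document text
    }
    String expectedOutcome = tokens[0]; // first token is the expert-assigned category
    String[] docTokens = Arrays.copyOfRange(tokens, 1, tokens.length);

    double[] outcomeProb = categorizer.categorize(docTokens);
    String sampleOutcome = categorizer.getBestCategory(outcomeProb);

    // check if the outcome is right and keep track of # right and wrong
    if (sampleOutcome.equals(expectedOutcome)) {
        right++;
    } else {
        wrong++;
    }
}
// calculate agreement metric of your choice, e.g. accuracy
double accuracy = (right + wrong) == 0 ? 0.0 : (double) right / (right + wrong);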

Since I typed the code here, there may be a syntax error or two (either I or the SO community can fix them), but the idea stands: run through your data, tokenize each document, run it through the document categorizer, and keep track of the results. That is how you want to evaluate your model.
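As for the "agreement metric of your choice", accuracy and per-category precision and recall are common picks. Below is a rough sketch in plain Java that works on two parallel lists of expected and predicted categories, such as you could collect in a loop like the one above. The class EvalMetrics and its method names are invented for this example and are not OpenNLP API.

```java
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class EvalMetrics {
    private EvalMetrics() {}

    // Fraction of documents whose predicted category equals the expected one.
    static double accuracy(List<String> expected, List<String> predicted) {
        requireSameSize(expected, predicted);
        if (expected.isEmpty()) {
            return 0.0;
        }
        int right = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (expected.get(i).equals(predicted.get(i))) {
                right++;
            }
        }
        return (double) right / expected.size();
    }

    // Of the documents predicted as `category`, the fraction that really belong to it.
    static double precision(String category, List<String> expected, List<String> predicted) {
        requireSameSize(expected, predicted);
        int truePositives = 0;
        int predictedAsCategory = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (predicted.get(i).equals(category)) {
                predictedAsCategory++;
                if (expected.get(i).equals(category)) {
                    truePositives++;
                }
            }
        }
        return predictedAsCategory == 0 ? 0.0 : (double) truePositives / predictedAsCategory;
    }

    // Of the documents that really belong to `category`, the fraction that was found.
    static double recall(String category, List<String> expected, List<String> predicted) {
        requireSameSize(expected, predicted);
        int truePositives = 0;
        int actuallyCategory = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (expected.get(i).equals(category)) {
                actuallyCategory++;
                if (predicted.get(i).equals(category)) {
                    truePositives++;
                }
            }
        }
        return actuallyCategory == 0 ? 0.0 : (double) truePositives / actuallyCategory;
    }

    private static void requireSameSize(List<String> expected, List<String> predicted) {
        if (expected.size() != predicted.size()) {
            throw new IllegalArgumentException("expected and predicted must have the same size");
        }
    }
}
```

Accuracy alone can look good when one category dominates the test set, so it is worth checking precision and recall for each category as well.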

Hope it helps...