Tags: java, nlp, text-classification, naive-bayes, opennlp

OpenNLP Document Categorizer: how to classify documents based on status when the documents are not in English, and what are the default features?


I want to classify my documents using OpenNLP's Document Categorizer, based on their status: pre-opened, opened, locked, closed, etc.

I have 5 classes, I'm using the Naive Bayes algorithm, I have 60 documents in my training set, and I trained the set for 1000 iterations with a cutoff parameter of 1.

But no success: when I test the model, I don't get good results. I was wondering whether it is because of the language of the documents (it is not English), or whether I should somehow add the statuses as features. I left the default features in the categorizer, and I'm not very familiar with them.

The result should be locked, but it is categorized as opened.

InputStreamFactory in = null;
try {
    in = new MarkableFileInputStreamFactory(new File("D:\\JavaNlp\\doccategorizer\\doccategorizer.txt"));
}
catch (FileNotFoundException e2) {
    System.out.println("Creating new input stream");
    e2.printStackTrace();
}

ObjectStream<String> lineStream = null;
ObjectStream<DocumentSample> sampleStream = null;

try {
    lineStream = new PlainTextByLineStream(in, "UTF-8");
    sampleStream = new DocumentSampleStream(lineStream);
}
catch (IOException e1) {
    System.out.println("Document Sample Stream");
    e1.printStackTrace();
}

TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, "1000");
params.put(TrainingParameters.CUTOFF_PARAM, "1");
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

DoccatModel model = null;
try {
    model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
}
catch (IOException e) {
    System.out.println("Training...");
    e.printStackTrace();
}

System.out.println("\nModel is successfully trained.");

BufferedOutputStream modelOut = null;
try {
    modelOut = new BufferedOutputStream(new FileOutputStream("D:\\JavaNlp\\doccategorizer\\classifier-maxent.bin"));
}
catch (FileNotFoundException e) {
    System.out.println("Creating output stream");
    e.printStackTrace();
}

try {
    model.serialize(modelOut);
}
catch (IOException e) {
    System.out.println("Serialize...");
    e.printStackTrace();
}

System.out.println("\nTrained model is kept in: " + "model" + File.separator + "en-cases-classifier-maxent.bin");

DocumentCategorizer doccat = new DocumentCategorizerME(model);
String[] docWords = "Some text here...".replaceAll("[^A-Za-z]", " ").split(" ");
double[] aProbs = doccat.categorize(docWords);

System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
for (int i = 0; i < doccat.getNumberOfCategories(); i++) {
    System.out.println(doccat.getCategory(i) + " : " + aProbs[i]);
}
System.out.println("---------------------------------");

System.out.println("\n" + doccat.getBestCategory(aProbs) + " : is the category for the given sentence");

(screenshots of the classifier output omitted)

Can someone suggest how I can categorize my documents better? For example, should I add a language detector first, or add new features?

Thanks in advance.


Solution

  • By default, the document categorizer takes the document text and forms a bag of words; each word in the bag becomes a feature. As long as the language can be tokenized by an English tokenizer (again, by default, a whitespace tokenizer), I would guess that the language is not your problem. I would check the format of the data you are using for training. It should be formatted like this:

    category<tab>document text
    

    The text should fit on one line. The OpenNLP documentation for the document categorizer can be found at http://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training.tool

    It would be helpful if you could provide a line or two of training data to help examine the format.

    Edit: Another potential issue: 60 documents may not be enough to train a good classifier, particularly if you have a large vocabulary. Also, even though the text is not English, please tell me it is not in multiple languages. Finally, is the document text the best way to classify the document? Would metadata from the document itself produce better features?

    Hope it helps.
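To make the two points above concrete, here is a small stdlib-only Java sketch (not OpenNLP itself; the class and method names are my own) of what the expected training format and the default bag-of-words featurization look like: each line is `category<TAB>text`, and the text is whitespace-tokenized into per-word count features.

```java
import java.util.HashMap;
import java.util.Map;

public class DoccatSketch {

    // One training sample: "category<TAB>document text", all on a single line.
    static final class Sample {
        final String category;
        final String text;
        Sample(String category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    // Parse one training line; fails if the tab separator is missing,
    // which is a common formatting mistake in doccat training files.
    static Sample parseSample(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected category<TAB>text, got: " + line);
        }
        return new Sample(line.substring(0, tab), line.substring(tab + 1));
    }

    // Default-style featurization: whitespace-tokenize and count each token.
    // This works for any language whose words are separated by spaces.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Non-English text tokenizes fine as long as spaces separate words.
        Sample s = parseSample("locked\tdossier verrouillé par l'administrateur");
        System.out.println(s.category + " -> " + bagOfWords(s.text));
    }
}
```

Checking a few of your training lines against a parser like this quickly reveals missing tabs or multi-line documents, which silently degrade the trained model.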