This question has been asked here twice and didn't receive an answer. I'll try to be more informative this time.
Problem: I decided to rewrite a part-of-speech (POS) tagger in Java, thinking it would be much faster than the POS tagger I wrote in Python. For that purpose I decided to use OpenNLP's POSTaggerME. However, after running POSTaggerME on several text files, I concluded that this tagger is much slower than a less accurate tagger I use in Python. For example, tagging "Alice in Wonderland" took over 3 minutes on an Intel 987 1.5 GHz / 4 GB RAM laptop and 74 s on an office i5 3.3 GHz / 16 GB RAM machine, while the NLTK unigram POS tagger takes less than a second.
Question: Since I am only learning Java, I suspect my code is not optimized and is the reason for such a drop in speed. It could of course be that POSTaggerME is simply slow, but I highly doubt it.
Can you tell whether my code below has issues that may cause the slow tagging speed?
Here are the classes I think might cause the slowdown. The full GitHub Maven project is here: https://github.com/tastyminerals/POS-search-tool.git
Main class
imports (...)
public class MainApp {
public static void main(String[] args) {
// Speed benchmark
long start_time = System.currentTimeMillis();
String file = "test/Alice_in_Wonderland.docx";
Pair<String, ArrayList<String>> data = null;
String sents[] = null;
FileService fs = new FileService();
/*
* FileService returns a tuple with file textual data and an ArrayList
* of file meta data
*/
try {
data = fs.getFileData(file);
} catch (IOException | SAXException | TikaException e) {
e.printStackTrace();
}
// Detecting sentences in data
try {
sents = SentDetection.getSents(data.getValue0());
} catch (IOException e) {
e.printStackTrace();
}
long end_time1 = System.currentTimeMillis();
long difference = (end_time1 - start_time);
System.out.println("SentDetection time: " + difference);
// Tokenizing extracted sentences
String[] ts = null;
String[] tgs = null;
try {
//Loading model outside of POSTagging class to save resources
POSModel model = new POSModelLoader().load(new File(
"resources/models/pos/en-pos-maxent.bin"));
for (String s: sents) {
ts = Tokenizing.tokenize(s);
tgs = POSTagging.tag(s, ts, model);
//Printing the results
// int i = 0;
// for (String t: ts) {
// System.out.print(t + "_" + tgs[i] + " ");
// i += 1;
// }
// System.out.println("");
}
} catch (IOException e) {
e.printStackTrace();
}
// Speed benchmark
long end_time3 = System.currentTimeMillis();
long difference3 = (end_time3 - start_time) / 1000;
System.out.println("POSTagging time: " + difference3 + "s");
}
}
Tokenizer class
imports (...)
public class Tokenizing {
public static String[] tokenize(String sentence)
throws InvalidFormatException, IOException {
// Load the corresponding tokenizer model
InputStream is = new FileInputStream(
"resources/models/token-detection/en-token.bin");
TokenizerModel tmodel = new TokenizerModel(is);
// Instantiate TokenizerME with a trained model and tokenize string
Tokenizer tokenizer = new TokenizerME(tmodel);
String tokens[] = tokenizer.tokenize(sentence);
is.close();
return tokens;
}
}
POSTagger class
imports (...)
public class POSTagging {
public static String[] tag(String sentence, String[] tokenizedSent,
POSModel model) throws InvalidFormatException, IOException {
// PerformanceMonitor perfMon = new PerformanceMonitor(System.err,
// "sent");
POSTaggerME tagger = new POSTaggerME(model);
String[] taggedSent = tagger.tag(tokenizedSent);
// System.out.println(Arrays.toString(taggedSent));
// System.out.println(Arrays.toString(tokenizedSent));
return taggedSent;
}
}
Your test code is counting the time taken to load the models as well as the time taken to actually apply them to the text. Worse than that, you're reloading the tokenizer model once for each sentence instead of loading it once up front and reusing it, and you're also constructing a new POSTaggerME for every sentence inside POSTagging.tag.
If you want a reliable measurement, refactor your code to load all the models first, before you start timing, then run the tagging sequence a few hundred or a few thousand times and take the average.
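A minimal sketch of that refactor, assuming the model paths from your project and OpenNLP's standard 1.5-style API (here the POS model is loaded directly via the POSModel constructor rather than the cmdline POSModelLoader). The sentences are hypothetical placeholders for the output of your SentDetection.getSents call:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TaggerBenchmark {

    public static void main(String[] args) throws Exception {
        // Load both models ONCE, outside the timed region.
        TokenizerModel tokenModel;
        try (InputStream is = new FileInputStream(
                "resources/models/token-detection/en-token.bin")) {
            tokenModel = new TokenizerModel(is);
        }
        POSModel posModel;
        try (InputStream is = new FileInputStream(
                "resources/models/pos/en-pos-maxent.bin")) {
            posModel = new POSModel(is);
        }

        // Instantiate tokenizer and tagger once and reuse them for every sentence.
        Tokenizer tokenizer = new TokenizerME(tokenModel);
        POSTaggerME tagger = new POSTaggerME(posModel);

        // Placeholder input; in your project this comes from SentDetection.getSents(...)
        String[] sents = {
                "Alice was beginning to get very tired of sitting by her sister.",
                "So she was considering in her own mind whether the pleasure was worth it."
        };

        // Time many passes and report the average, so JIT warm-up and
        // one-off costs don't dominate the measurement.
        int runs = 100;
        long start = System.currentTimeMillis();
        for (int r = 0; r < runs; r++) {
            for (String s : sents) {
                String[] tokens = tokenizer.tokenize(s);
                String[] tags = tagger.tag(tokens);
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Average per pass: " + (elapsed / (double) runs) + " ms");
    }
}
```

With the model loading and tagger construction hoisted out of the per-sentence loop, the timed region measures only tokenization and tagging, which is what you actually want to compare against NLTK.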