This question has been asked here twice and didn't receive an answer. I'll try to be more informative this time.
Problem: I decided to rewrite a part-of-speech (POS) tagger in Java, thinking it would be much faster than the POS tagger I wrote in Python. For that purpose I decided to use OpenNLP's POSTaggerME. However, after running POSTaggerME on several text files, I concluded that this tagger is much slower than a less accurate tagger I use in Python. For example, tagging "Alice in Wonderland" took over 3 minutes on an Intel 987 1.5 GHz / 4 GB RAM laptop and 74 s on an office i5 3.3 GHz / 16 GB RAM machine, while the NLTK unigram POS tagger takes less than a second.
Question: Since I am only learning Java, I suspect my code is not optimized and is the reason for such a drop in speed. It could of course be that POSTaggerME is simply slow, but I highly doubt it.
Can you tell whether my code below has issues that may cause the slow tagging speed?
Here are the classes I think might cause the slowdown. The full GitHub Maven project is here: https://github.com/tastyminerals/POS-search-tool.git
Main class
imports (...)
public class MainApp {
public static void main(String[] args) {
// Speed benchmark
long start_time = System.currentTimeMillis();
String file = "test/Alice_in_Wonderland.docx";
Pair<String, ArrayList<String>> data = null;
String sents[] = null;
FileService fs = new FileService();
/*
* FileService returns a tuple with file textual data and an ArrayList
* of file meta data
*/
try {
data = fs.getFileData(file);
} catch (IOException | SAXException | TikaException e) {
e.printStackTrace();
}
// Detecting sentences in data
try {
sents = SentDetection.getSents(data.getValue0());
} catch (IOException e) {
e.printStackTrace();
}
long end_time1 = System.currentTimeMillis();
long difference = (end_time1 - start_time);
System.out.println("SentDetection time: " + difference);
// Tokenizing extracted sentences
String[] ts = null;
String[] tgs = null;
try {
//Loading model outside of POSTagging class to save resources
POSModel model = new POSModelLoader().load(new File(
"resources/models/pos/en-pos-maxent.bin"));
for (String s: sents) {
ts = Tokenizing.tokenize(s);
tgs = POSTagging.tag(s, ts, model);
//Printing the results
// int i = 0;
// for (String t: ts) {
// System.out.print(t + "_" + tgs[i] + " ");
// i += 1;
// }
// System.out.println("");
}
} catch (IOException e) {
e.printStackTrace();
}
// Speed benchmark
long end_time3 = System.currentTimeMillis();
long difference3 = (end_time3 - start_time) / 1000;
System.out.println("POSTagging time: " + difference3 + "s");
}
}
Tokenizer class
imports (...)
public class Tokenizing {
public static String[] tokenize(String sentence)
throws InvalidFormatException, IOException {
// Load the corresponding tokenizer model
InputStream is = new FileInputStream(
"resources/models/token-detection/en-token.bin");
TokenizerModel tmodel = new TokenizerModel(is);
// Instantiate TokenizerME with a trained model and tokenize string
Tokenizer tokenizer = new TokenizerME(tmodel);
String tokens[] = tokenizer.tokenize(sentence);
is.close();
return tokens;
}
}
POSTagger class
imports (...)
public class POSTagging {
public static String[] tag(String sentence, String[] tokenizedSent,
POSModel model) throws InvalidFormatException, IOException {
// PerformanceMonitor perfMon = new PerformanceMonitor(System.err,
// "sent");
POSTaggerME tagger = new POSTaggerME(model);
String[] taggedSent = tagger.tag(tokenizedSent);
// System.out.println(Arrays.toString(taggedSent));
// System.out.println(Arrays.toString(tokenizedSent));
return taggedSent;
}
}
Your test code is counting the time taken to load the models as well as the time taken to actually apply them to the text. Worse than that, you're reloading the tokenizer model once for each sentence instead of loading it once up front and reusing it, and you're also constructing a new POSTaggerME for every sentence inside POSTagging.tag.
If you want a reliable measurement, refactor your code to load all the models first, before you start timing, then run the tagging sequence a few hundred or a few thousand times and take the average.
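A minimal sketch of that refactor, assuming the model paths from your project and OpenNLP's standard 1.5-style API (here the POS model is loaded directly via the POSModel constructor rather than the cmdline POSModelLoader). The sentences are hypothetical placeholders for the output of your SentDetection.getSents call:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TaggerBenchmark {

    public static void main(String[] args) throws Exception {
        // Load both models ONCE, outside the timed region.
        TokenizerModel tokenModel;
        try (InputStream is = new FileInputStream(
                "resources/models/token-detection/en-token.bin")) {
            tokenModel = new TokenizerModel(is);
        }
        POSModel posModel;
        try (InputStream is = new FileInputStream(
                "resources/models/pos/en-pos-maxent.bin")) {
            posModel = new POSModel(is);
        }

        // Instantiate tokenizer and tagger once and reuse them for every sentence.
        Tokenizer tokenizer = new TokenizerME(tokenModel);
        POSTaggerME tagger = new POSTaggerME(posModel);

        // Placeholder input; in your project this comes from SentDetection.getSents(...)
        String[] sents = {
                "Alice was beginning to get very tired of sitting by her sister.",
                "So she was considering in her own mind whether the pleasure was worth it."
        };

        // Time many passes and report the average, so JIT warm-up and
        // one-off costs don't dominate the measurement.
        int runs = 100;
        long start = System.currentTimeMillis();
        for (int r = 0; r < runs; r++) {
            for (String s : sents) {
                String[] tokens = tokenizer.tokenize(s);
                String[] tags = tagger.tag(tokens);
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Average per pass: " + (elapsed / (double) runs) + " ms");
    }
}
```

With the model loading and tagger construction hoisted out of the per-sentence loop, the timed region measures only tokenization and tagging, which is what you actually want to compare against NLTK.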