Tags: java, nlp, out-of-memory

Stanford NLP Annotation pipeline.annotate resulting in OutOfMemoryError in Java


We are using Stanford NLP to annotate input text, and these input texts are quite small. Below is one example.

"Can you give me details about Mohammad Siva John with identifiers 6745-3876-1354-8790 and 313-31-333"

Below is the Java code snippet to annotate.

    final Properties properties = new Properties();
    properties.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    final StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
    final Annotation document = new Annotation(text);
    pipeline.annotate(document);

Below is the Maven dependency.

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>4.5.4</version>
    </dependency>

This works fine, but after a couple of days the JVM crashes with a core dump. Core-dump analysis shows that the following line threw the OutOfMemoryError:

pipeline.annotate(document);

Any thoughts on how to resolve this? The class has no fields; all variables are method-local, so they should become eligible for garbage collection once the method returns. There should be no OutOfMemoryError in the first place.

Quite perplexing. Any thoughts?
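One thing worth noting about the "all variables are method-local" reasoning: a library can still retain memory in static caches that outlive every method call. Below is a minimal, hypothetical illustration of that pattern (the `StaticCacheDemo` class and its `CACHE` are mine, not CoreNLP's actual annotator pool):

```java
import java.util.ArrayList;
import java.util.List;

public class StaticCacheDemo {
    // A static cache outlives every method call that touches it,
    // much like a library-held pool of annotators.
    static final List<byte[]> CACHE = new ArrayList<>();

    static void annotateLikeCall() {
        byte[] local = new byte[1024 * 1024]; // method-local: collectable after return
        CACHE.add(new byte[1024 * 1024]);     // cached: retained until explicitly cleared
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            annotateLikeCall();
        }
        // Every method-local byte[] is gone, yet 100 MB is still reachable via CACHE.
        System.out.println("cached entries: " + CACHE.size());
        CACHE.clear(); // only now does the cached memory become collectable
    }
}
```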


Solution

  • I found this in the Stanford NLP Javadoc for void edu.stanford.nlp.pipeline.StanfordCoreNLP.clearAnnotatorPool():

    Call this if you are no longer using StanfordCoreNLP and want to release the memory associated with the annotators.

    Calling this method by itself has almost no impact; the real game changer is also calling System.gc(). The fix is simply to call clearAnnotatorPool() and System.gc() after annotate():

    final Properties properties = new Properties();
    properties.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    final StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
    final Annotation document = new Annotation(text);
    pipeline.annotate(document);
    
    // The two calls below fix the memory issue.
    StanfordCoreNLP.clearAnnotatorPool();
    System.gc();
    

    This is how used memory behaves with only the StanfordCoreNLP.clearAnnotatorPool() call, i.e. without System.gc(). Notice that over 1000 calls to annotate in a loop, used memory climbs past 2500 MB and then drops below 500 MB. This is what happens when we leave garbage collection entirely to the JVM.

    [Chart: used memory with and without annotator-pool clearing, no explicit GC call]

    However, when I call both StanfordCoreNLP.clearAnnotatorPool() and System.gc(), the results are drastically different: used memory stays within 132 to 133 MB, regardless of whether the annotator pool has been cleared.

    [Chart: used memory with and without annotator-pool clearing, with explicit GC call]

    One starts seeing the real effect of StanfordCoreNLP.clearAnnotatorPool() after roughly 150 hits to pipeline.annotate. Below is the memory utilization observed over 1000 hits to pipeline.annotate. Observe that with both StanfordCoreNLP.clearAnnotatorPool() and System.gc(), memory utilization hovers just above 40 MB.

    [Chart: used memory with and without annotator-pool clearing, with explicit GC call, over 1000 hits to pipeline.annotate]
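The used-memory figure plotted in the charts can be sampled from the standard Runtime API. Here is a minimal, CoreNLP-independent sketch of that measurement (the class name UsedMemoryProbe and the 50 MB ballast are mine, for illustration only):

```java
public class UsedMemoryProbe {
    // Used heap in MB: the metric plotted in the charts above.
    static long usedMemoryMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        long before = usedMemoryMb();
        byte[] ballast = new byte[50 * 1024 * 1024]; // ~50 MB, kept reachable
        long after = usedMemoryMb();
        System.gc(); // explicit GC, as in the fix above; ballast is live, so it survives
        System.out.println("before=" + before + " MB, after=" + after
                + " MB, post-gc=" + usedMemoryMb() + " MB");
    }
}
```

Sampling this inside the annotate loop (say every 100th iteration) reproduces the kind of curves shown above.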

    If I understand this rightly, it signifies that the JVM's default GC runs do not release as much memory as an explicit call does. There are surely timing differences at play, which only means one needs to research further into the default GC of the JDK being used (I am on JDK 21 with the IDE enforcing a JDK 17 compliance level; the graphs are almost identical when run directly on JDK 17) and how this changes across GC strategies.
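As a starting point for that research, the collector can be selected and its activity logged with standard HotSpot flags (app.jar here is a placeholder for your application):

```shell
# G1 is the default collector on recent JDKs; -Xlog:gc prints each collection
java -XX:+UseG1GC -Xlog:gc -jar app.jar

# ZGC tends to return unused heap to the OS more eagerly
java -XX:+UseZGC -Xlog:gc -jar app.jar
```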

    But I'm happy with this and will conclude here. Hope this helps someone!