I am annotating and analyzing a series of text files.
The pipeline.annotate call becomes increasingly slow each time I process a file, and eventually I get an OutOfMemoryError.
The pipeline is initialized ONCE:
protected void initializeNlp()
{
    Log.getLogger().debug("Starting Stanford NLP");

    // Create a StanfordCoreNLP object with POS tagging, lemmatization, NER
    // (including RegexNER), dependency parsing, natural logic, and Open IE.
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, depparse, natlog, openie");
    props.put("regexner.mapping", namedEntityPropertiesPath);
    pipeline = new StanfordCoreNLP(props);

    Log.getLogger().debug("\n\n\nStarted Stanford NLP Successfully\n\n\n");
}
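(For context, the driver looks roughly like the following; aside from initializeNlp and processFile, the names are simplified placeholders rather than my actual code:)

import java.nio.file.Path;
import java.util.List;

// Initialize the pipeline once, then reuse the same instance for every file.
public void run(List<Path> files)
{
    initializeNlp();
    for (Path file : files)
    {
        processFile(file);
    }
}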
I then process each file using the same instance of the pipeline (as recommended elsewhere on SO and by Stanford).
public void processFile(Path file)
{
    try
    {
        Instant start = Instant.now();

        // cleanString holds the cleaned text of the file, prepared earlier (not shown)
        Annotation document = new Annotation(cleanString);
        Log.getLogger().info("ANNOTATE");
        pipeline.annotate(document);
        long millis = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Annotation Duration in millis: " + millis);

        AnalyzedFile af = AnalyzedFileFactory.getAnalyzedFile(AnalyzedFileFactory.GENERIC_JOB_POST, file);
        processSentences(af, document);
        Log.getLogger().info("\n\n\nFile Processing Complete\n\n\n\n\n");

        long totalMillis = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Total Duration in millis: " + totalMillis);

        allFiles.put(file.toUri().toString(), af);
    }
    catch (Exception e)
    {
        Log.getLogger().debug(e.getMessage(), e);
    }
}
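(For reference, cleanString above is the file's text read into a String and cleaned before annotation. That step is not shown; a rough, simplified stand-in, assuming UTF-8 input and that the cleaning amounts to whitespace normalization, would be:)

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Read the whole file as UTF-8 and collapse runs of whitespace into single spaces.
private String cleanFile(Path file) throws IOException
{
    String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    return raw.replaceAll("\\s+", " ").trim();
}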
To be clear, I expect the problem is with my configuration. However, I am certain that the stall and memory issues occur in the pipeline.annotate(document) call.
I release all references to Stanford NLP objects other than pipeline (e.g., CoreLabel) after processing each file. That is, I do not keep references to any Stanford objects in my code beyond the method level.
Any tips or guidance would be deeply appreciated.
OK, writing that last sentence of the question made me go back and double-check. The answer is that I WAS keeping references to CoreMap objects in one of my own classes. In other words, I was keeping in memory all the trees, tokens, and other analyses for every sentence in my corpus.
In short: hold on to Stanford NLP CoreMaps only for as many sentences as you actually need at once, then drop the references so they can be garbage-collected.
(I expect a hard-core computational linguist would say there is rarely any need to keep a CoreMap once it has been analyzed, but I have to declare my neophyte status here.)
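To make the fix concrete: rather than storing CoreMaps, copy out only the plain data you need and let the Annotation (and everything hanging off it) go out of scope. A minimal sketch of the pattern, extracting lemmas as an example (extractLemmas is illustrative, not my actual code):

import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Annotate the text and return only plain strings (lemmas, in this example).
// No CoreMap, CoreLabel, or Tree escapes this method, so the whole Annotation
// becomes eligible for garbage collection as soon as the call returns.
public static List<String> extractLemmas(StanfordCoreNLP pipeline, String text)
{
    Annotation document = new Annotation(text);
    pipeline.annotate(document);

    List<String> lemmas = new ArrayList<>();
    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class))
    {
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class))
        {
            // Copy the primitive data; do NOT store the token or sentence itself.
            lemmas.add(token.get(CoreAnnotations.LemmaAnnotation.class));
        }
    }
    return lemmas;
}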