Tags: nlp, stanford-nlp

Stanford NLP: OutOfMemoryError


I am annotating and analyzing a series of text files.

The pipeline.annotate method becomes increasingly slow each time it reads a file. Eventually, I get an OutOfMemoryError.

Pipeline is initialized ONCE:

protected void initializeNlp()
{
    Log.getLogger().debug("Starting Stanford NLP");

    // Creates a StanfordCoreNLP object with tokenization, sentence splitting,
    // POS tagging, lemmatization, NER (plus RegexNER), dependency parsing,
    // natural logic, and OpenIE
    Properties props = new Properties();

    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, depparse, natlog, openie");
    props.put("regexner.mapping", namedEntityPropertiesPath);

    pipeline = new StanfordCoreNLP(props);

    Log.getLogger().debug("\n\n\nStarted Stanford NLP Successfully\n\n\n");
}

I then process each file using the same instance of the pipeline (as recommended elsewhere on SO and by Stanford).

public void processFile(Path file)
{
    try
    {
        Instant start = Instant.now();

        // Read the file's text (cleanString was undefined in the original
        // snippet; here it is assumed to be the cleaned file contents)
        String cleanString = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);

        Annotation document = new Annotation(cleanString);
        Log.getLogger().info("ANNOTATE");
        pipeline.annotate(document);
        Long millis = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Annotation Duration in millis: " + millis);

        AnalyzedFile af = AnalyzedFileFactory.getAnalyzedFile(AnalyzedFileFactory.GENERIC_JOB_POST, file);

        processSentences(af, document);

        Log.getLogger().info("\n\n\nFile Processing Complete\n\n\n\n\n");

        Long millis1 = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Total Duration in millis: " + millis1);

        allFiles.put(file.toUri().toString(), af);
    }
    catch (Exception e)
    {
        Log.getLogger().debug(e.getMessage(), e);
    }
}

To be clear, I expect the problem is with my configuration. However, I am certain that the slowdown and memory issues occur at the pipeline.annotate(document) call.

I dispose of all references to Stanford-NLP objects other than pipeline (e.g., CoreLabel) after processing each file. That is, I do not keep references to any Stanford objects in my code beyond the method level.
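
(If it helps clarify the symptom, here is a minimal diagnostic sketch, not part of my pipeline code, that logs heap usage after each file via the standard java.lang.management API; Log is my own wrapper:)

// Confirms that memory accumulates across files rather than
// within a single annotate call
MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
long usedMb = memory.getHeapMemoryUsage().getUsed() / (1024 * 1024);
Log.getLogger().info("Heap used after file: " + usedMb + " MB");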

Any tips or guidance would be deeply appreciated.


Solution

  • OK, that last sentence of the question made me go back and double-check. The answer is that I WAS keeping references to CoreMap objects in one of my own classes. In other words, I was keeping in memory all the Trees, Tokens, and other analyses for every sentence in my corpus.

    In short: keep Stanford NLP CoreMaps only for as long as a given batch of sentences is being processed, copy out whatever results you need as plain values, and then drop the references so they can be garbage collected (see the sketch below).

    (I expect a hard-core computational linguist would say there is rarely any need to keep a CoreMap once it has been analyzed, but I have to declare my neophyte status here.)
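
    For example, here is a minimal sketch of the fix, assuming all that is needed downstream are words, lemmas, and NER tags (the extractTokens method and its String[] rows are my own illustration, not CoreNLP API):

    // Copy only the primitives you need out of each CoreMap, then let the
    // Annotation (and every Tree/CoreLabel hanging off it) become garbage
    public List<String[]> extractTokens(Annotation document)
    {
        List<String[]> rows = new ArrayList<>();
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class))
        {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class))
            {
                // Plain Strings only: no CoreMap/CoreLabel references survive
                rows.add(new String[] { token.word(), token.lemma(), token.ner() });
            }
        }
        return rows;
    }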