Efficient batch processing with Stanford CoreNLP

Is it possible to speed up batch processing of documents with CoreNLP from the command line so that the models load only once? I would like to trim any unnecessarily repeated steps from the process.

I have 320,000 text files and I am trying to process them with CoreNLP. The desired result is 320,000 XML output files, one per input file.

To get from one text file to one XML file, I use the CoreNLP jar file from the command line:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props
-file %%~f -outputDirectory MyOutput -outputExtension .xml -replaceExtension

This loads the models and does a variety of machine learning magic. The problem is that when I loop over every text file in a directory, the models are reloaded for each file, and by my estimate the whole run will take 44 days. I have had a command prompt looping on my desktop for the last 7 days and I'm nowhere near finished. The loop I run from a batch script:

for %%f in (Data\*.txt) do (
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props
    -file %%~f -outputDirectory Output -outputExtension .xml -replaceExtension
)

I am using these annotators, specified in the properties file:
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment


  • I know nothing about Stanford CoreNLP, so I googled it (you didn't include a link), and on this page I found the following description (under "Parsing a file and saving the output as XML"):

    If you want to process a list of files use the following command line:

    java -cp stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR CONFIGURATION FILE ] -filelist A FILE CONTAINING YOUR LIST OF FILES

    where the -filelist parameter points to a file whose content lists all files to be processed (one per line).

    So I guess that you may process your files faster if you store a list of all your text files in a list file:

    dir /B *.txt > list.lst

    ... and then pass that list in the -filelist list.lst parameter in a single execution of Stanford CoreNLP.
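
One caveat with `dir /B` is that it writes bare file names without the directory, so the list only resolves if CoreNLP is started from the folder that holds the text files. As a rough sketch (not something taken from the CoreNLP documentation), a small Python script could write absolute paths instead and assemble the single command to run. The helper names `write_filelist` and `corenlp_command` are made up for illustration, the flags are the ones already shown in the question and in the quoted documentation, and the properties file name is left for you to supply.

```python
from pathlib import Path


def write_filelist(data_dir, list_path, pattern="*.txt"):
    """Write one absolute path per line for each file in data_dir matching pattern.

    Hypothetical helper: returns the number of paths written.
    """
    files = sorted(p.resolve() for p in Path(data_dir).glob(pattern) if p.is_file())
    Path(list_path).write_text("".join(f"{p}\n" for p in files), encoding="utf-8")
    return len(files)


def corenlp_command(list_path, output_dir, props_file=None):
    """Build the argument list for ONE CoreNLP run over the whole file list.

    Hypothetical helper: it only assembles the command and does not run it.
    Uses -filelist instead of -file, so the JVM and the models start once.
    """
    cmd = ["java", "-cp", "*", "-Xmx2g", "edu.stanford.nlp.pipeline.StanfordCoreNLP"]
    if props_file is not None:
        # Assumption: pass whatever properties file you already use with -props.
        cmd += ["-props", str(props_file)]
    cmd += [
        "-filelist", str(list_path),
        "-outputDirectory", str(output_dir),
        "-outputExtension", ".xml",
        "-replaceExtension",
    ]
    return cmd
```

For example, `write_filelist("Data", "list.lst")` followed by `subprocess.run(corenlp_command("list.lst", "Output"), check=True)` from the folder containing the CoreNLP jars would replace the per-file loop with one invocation. Whether that alone brings the run time down far enough depends on how much of each iteration was model loading versus actual annotation.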