
Stanford CoreNLP: processing many files with a script


UPDATE

dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_4/*/*/*/*.txt; do
    [[ $f == *.xml ]] && continue # skip output files
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "$f" -outputDirectory .  
done

This one seems to work better, but I'm getting an IO exception: file name too long. What is that about, and how do I fix it?

I guess the other command in the documentation is dysfunctional.


I was trying to use this script to process my corpus with Stanford CoreNLP, but I keep getting the error

Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP

This is the script:

dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
    [[ $f == *.xml ]] && continue # skip output files
    java -mx600m -cp $dir/Code/CoreNLP/stanford-corenlp-full-2015-01-29/stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g /Users/matthew/Workbench/Code/CoreNLP/stanford-corenlp-full-2015-01-29/edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file "$f" -outputDirectory $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/. 
done

A very similar one worked for Stanford NER; that one looked like this:

dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
    [[ $f == *_NER.txt ]] && continue # skip output files
    g="${f%.txt}_NER.txt"
    java -mx600m -cp $dir/Code/StanfordNER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier $dir/Code/StanfordNER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile "$f" -outputFormat inlineXML > "$g"
done

I can't figure out why I keep getting that error; it seems I've specified all the paths correctly.

I know there's the -filelist option, which points to a file whose content lists all files to be processed (one per line),

but I don't know exactly how that would work in my situation, since my directory structure looks like $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt, with many files to be processed inside each of those directories.

Also, is it possible to specify -outputDirectory dynamically? The docs say you may specify an alternate output directory with that flag, but it seems like it would be set once and then remain static, which would be a nightmare scenario in my case.

I thought maybe I could just write some code to do this, but that also doesn't work. This is what I tried:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public static void main(String[] args) throws Exception 
{

    BufferedReader br = new BufferedReader(new FileReader("/home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2005/01/01/1638802_output.txt"));
    try 
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();

        while (line != null) 
        {

            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        String everything = sb.toString();
        //System.out.println(everything);

        Annotation doc = new Annotation(everything);

        StanfordCoreNLP pipeline;

        // creates a StanfordCoreNLP object with tokenization and
        // sentence splitting (the only annotators configured below)
        Properties props = new Properties();

        // configure pipeline
        props.put(
                  "annotators", 
                  "tokenize, ssplit"
                  );

        pipeline = new StanfordCoreNLP(props);

        pipeline.annotate(doc);

        System.out.println( doc );

    }
    finally 
    {
        br.close();
    }

}

Solution

  • By far the best way to process a lot of files with Stanford CoreNLP is to arrange to load the system once - since loading all the various models takes 15 seconds or more, depending on your computer, before any actual document processing is done - and then to process a bunch of files with it. What you have in your update doesn't do that, because running CoreNLP is inside the for loop. (Incidentally, that also explains your file name too long IO exception: -filelist expects a file containing one filename per line, so when you pass it a document, CoreNLP tries to open each line of text as a file, and long lines aren't valid file names.) A good solution is to use the for loop to make a file list, and then to run CoreNLP once on the file list. The file list is just a text file with one filename per line, so you can make it any way you want (using a script, an editor macro, or typing it in yourself), and you can and should check that its contents look correct before running CoreNLP. For your example, based on your update example, the following should work:

    dir=/Users/matthew/Workbench
    for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
        echo "$f" >> filelist.txt
    done
    # You can here check that filelist.txt has in it the files you want
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist filelist.txt
    # By default output files are written to the current directory, so you don't need to specify -outputDirectory .
    
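    If the glob ever gets unwieldy, an equivalent way to build and sanity-check the list is with find. The snippet below is just an alternative sketch: the corpus path is copied from your scripts, and excluding *_NER.txt files is an assumption, in case your earlier NER outputs live in the same tree.

    # Alternative sketch: build the file list with find instead of a shell glob.
    find /Users/matthew/Workbench/Data/NYTimes/NYTimesCorpus_3 -name '*.txt' \
        ! -name '*_NER.txt' > filelist.txt   # skipping NER outputs is an assumption
    # Quick sanity checks before the (slow) CoreNLP run:
    wc -l filelist.txt    # how many files will be processed
    head filelist.txt     # eyeball the first few paths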

    Other notes on earlier tries:

    • -mx600m isn't a reasonable way to run the full CoreNLP pipeline (right through parsing and coref). The sum of all its models is just too large. -mx2g is fine.
    • The best way above doesn't fully extend to the NER case. Stanford NER doesn't take a -filelist option, and if you use -textFiles then the files are concatenated and become one output file, which you probably don't want. At present, for NER, you may well need to run it inside the for loop, as in your script for that.
    • I haven't quite decoded how you're getting the error Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP, but it's happening because you're putting a String (a filename?) like that (perhaps with slashes rather than periods) where the java command expects a class name. In that place, there should only be edu.stanford.nlp.pipeline.StanfordCoreNLP, as in your updated script or mine.
    • You can't have a dynamic outputDirectory in one call to CoreNLP. You could get the effect that I think you want reasonably efficiently by making one call to CoreNLP per directory, using two nested for loops. The outer for loop would iterate over directories; the inner one would make a file list from all the files in that directory, which would then be processed in one call to CoreNLP and written to an appropriate output directory based on the input directory of the outer for loop. A minimal sketch of that approach is given at the end of this answer.
    • You can certainly also write your own code to call CoreNLP, but then you're responsible for scanning input directories and writing to appropriate output files yourself. What you have looks basically okay, except that the line System.out.println( doc ); won't do anything useful - it just prints out the text you began with. You need something like:

      // write the annotated document as XML (and remember to close the stream)
      PrintWriter xmlOut = new PrintWriter("outputFileName.xml");
      pipeline.xmlPrint(doc, xmlOut);
      xmlOut.close();
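
    Finally, here is the nested-loop sketch promised above. It's only a sketch under assumptions: the year/month/day layout is taken from your question, the output root NYTimesCorpus_3_out is hypothetical (adjust both paths), and it should be run from the CoreNLP distribution directory so that -cp "*" picks up the jars.

      #!/bin/bash
      # Sketch: one CoreNLP run per leaf directory, so the models load once
      # per directory instead of once per file. Both paths are assumptions.
      corpus=/Users/matthew/Workbench/Data/NYTimes/NYTimesCorpus_3
      out_root=/Users/matthew/Workbench/Data/NYTimes/NYTimesCorpus_3_out  # hypothetical

      for d in "$corpus"/*/*/*/; do                # outer loop: each leaf directory
          list=$(mktemp)
          for f in "$d"*.txt; do                   # inner loop: this directory's files
              [[ -e $f ]] || continue              # glob matched nothing; skip
              [[ $f == *_NER.txt ]] && continue    # skip earlier NER output files
              echo "$f" >> "$list"
          done
          outdir="$out_root/${d#"$corpus"/}"       # mirror the input structure
          mkdir -p "$outdir"
          [[ -s $list ]] && java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP \
              -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \
              -filelist "$list" -outputDirectory "$outdir"
          rm -f "$list"
      done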