UPDATE
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_4/*/*/*/*.txt; do
[[ $f == *.xml ]] && continue # skip output files
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "$f" -outputDirectory .
done
This one seems to work better, but I'm getting an "IO exception: file name too long" error. What is that about, and how do I fix it? I guess the other command in the documentation is dysfunctional.
I was trying to use this script to process my corpus with Stanford CoreNLP, but I keep getting the error
Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP
This is the script:
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
[[ $f == *.xml ]] && continue # skip output files
java -mx600m -cp $dir/Code/CoreNLP/stanford-corenlp-full-2015-01-29/stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g /Users/matthew/Workbench/Code/CoreNLP/stanford-corenlp-full-2015-01-29/edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file "$f" -outputDirectory $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/.
done
A very similar script worked for the Stanford NER; that one looked like this:
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
[[ $f == *_NER.txt ]] && continue # skip output files
g="${f%.txt}_NER.txt"
java -mx600m -cp $dir/Code/StanfordNER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier $dir/Code/StanfordNER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile "$f" -outputFormat inlineXML > "$g"
done
I can't figure out why I keep getting that error; it seems I've specified all the paths correctly.
I know there's the -filelist parameter, which "points to a file whose content lists all files to be processed (one per line)", but I don't know exactly how that would work in my situation, since my directory structure looks like this: $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt, within which there are many files to be processed.
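I guess I could generate such a file list with find, maybe something like this (untested; find and -name are standard, but I'm not sure this is the intended usage):

find $dir/Data/NYTimes/NYTimesCorpus_3 -name '*.txt' > filelist.txt

But even then I don't see how each file's output would end up back in its own directory.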
Also, is it possible to dynamically specify -outputDirectory? The docs say "You may specify an alternate output directory with the flag", but it seems like that would be set once per call and then be static, which would be a nightmare scenario in my case.
I thought maybe I could just write some code to do this instead, but that also doesn't work. This is what I tried:
public static void main(String[] args) throws Exception
{
    BufferedReader br = new BufferedReader(new FileReader("/home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2005/01/01/1638802_output.txt"));
    try
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null)
        {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        String everything = sb.toString();
        //System.out.println(everything);

        Annotation doc = new Annotation(everything);

        StanfordCoreNLP pipeline;

        // creates a StanfordCoreNLP object, with POS tagging, lemmatization,
        // NER, parsing, and coreference resolution
        Properties props = new Properties();

        // configure pipeline
        props.put("annotators", "tokenize, ssplit");

        pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(doc);
        System.out.println( doc );
    }
    finally
    {
        br.close();
    }
}
By far the best way to process a lot of files with Stanford CoreNLP is to arrange to load the system once - since loading all the various models takes 15 seconds or more, depending on your computer, before any actual document processing is done - and then to process a bunch of files with it. What you have in your update doesn't do that, because running CoreNLP is inside the for loop. A good solution is to use the for loop to make a file list, and then to run CoreNLP once on the file list. The file list is just a text file with one filename per line, so you can make it any way you want (using a script, editor macro, typing it in yourself), and you can and should check that its contents look correct before running CoreNLP. For your example, based on your update example, the following should work:
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
echo $f >> filelist.txt
done
# You can here check that filelist.txt has in it the files you want
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist filelist.txt
# By default output files are written to the current directory, so you don't need to specify -outputDirectory .
Other notes on earlier tries:

- -mx600m isn't a reasonable way to run the full CoreNLP pipeline (right through parsing and coref). The sum of all its models is just too large; -mx2g is fine.
- Stanford NER doesn't take the -filelist option, and if you use -textFiles then the files are concatenated and become one output file, which you may well not want. At present, for NER, you may well need to run it inside the for loop, as in your script for that.
- I'm not sure how you're getting the error Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP, but it's happening because you're putting a String (a filename?) like that (perhaps with slashes rather than periods) where the java command expects a class name. In that place, there should only be edu.stanford.nlp.pipeline.StanfordCoreNLP, as in your updated script or mine.
- You can only give one outputDirectory per call to CoreNLP. You could get the effect that I think you want reasonably efficiently by making one call to CoreNLP per directory, using two nested for loops: the outer for loop iterates over directories; the inner one makes a file list from all the files in that directory, which is then processed in one call to CoreNLP and written to an appropriate output directory based on the input directory in the outer for loop. Someone with more time or bash-fu than me could try to write that properly; a rough sketch follows at the end of this answer.
- You can certainly also write your own code to call CoreNLP, but then you're responsible for scanning input directories and writing to appropriate output files yourself. What you have looks basically okay, except that the line System.out.println( doc ); won't do anything useful - it just prints out the text you began with. You need something like:
PrintWriter xmlOut = new PrintWriter("outputFileName.xml");
pipeline.xmlPrint(doc, xmlOut);
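And here is a rough, untested sketch of the nested-loop idea mentioned above. It assumes the three-level directory layout from the question, and that the command is run from the CoreNLP distribution directory (so that -cp "*" picks up the jars):

dir=/Users/matthew/Workbench
for d in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/; do
    # build a file list for just this directory
    ls "$d"*.txt > filelist.txt
    # one CoreNLP run per directory; output lands next to the inputs
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist filelist.txt -outputDirectory "$d"
done

This loads the models once per directory rather than once per file, which is still a big win over the original script.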