I am using a recent version of CoreNLP.
My task is to parse a text and get the output in CoNLL format with CollapsedCCProcessedDependenciesAnnotation.
I run the following command:
time java -cp $CoreNLP/javanlp-core.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props $CoreNLP/config.properties -file 12309959 -outputFormat conll
My config.properties contains:
depparse.model = english_SD.gz
The problem is how to get CollapsedCCProcessedDependenciesAnnotation.
I tried setting depparse.extradependencies in config.properties, but there is no value for CCProcessedDependenciesAnnotation according to
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/GrammaticalStructure.Extras.html#REF_ONLY_COLLAPSED
Can you suggest how I can get CoNLL output with CollapsedCCProcessedDependenciesAnnotation?
You can retrieve the CC-processed dependencies programmatically.
This question should serve as a good example (see the code in the example using the CollapsedCCProcessedDependenciesAnnotation).
Gabor's answer from the mailing list explains this behavior very well (i.e., why you can't output collapsed dependencies directly):
Note that in general the collapsed cc processed dependencies won't output losslessly to conll though, as the format expects a tree (every word has a unique parent), and the dependencies can have multiple heads.
The output formatter therefore uses the basic dependencies only: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/CoNLLOutputter.java#L118. This could be changed in the code without crashing anything, but the serialized trees would be missing some edges, and ties for which edges are included would be broken somewhat arbitrarily. You may be better off writing your own logic for dumping to conll to fit your particular use case (you can probably copy much of our conll outputter code from above).
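To make the multiple-heads point from the quote concrete, here is a minimal sketch of a custom dump in plain Java. It does not use the CoreNLP API: `ConllSketch`, `Edge`, and `toConll` are hypothetical names, and the four-column layout (index, word, head, relation) is an assumption, not what CoNLLOutputter writes. The idea is to emit one row per incoming edge, so a token with two heads gets two rows instead of having one edge dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of CoreNLP: dumps a dependency graph as CoNLL-style rows. */
final class ConllSketch {

    /** One dependency edge; token indices are 1-based and head 0 denotes the artificial root. */
    static final class Edge {
        final int head;
        final int dependent;
        final String relation;

        Edge(int head, int dependent, String relation) {
            this.head = head;
            this.dependent = dependent;
            this.relation = relation;
        }
    }

    /**
     * Writes tab-separated rows: index, word, head, relation.
     * A token with several heads yields several rows, so no edge is lost.
     * A token with no incoming edge (e.g. a collapsed conjunction) gets "_" placeholders.
     */
    static String toConll(List<String> tokens, List<Edge> edges) {
        StringBuilder out = new StringBuilder();
        for (int i = 1; i <= tokens.size(); i++) {
            String word = tokens.get(i - 1);

            // Collect every edge that points at token i, in input order.
            List<Edge> incoming = new ArrayList<>();
            for (Edge e : edges) {
                if (e.dependent == i) {
                    incoming.add(e);
                }
            }

            if (incoming.isEmpty()) {
                out.append(i).append('\t').append(word).append("\t_\t_\n");
            }
            for (Edge e : incoming) {
                out.append(i).append('\t').append(word).append('\t')
                   .append(e.head).append('\t').append(e.relation).append('\n');
            }
        }
        return out.toString();
    }
}
```

For "Bill and Mary left" with the edges nsubj(left, Bill), conj_and(Bill, Mary), nsubj(left, Mary), and root(ROOT, left), the token "Mary" appears on two rows (heads 1 and 4), and "and" has no head at all. That is exactly the situation a strict one-head-per-token CoNLL file cannot represent. In real code you would fill the edge list from the SemanticGraph stored under CollapsedCCProcessedDependenciesAnnotation.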