parsing, stanford-nlp

CoreNLP CoNLL format with CollapsedCCProcessedDependenciesAnnotation


I am using a recent version of CoreNLP.

My task is to parse a text and get the output in CoNLL format using CollapsedCCProcessedDependenciesAnnotation.

I run the following command:

time java -cp $CoreNLP/javanlp-core.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props $CoreNLP/config.properties -file 12309959  -outputFormat conll


My config.properties contains:

depparse.model = english_SD.gz

The problem is how to get CollapsedCCProcessedDependenciesAnnotation.

I tried to use depparse.extradependencies in config.properties, but according to the javadoc there is no option for CCProcessedDependenciesAnnotation: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/GrammaticalStructure.Extras.html#REF_ONLY_COLLAPSED

Can you suggest how I can get CoNLL output with CollapsedCCProcessedDependenciesAnnotation?


Solution

  • You can retrieve the CC-processed dependencies programmatically.

    This question should serve as a good example (see the code in the example using the CollapsedCCProcessedDependenciesAnnotation).


    Gabor's answer from the mailing list explains this behavior very well (i.e., why you can't output collapsed dependencies directly):

    Note that in general the collapsed cc processed dependencies won't output losslessly to conll though, as the format expects a tree (every word has a unique parent), and the dependencies can have multiple heads.

    The output formatter therefore uses the basic dependencies only: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/CoNLLOutputter.java#L118. This could be changed in the code without crashing anything, but the serialized trees would be missing some edges, and ties for which edges are included would be broken somewhat arbitrarily. You may be better off writing your own logic for dumping to conll to fit your particular use case (you can probably copy much of our conll outputter code from above).
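To illustrate the "write your own logic" suggestion, here is a minimal sketch of how a graph with multiple heads per token could be flattened into CoNLL-style rows. It is deliberately independent of CoreNLP: the `Edge` class, the `toCoNLL` method, the five-column layout, and the tie-breaking rule (lowest governor index wins) are all assumptions made for illustration, not CoreNLP's own outputter. In a real program the edges would come from the `SemanticGraph` stored under `SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation` on each sentence. The last column keeps every head as `head:relation` pairs, loosely modeled on the DEPS column of CoNLL-U, so that no edge is silently dropped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch: flatten a dependency graph with multiple heads per token into CoNLL-style rows. */
public class CoNLLFlattener {

    /** One dependency edge (hypothetical type). Indices are 1-based; governor 0 is the artificial root. */
    public static final class Edge {
        final int governor;
        final int dependent;
        final String relation;

        public Edge(int governor, int dependent, String relation) {
            this.governor = governor;
            this.dependent = dependent;
            this.relation = relation;
        }
    }

    /**
     * Returns one tab-separated row per token:
     * index, word, chosen head, chosen relation, all heads as "head:relation|head:relation".
     * Tokens without an incoming edge (e.g. a collapsed conjunction) get "_" in the last three columns.
     */
    public static String toCoNLL(List<String> words, List<Edge> edges) {
        StringBuilder out = new StringBuilder();
        for (int i = 1; i <= words.size(); i++) {
            // Collect every edge that points at token i.
            List<Edge> incoming = new ArrayList<>();
            for (Edge e : edges) {
                if (e.dependent == i) {
                    incoming.add(e);
                }
            }
            // Assumed tie-breaking policy: lowest governor index first, then relation name.
            incoming.sort(Comparator.comparingInt((Edge e) -> e.governor)
                                    .thenComparing(e -> e.relation));

            String head = "_";
            String relation = "_";
            StringBuilder allHeads = new StringBuilder();
            for (Edge e : incoming) {
                if (allHeads.length() > 0) {
                    allHeads.append('|');
                }
                allHeads.append(e.governor).append(':').append(e.relation);
            }
            if (!incoming.isEmpty()) {
                head = String.valueOf(incoming.get(0).governor);
                relation = incoming.get(0).relation;
            }

            out.append(i).append('\t')
               .append(words.get(i - 1)).append('\t')
               .append(head).append('\t')
               .append(relation).append('\t')
               .append(allHeads.length() == 0 ? "_" : allHeads.toString())
               .append('\n');
        }
        return out.toString();
    }
}
```

For the sentence "Dogs and cats sleep" with the edges nsubj(4, 1), conj_and(1, 3), nsubj(4, 3) and root(0, 4), the row for "cats" gets head 1 with relation conj_and, while the last column keeps both heads as 1:conj_and|4:nsubj. The token "and" has no incoming edge after collapsing, so its head columns are "_". A different tie-breaking rule (for example, preferring the edge that also exists in the basic dependencies) may suit a particular use case better.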