java stanford-nlp

How to extract an unlabelled/untyped dependency tree from a TreeAnnotation using Stanford CoreNLP?


The target language is Spanish.

The English pipeline has support for typed dependencies whereas the Spanish pipeline, to my knowledge, does not.

The goal is to produce a dependency tree from a TreeAnnotation, with the end result being a list of directed edges. Is this possible with CoreNLP 3.4.1 and the Spanish models? If so, how?

Background

I'm using Stanford CoreNLP 3.4.1 with the 3.5.0 Spanish models for POS tagging (for compatibility reasons, Java 8 cannot be used yet), with the following configuration:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, ner, parse");
props.setProperty("tokenize.options", "invertible=true,ptb3Escaping=true");
props.setProperty("tokenize.language", "es");

props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger");
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz");

// Stanford Parser 3.4.1 shift-reduce model for Spanish.
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/spanishSR.ser.gz");

props.setProperty("ner.applyNumericClassifiers", "false");
props.setProperty("ner.useSUTime", "false");

These properties are then used to create the pipeline and annotate a document:

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);

List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

for(CoreMap sentence: sentences) {

    // ... extract start, end position of sentence ...

    for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {

        // ... extract POS tags, NER annotations, id ...
    }

    //This works, and I have a tree that is not empty.
    Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
}

By using a debugger I was able to examine both sentences and tokens and conclude that they have the following content:

Sentence (keys)

From edu.stanford.nlp.ling.CoreAnnotations:

  • TextAnnotation
  • CharacterOffsetBeginAnnotation
  • CharacterOffsetEndAnnotation
  • TokensAnnotation
  • TokenBeginAnnotation
  • TokenEndAnnotation
  • SentenceIndexAnnotation

From edu.stanford.nlp.trees.TreeCoreAnnotations:

  • TreeAnnotation

Tokens (keys)

From edu.stanford.nlp.ling.CoreAnnotations:

  • TextAnnotation
  • OriginalTextAnnotation
  • CharacterOffsetBeginAnnotation
  • CharacterOffsetEndAnnotation
  • BeforeAnnotation
  • AfterAnnotation
  • IndexAnnotation
  • SentenceIndexAnnotation
  • PartOfSpeechAnnotation
  • NamedEntityTagAnnotation

From edu.stanford.nlp.trees.TreeCoreAnnotations:

  • HeadWordAnnotation: in my experiments this always points to itself, i.e. the token from which the annotation is retrieved.
  • HeadTagAnnotation

Thanks in advance!


Solution

  • There is no support for Spanish dependency parsing in CoreNLP at the moment. This includes typed dependency conversion from constituency parses.

    There is a head finder implemented (but not fully tested). You could hack an untyped dependency converter using this head finder, but we have no guarantees that this will yield a sensible parse.
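
To make the "untyped dependency converter" idea concrete, here is a minimal sketch of head-based conversion: every non-head child of a phrase becomes a dependent of the phrase's head token. It deliberately does not use the CoreNLP classes. `Node`, `HEAD_LABEL`, the toy labels (S, NP, VP, D, N, V) and all method names are hypothetical stand-ins for illustration. In CoreNLP the head choice would presumably come from a `HeadFinder` implementation (its `determineHead(Tree)` method) rather than from the toy rule table used here, and the caveat above still applies: the edges are only as good as the head rules. The code avoids Java 8 features so that it compiles on Java 7.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UntypedDependencies {

    /** Toy constituency node; a stand-in for CoreNLP's Tree, not its API. */
    static class Node {
        final String label;        // phrase category, or POS tag for a token
        final int index;           // 1-based token position; 0 for phrases
        final List<Node> children; // empty for tokens

        Node(String label, int index, Node... children) {
            this.label = label;
            this.index = index;
            this.children = Arrays.asList(children);
        }

        boolean isToken() { return children.isEmpty(); }
    }

    // Hypothetical head rules: phrase label -> label of its head child.
    private static final Map<String, String> HEAD_LABEL = new HashMap<String, String>();
    static {
        HEAD_LABEL.put("S", "VP");
        HEAD_LABEL.put("VP", "V");
        HEAD_LABEL.put("NP", "N");
    }

    /** Picks the head child by rule, falling back to the leftmost child. */
    static Node headChild(Node phrase) {
        String wanted = HEAD_LABEL.get(phrase.label);
        for (Node child : phrase.children) {
            if (child.label.equals(wanted)) return child;
        }
        return phrase.children.get(0);
    }

    /** Follows head children down to the token that heads this subtree. */
    static Node headToken(Node node) {
        while (!node.isToken()) node = headChild(node);
        return node;
    }

    /** Returns {governor, dependent} token-index pairs for the whole tree. */
    static List<int[]> extractEdges(Node root) {
        List<int[]> edges = new ArrayList<int[]>();
        collect(root, edges);
        return edges;
    }

    private static void collect(Node node, List<int[]> edges) {
        if (node.isToken()) return;
        Node head = headChild(node);
        int governor = headToken(head).index;
        for (Node child : node.children) {
            collect(child, edges);
            // Each non-head child hangs off the head token of this phrase.
            if (child != head) {
                edges.add(new int[] {governor, headToken(child).index});
            }
        }
    }

    /** Formats edges as "governor->dependent" pairs separated by spaces. */
    static String format(List<int[]> edges) {
        StringBuilder sb = new StringBuilder();
        for (int[] edge : edges) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(edge[0]).append("->").append(edge[1]);
        }
        return sb.toString();
    }

    /** (S (NP (D El) (N perro)) (VP (V come) (NP (N carne)))) */
    static Node exampleTree() {
        return new Node("S", 0,
            new Node("NP", 0, new Node("D", 1), new Node("N", 2)),
            new Node("VP", 0, new Node("V", 3), new Node("NP", 0, new Node("N", 4))));
    }

    public static void main(String[] args) {
        System.out.println(format(extractEdges(exampleTree()))); // prints 2->1 3->2 3->4
    }
}
```

For the toy sentence "El perro come carne" this yields the directed edges perro→El, come→perro and come→carne, with "come" (token 3) left as the root. The sketch adds no artificial root edge; if one is needed, it could be added from index 0 to the head token of the whole tree.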