
Missing StanfordNLP Universal Dependency features in Java CoreNLP


Using the latest CoreNLP 3.9.2 Java API, I wish to extract the Universal Dependencies features that appear in the StanfordNLP Python library, as defined at universaldependencies.org/guidelines.html. Specifically:

  1. Multiword tokens
  2. POS tags in Universal Dependencies format (UPOS)
  3. Grammatical dependencies in UD format (using UPOS tags)

The current CoreNLP produces Penn Treebank POS tags and dependencies as described here and here, respectively.

Pipeline config:

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp,quote");
    props.setProperty("coref.algorithm", "neural");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument document = new CoreDocument(text);
    pipeline.annotate(document);

    CoreSentence sentence = document.sentences().get(0);
    List<String> posTags = sentence.posTags();               // POS tags (Penn Treebank tag set)
    SemanticGraph dependencies = sentence.dependencyParse(); // dependency graph
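
For reference, the output of that pipeline can be inspected as below (continuing the snippet above; CoreLabel comes from edu.stanford.nlp.ling). The tags it prints are Penn Treebank tags such as NNP and VBZ rather than UPOS tags such as PROPN and VERB, and there is no multiword-token layer:

    // Per-token output: word, Penn Treebank POS tag, lemma.
    for (CoreLabel token : sentence.tokens()) {
        System.out.println(token.word() + "\t" + token.tag() + "\t" + token.lemma());
    }
    // Dependency graph printed as a list of reln(governor, dependent) entries.
    System.out.println(dependencies.toList());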

Any help or clarification of my misunderstandings would be much appreciated.


Solution

  • The GitHub version of the code and the models for French, German, and Spanish were trained on the CoNLL 2018 UD data and support multi-word tokens (see the configuration sketch at the end of this answer).

    We may or may not train an English UD part-of-speech model.

    I believe the constituency parser data uses English-specific part-of-speech tags.

    These changes will go into the 4.0.0 release, which will hopefully be out before the end of the year.
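
As a rough sketch of what this might look like once those models ship (assuming a 4.0.0-style build with the French models jar on the classpath, a bundled StanfordCoreNLP-french.properties defaults file, and the new "mwt" multi-word-token annotator, none of which are part of the 3.9.2 release), a UD-style French pipeline could be configured roughly like this:

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.CoreSentence;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.StringUtils;

    public class FrenchUdSketch {
        public static void main(String[] args) {
            // Load the (assumed) bundled French defaults, then request the
            // multi-word-token annotator so contractions like "du" are split.
            Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-french.properties");
            props.setProperty("annotators", "tokenize,ssplit,mwt,pos,lemma,depparse");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            CoreDocument document = new CoreDocument("Le chat du voisin dort.");
            pipeline.annotate(document);

            CoreSentence sentence = document.sentences().get(0);
            // With UD-trained models, the tags here should be UPOS values
            // (DET, NOUN, VERB, ...) and "du" should surface as "de" + "le".
            for (CoreLabel token : sentence.tokens()) {
                System.out.println(token.word() + "\t" + token.tag());
            }
            System.out.println(sentence.dependencyParse()); // UD dependency graph
        }
    }

This is only a sketch of the expected 4.0.0 setup; the annotator name, the properties file name, and the availability of UPOS and multi-word-token output all depend on the models described in the answer above.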