
Stanford NNDep parser: features used


Regarding Stanford’s neural network dependency parser: which features are used during the training and testing phases? In practice, which columns in a CoNLL-X formatted data set could be replaced with _ without the parser losing any accuracy when training? Which columns are never read?

Certainly ID, FORM and HEAD (columns 1, 2 and 7) are a must, as most likely are U/C-POSTAG (column 4) and DEPREL (column 8). But what about the columns LEMMA, (X)-POSTAG and FEATS (columns 3, 5 and 6)? Do they help during training, or is it irrelevant to the parser whether the treebank contains any information in them?


Solution

  • In the current implementation, we only use the following fields (column indexing starts at 1):

    • FORM (column 2)
    • POSTAG (column 5; XPOSTAG in CoNLL-U) [^1]
    • HEAD (column 7)
    • DEPREL (column 8)

    [^1]: If parsing with coarse part-of-speech tags (-cPOS), we read column 4 (CPOSTAG; UPOSTAG in CoNLL-U) instead.

    Everything else can be null, so long as you don't break the CoNLL format (i.e., still include a _ in the null column).

    To see exactly which columns we read, look at edu.stanford.nlp.parser.nndep.Util.loadConllFile. Note that these are the same for both CoNLL-X and CoNLL-U representations.
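    To try this out on your own treebank, here is a minimal sketch that replaces every column the parser does not need with _ while keeping all ten tab-separated fields in place. It is not part of the Stanford tooling: the function name blank_unused_columns and its pos_column parameter are made up for illustration. Pass as pos_column the 1-based POS column that applies to your setup (see the footnote above). ID is kept only so the file stays readable; the answer above does not list it as a field the parser uses.

```python
def blank_unused_columns(lines, pos_column):
    """Replace CoNLL columns the parser does not need with "_".

    lines      -- iterable of lines from a CoNLL-X / CoNLL-U file
    pos_column -- 1-based index of the POS column to keep (4 or 5)
    """
    if pos_column not in (4, 5):
        raise ValueError("pos_column must be 4 or 5")

    # 1-based columns to keep: ID, FORM, the chosen POS column, HEAD, DEPREL.
    # ID is kept for readability only (an assumption of this sketch).
    keep = {1, 2, pos_column, 7, 8}

    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if len(fields) != 10:
            # Blank sentence separators and comment lines pass through as-is.
            cleaned.append(line)
            continue
        cleaned.append(
            "\t".join(
                value if column in keep else "_"
                for column, value in enumerate(fields, start=1)
            )
        )
    return cleaned


if __name__ == "__main__":
    sample = [
        "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
        "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_",
        "",
    ]
    for row in blank_unused_columns(sample, pos_column=5):
        print(row)
```

    Training once on the original file and once on the blanked file, then comparing attachment scores, is a simple way to confirm on your own data that the dropped columns make no difference.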