Stanford NNDep parser: features used

Regarding Stanford’s neural network dependency parser, which features are used during the training and testing phases? In practice, which columns in a CoNLL-U/X-formatted data set could be replaced with _ without the parser losing any accuracy in training? Which columns are never read?

Certainly ID, FORM and HEAD (columns 1, 2 & 7) are a must, as most likely are U/C-POSTAG (column 4) and DEPREL (column 8). But what about LEMMA, (X)-POSTAG and FEATS (columns 3, 5 & 6)? Do they help during training, or is it irrelevant to the parser whether the treebank contains any information in them?

Solution

In the current implementation, we use only the following fields. Column numbers are 1-based.

FORM (column 2)
XPOSTAG / POSTAG (column 5) [^1]
HEAD (column 7)
DEPREL (column 8)

[^1]: If parsing with coarse part-of-speech tags (-cPOS), we read column 4 (UPOSTAG / CPOSTAG) instead.

Everything else can be null, so long as you don't break the CoNLL format (i.e., still include a _ in the null column).
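
To illustrate what "null but still well-formed" looks like, here is a minimal Python sketch that blanks the unused columns of one tab-separated CoNLL line. It is not part of CoreNLP: the helper name blank_unused_columns and the KEEP set are made up for this example. It keeps both POS columns (4 and 5) so the result works with or without -cPOS, and it keeps ID only for readability.

```python
# Hypothetical helper for illustration; not part of Stanford CoreNLP.
# 1-based columns to keep: ID, FORM, both POS columns, HEAD, DEPREL.
KEEP = {1, 2, 4, 5, 7, 8}


def blank_unused_columns(line, keep=KEEP):
    """Replace every column not in `keep` with '_' in one CoNLL line."""
    stripped = line.rstrip("\n")
    # Blank lines separate sentences and '#' lines are CoNLL-U comments;
    # both are returned unchanged.
    if not stripped or stripped.startswith("#"):
        return stripped
    fields = stripped.split("\t")
    # The column count is preserved, so the file stays valid CoNLL.
    return "\t".join(
        field if index in keep else "_"
        for index, field in enumerate(fields, start=1)
    )


row = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
print(blank_unused_columns(row))
```

In the sample row, LEMMA (dog) and FEATS (Number=Plur) become _, while every other column and the total column count stay as they were.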

See exactly which columns we read here: edu.stanford.nlp.parser.nndep.Util.loadConllFile. Note these are the same for both CoNLL-X and CoNLL-U representations.