Regarding Stanford’s neural network dependency parser, which features are used during the training and testing phases? In practice, which columns in a CoNLL-U/X formatted data set could be replaced with _ without the parser losing any accuracy when training? Which columns are never read?
Certainly ID, FORM and HEAD (columns 1, 2 and 7) are a must, as most likely are U/C-POSTAG (column 4) and DEPREL (column 8). But what about the columns LEMMA, (X)-POSTAG and FEATS (columns 3, 5 and 6)? Do they help during training, or is it irrelevant to the parser whether the treebank contains any information in them?
In the current implementation, we only use the following fields. My column indexing begins from 1.
- FORM (column 2)
- UPOSTAG (column 4) [^1]
- HEAD (column 7)
- DEPREL (column 8)

[^1]: If parsing with coarse part-of-speech tags (-cPOS), we read column 5 instead.
Everything else can be null, so long as you don't break the CoNLL format (i.e., still include a _ in the null column).
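As a rough illustration of nulling the unused columns, here is a minimal Python sketch. It is not part of the parser: the function name `blank_unused_columns` and the `KEEP` set are hypothetical, and it assumes tab-separated columns, blank lines between sentences, and `#` comment lines as in CoNLL-U. It keeps both columns 4 and 5, since which one is read depends on -cPOS, and it keeps ID only so the file stays readable.

```python
# Hypothetical helper, not part of Stanford's parser.
# 1-based columns to keep: ID, FORM, both POS columns, HEAD, DEPREL.
# Assumption: ID is kept for readability; the answer does not list it as read.
KEEP = {1, 2, 4, 5, 7, 8}


def blank_unused_columns(conll_text, keep=KEEP):
    """Replace every column not in `keep` with '_' on each token line."""
    out = []
    for line in conll_text.splitlines():
        # Pass through sentence separators and CoNLL-U comment lines unchanged.
        if not line.strip() or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        cols = [c if i in keep else "_" for i, c in enumerate(cols, start=1)]
        out.append("\t".join(cols))
    return "\n".join(out)


if __name__ == "__main__":
    sample = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
    print(blank_unused_columns(sample))
```

The column count of each line is preserved, so the result remains a well-formed CoNLL file; only the contents of the unused columns change.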
See exactly which columns we read here: edu.stanford.nlp.parser.nndep.Util.loadConllFile. Note that these are the same for both the CoNLL-X and CoNLL-U representations.