We've been using Stanford CoreNLP for a while, and most of the time it has delivered correct results.
But for certain sentences the dependency parses come out wrong. As far as we can tell, some of these errors are caused by POS tagging issues, e.g. the word "like" in "I really like this restaurant.", or the word "ambient" in "Very affordable and excellent ambient!".
Yes, we are dealing with user reviews, whose wording may differ slightly from the corpus Stanford CoreNLP was trained on, so we are thinking of annotating some text ourselves and mixing it with the existing model. For NER we already have our own model for special NEs, but for POS tagging and dependency parsing we have no clue where to start.
Could anyone provide any suggestions?
The best thing to do is use CoNLL-U data.
There are English treebanks available here: https://universaldependencies.org/
There are examples of properties files for various part-of-speech models we've trained here (also in the models jars):
https://github.com/stanfordnlp/CoreNLP/tree/master/scripts/pos-tagger
Here is an example part-of-speech training command:
java -Xmx10g edu.stanford.nlp.tagger.maxent.MaxentTagger -props custom.props
Note that you want to use this format to specify the CoNLL-U files used for training and evaluation:
trainFile = format=TSV,wordColumn=1,tagColumn=3,/path/to/train.conllu
Here you are specifying that the input is a tab-separated file (one token per line, with an empty line marking sentence breaks), and which columns hold the word and the tag, respectively.
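Putting that together, a minimal custom.props could look roughly like this (a sketch rather than an official recipe: the property names follow the shipped example files, while the output path and the arch feature string are illustrative placeholders you should adapt):
# where to write the trained tagger model
model = /path/to/english-custom.tagger
# feature templates; copy a sensible value from one of the shipped English props files
arch = words(-1,1),unicodeshapes(-1,1),order(2),suffix(4)
# CoNLL-U training data, using the TSV column mapping described above
trainFile = format=TSV,wordColumn=1,tagColumn=3,/path/to/train.conllu
encoding = UTF-8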
Here is an example command for training the dependency parser:
java -Xmx10g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile <trainPath> -devFile <devPath> -embedFile <wordEmbeddingFile> -embeddingSize <wordEmbeddingDimensionality> -model nndep.model.txt.gz
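Filled in with the CoNLL-U files and, say, 100-dimensional embeddings, that could look like the following (all paths and the dimensionality are placeholders; the embedding file is typically a plain-text file with one word and its space-separated vector per line):
java -Xmx10g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile /path/to/train.conllu -devFile /path/to/dev.conllu -embedFile /path/to/embeddings.txt -embeddingSize 100 -model nndep.model.txt.gz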
One thing to be aware of is the distinction between a UPOS tag and an XPOS tag. The UPOS tags are expected in column 3 and the XPOS tags in column 4 (zero-based column indices, matching the wordColumn/tagColumn settings above). UPOS tags are universal across languages, while XPOS tags are fine-grained and language-specific. The -cPOS flag tells the training process to use the UPOS tags in column index 3; if you don't add this flag, it will use column index 4 by default, as in the example command.
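To make the column layout concrete, here is a hand-annotated sketch of one of your review sentences in CoNLL-U form (the real file must be tab-separated; counting from zero, the word is in column 1, the UPOS in column 3, and the XPOS in column 4):
# text = I really like this restaurant.
1   I           I           PRON   PRP   _   3   nsubj    _   _
2   really      really      ADV    RB    _   3   advmod   _   _
3   like        like        VERB   VBP   _   0   root     _   _
4   this        this        DET    DT    _   5   det      _   _
5   restaurant  restaurant  NOUN   NN    _   3   obj      _   SpaceAfter=No
6   .           .           PUNCT  .     _   3   punct    _   _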
This command should work and train a model properly on CoNLL-U data if you use the latest Stanford CoreNLP code from GitHub. If you are using the 3.9.2 release, you will need to convert your data from CoNLL-U to CoNLL-X, an older format that does not include info about multi-word tokens.
Also, for your models to perform optimally, make sure the tokenization used in your overall application is consistent with the training data.
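Once both models are trained, you can point a pipeline at them so the same tokenizer and the custom models are used end to end. A minimal sketch (pos.model and depparse.model are the standard pipeline properties; the paths are placeholders):
import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CustomModelsDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
    // point the pipeline at the retrained models (paths are placeholders)
    props.setProperty("pos.model", "/path/to/english-custom.tagger");
    props.setProperty("depparse.model", "/path/to/nndep.model.txt.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("Very affordable and excellent ambient!");
    pipeline.annotate(doc);
    for (CoreSentence sentence : doc.sentences()) {
      // print the tags and the dependency parse produced by the custom models
      System.out.println(sentence.posTags());
      System.out.println(sentence.dependencyParse());
    }
  }
}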