We've been using Stanford CoreNLP for a while, and most of the time it has delivered correct results.
But for certain sentences the dependency parses come out wrong. As far as we can tell, some of these errors are caused by POS tagging issues, e.g. the word "like" in "I really like this restaurant.", or the word "ambient" in "Very affordable and excellent ambient!".
Yes, we are dealing with user reviews, whose wording may differ slightly from the corpus Stanford CoreNLP was trained on, so we are thinking of annotating some text ourselves and mixing it with the existing model. For NER we already have our own model for special NEs, but for POS tagging and dependency parsing we have no clue where to start.
Could anyone provide any suggestions?
The best thing to do is use CoNLL-U data.
There are English treebanks available here: https://universaldependencies.org/
There are examples of properties files for various part-of-speech models we've trained here (also in the models jars):
https://github.com/stanfordnlp/CoreNLP/tree/master/scripts/pos-tagger
Here is an example part-of-speech training command:
java -Xmx10g edu.stanford.nlp.tagger.maxent.MaxentTagger -props custom.props
Note that you want to use this format to specify the CoNLL-U files used for training and evaluation:
trainFile = format=TSV,wordColumn=1,tagColumn=3,/path/to/train.conllu
Here you are specifying that the input is a tab-separated file (one token per line, with an empty line marking sentence breaks), and which columns hold the word and the tag, respectively.
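Putting that together, a minimal custom.props could look roughly like this (a sketch rather than an official recipe: the property names follow the shipped example files, while the output path and the arch feature string are illustrative placeholders you should adapt):
# where to write the trained tagger model
model = /path/to/english-custom.tagger
# feature templates; copy a sensible value from one of the shipped English props files
arch = words(-1,1),unicodeshapes(-1,1),order(2),suffix(4)
# CoNLL-U training data, using the TSV column mapping described above
trainFile = format=TSV,wordColumn=1,tagColumn=3,/path/to/train.conllu
encoding = UTF-8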
Here is an example command for training the dependency parser:
java -Xmx10g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile <trainPath> -devFile <devPath> -embedFile <wordEmbeddingFile> -embeddingSize <wordEmbeddingDimensionality> -model nndep.model.txt.gz
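Filled in with the CoNLL-U files and, say, 100-dimensional embeddings, that could look like the following (all paths and the dimensionality are placeholders; the embedding file is typically a plain-text file with one word and its space-separated vector per line):
java -Xmx10g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile /path/to/train.conllu -devFile /path/to/dev.conllu -embedFile /path/to/embeddings.txt -embeddingSize 100 -model nndep.model.txt.gz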
One thing to be aware of is the distinction between a UPOS tag and an XPOS tag. The UPOS tags are expected in column 3 and the XPOS tags in column 4 (zero-based column indices, matching the wordColumn/tagColumn settings above). UPOS tags are universal across languages, while XPOS tags are fine-grained and language-specific. The -cPOS flag tells the training process to use the UPOS tags in column index 3; if you don't add this flag, it will use column index 4 by default, as in the example command.
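To make the column layout concrete, here is a hand-annotated sketch of one of your review sentences in CoNLL-U form (the real file must be tab-separated; counting from zero, the word is in column 1, the UPOS in column 3, and the XPOS in column 4):
# text = I really like this restaurant.
1   I           I           PRON   PRP   _   3   nsubj    _   _
2   really      really      ADV    RB    _   3   advmod   _   _
3   like        like        VERB   VBP   _   0   root     _   _
4   this        this        DET    DT    _   5   det      _   _
5   restaurant  restaurant  NOUN   NN    _   3   obj      _   SpaceAfter=No
6   .           .           PUNCT  .     _   3   punct    _   _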
This command should work and train a model properly on CoNLL-U data if you use the latest Stanford CoreNLP code from GitHub. If you are using the 3.9.2 release, you will need to convert your data from CoNLL-U to CoNLL-X, an older format that does not include info about multi-word tokens.
Also, for your models to perform optimally, make sure the tokenization used in your overall application is consistent with the training data.
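Once both models are trained, you can point a pipeline at them so the same tokenizer and the custom models are used end to end. A minimal sketch (pos.model and depparse.model are the standard pipeline properties; the paths are placeholders):
import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CustomModelsDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
    // point the pipeline at the retrained models (paths are placeholders)
    props.setProperty("pos.model", "/path/to/english-custom.tagger");
    props.setProperty("depparse.model", "/path/to/nndep.model.txt.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("Very affordable and excellent ambient!");
    pipeline.annotate(doc);
    for (CoreSentence sentence : doc.sentences()) {
      // print the tags and the dependency parse produced by the custom models
      System.out.println(sentence.posTags());
      System.out.println(sentence.dependencyParse());
    }
  }
}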