I'm using Stanford NLP to do POS tagging for Spanish texts. I can get a POS tag for each word, but I notice that I am only given the first four sections of the AnCora tag; the last three sections, for person, number and gender, are missing.
Why does Stanford NLP only use a reduced version of the AnCora tag?
Is it possible to get the entire tag using Stanford NLP?
Here is my code (please excuse the JRuby...):
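require 'java'  # the Stanford CoreNLP and Spanish model jars must be on the classpath
java_import 'edu.stanford.nlp.pipeline.StanfordCoreNLP'
java_import 'edu.stanford.nlp.pipeline.Annotation'
# Configure a Spanish pipeline: tokenizer language plus Spanish NER, POS and parser models
props = java.util.Properties.new()
props.put("tokenize.language", "es")
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
props.put("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.put("pos.model", "/stanford-postagger-full-2015-01-30/models/spanish-distsim.tagger")
props.put("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
pipeline = StanfordCoreNLP.new(props)
annotation = Annotation.new("No sé qué estoy haciendo. Me pregunto si esto va a funcionar.")
pipeline.annotate(annotation)  # run the annotators over the text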
I am getting this as the output:
[Text=No CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=rn Lemma=no NamedEntityTag=O]
[Text=sé CharacterOffsetBegin=3 CharacterOffsetEnd=5 PartOfSpeech=vmip000 Lemma=sé NamedEntityTag=O]
[Text=qué CharacterOffsetBegin=6 CharacterOffsetEnd=9 PartOfSpeech=pt000000 Lemma=qué NamedEntityTag=O]
[Text=estoy CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=vmip000 Lemma=estoy NamedEntityTag=O]
[Text=haciendo CharacterOffsetBegin=16 CharacterOffsetEnd=24 PartOfSpeech=vmg0000 Lemma=haciendo NamedEntityTag=O]
[Text=. CharacterOffsetBegin=24 CharacterOffsetEnd=25 PartOfSpeech=fp Lemma=. NamedEntityTag=O]
(I notice that the lemmas are also incorrect, but that's probably an issue for a separate question. Never mind, I see that Stanford NLP does not support Spanish lemmatization.)
Why does Stanford NLP only use a reduced version of the AnCora tag?
This was a practical decision made to ensure high tagging accuracy. (Retaining morphological information in the tags caused the entire tagger to suffer from data sparsity, and to do worse across the board, not only on morphological annotation.)
Is it possible to get the entire tag using Stanford NLP?
No. You could get quite far doing this with a simple rule-based system, though, or use the Stanford Classifier to train your own morphological annotator. (Feel free to share your code if you pick either path!)
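To give a rough idea of what the rule-based route might look like, here is a minimal sketch in plain Ruby (it also runs under JRuby). It only handles present-indicative verbs, i.e. reduced tags of the form vmip000, and fills in the person and number slots from a small suffix table plus a lookup of irregular forms. The helper name fill_verb_morphology, the IRREGULAR table and the PRESENT_SUFFIXES table are all made up for this illustration and are not part of Stanford NLP; the rules are far from complete and will mislabel forms they were not written for.

```ruby
# Hypothetical lookup for irregular forms that the suffix rules cannot handle.
# Each entry maps a word form to [person, number].
IRREGULAR = {
  "sé"    => ["1", "s"],
  "estoy" => ["1", "s"],
  "soy"   => ["1", "s"],
  "voy"   => ["1", "s"],
  "va"    => ["3", "s"]
}.freeze

# Hypothetical suffix rules for regular present-indicative forms.
# Longer suffixes come first so that "amos" wins over "a", and so on.
PRESENT_SUFFIXES = [
  ["amos", "1", "p"], ["emos", "1", "p"], ["imos", "1", "p"],
  ["áis",  "2", "p"], ["éis",  "2", "p"], ["ís",   "2", "p"],
  ["an",   "3", "p"], ["en",   "3", "p"],
  ["as",   "2", "s"], ["es",   "2", "s"],
  ["o",    "1", "s"],
  ["a",    "3", "s"], ["e",    "3", "s"]
].freeze

# Takes a word and its reduced tag and returns a tag with the person and
# number slots filled in. Tags that are not present-indicative verbs, and
# words that match no rule, are returned unchanged.
def fill_verb_morphology(word, tag)
  return tag unless tag =~ /\Av[mas]ip000\z/

  w = word.downcase
  person, number = IRREGULAR[w]
  if person.nil?
    rule = PRESENT_SUFFIXES.find { |suffix, _, _| w.end_with?(suffix) }
    return tag if rule.nil?
    _, person, number = rule
  end

  # Keep the first four slots, then person, number, and a 0 for gender.
  tag[0, 4] + person + number + "0"
end

puts fill_verb_morphology("estoy", "vmip000")
puts fill_verb_morphology("pregunto", "vmip000")
puts fill_verb_morphology("haciendo", "vmg0000")
```

You would call this on each token's word and PartOfSpeech value after the pipeline has run. Other tenses, moods, and the gender and number of nouns, adjectives and participles would each need their own rules, which is where a trained classifier might start to pay off.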