I'm using Stanza's lemmatizer because it works well with Spanish texts, but it returns a whole dictionary with the ID and other attributes I don't care about for the time being. I checked the "processors" in the pipeline, but I can't find an example where I just get the sentence with the lemmatized text instead of the dictionary.
This is what I have:
import stanza
stanza.download('es', package='ancora', processors='tokenize,mwt,pos,lemma', verbose=False)
stNLP = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', lang='es', use_gpu=True)
stNLP('me hubiera gustado mas “sincronia” con la primaria')
Output:
[
[
{
"id": 1,
"text": "me",
"lemma": "yo",
"upos": "PRON",
"xpos": "pp1cs000",
"feats": "Case=Dat|Number=Sing|Person=1|PrepCase=Npr|PronType=Prs",
"start_char": 0,
"end_char": 2
},
....
Of course, when I lemmatize my document it returns a lot of text I don't need at the moment. How can I obtain just the lemmas? I'm aware I could extract them from the dictionary, but processing is already slow as it is, and I want to avoid giving the function extra work.
Thank you in advance.
I'm not entirely sure, but from what I've seen, Stanza's pipeline returns a document with a nested structure: the document holds sentences, each sentence holds tokens, and each token holds one or more words. Each word carries attributes such as id, text, and lemma, which is what shows up as a dictionary when the result is printed.
Extracting the lemmas is just a matter of navigating this nested structure. Here's how I've done it:
import stanza

stanza.download('es', package='ancora', processors='tokenize,mwt,pos,lemma', verbose=False)
stNLP = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', lang='es', use_gpu=True)
doc = stNLP('me hubiera gustado mas “sincronia” con la primaria')
# Walk every token in the document and collect the lemma of each of its words
lemmas = [word.lemma for t in doc.iter_tokens() for word in t.words]
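Since the question asks for the sentence with the lemmatized text and not just a flat list, here is a minimal sketch of how the lemmas could be joined back into strings. The helper names lemmas_per_sentence and lemmatized_text are my own, not part of Stanza; the sketch only assumes that the document exposes .sentences, that each sentence exposes .words, and that each word has .text and .lemma. Falling back to the surface text when a lemma is missing is my choice, not something Stanza does for you.

```python
def lemmas_per_sentence(doc):
    """Return one space-joined string of lemmas per sentence.

    Hypothetical helper: `doc` is expected to look like the object
    returned by a Stanza pipeline (doc.sentences -> sentence.words ->
    word.lemma / word.text).
    """
    result = []
    for sentence in doc.sentences:
        # Use the surface form when no lemma was assigned (assumption).
        parts = [word.lemma if word.lemma else word.text
                 for word in sentence.words]
        result.append(" ".join(parts))
    return result


def lemmatized_text(doc):
    """Return the whole document as a single lemmatized string."""
    return " ".join(lemmas_per_sentence(doc))


# Usage with the pipeline from above (not executed here):
#   doc = stNLP('me hubiera gustado mas “sincronia” con la primaria')
#   print(lemmatized_text(doc))
```

This only reads attributes that the pipeline has already computed, so it should add very little work on top of the lemmatization itself. Note that joining with spaces does not restore the original spacing around punctuation.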
Note: At the time of writing, I'm using stanza==1.7.0.