I am using a Stanford Stanza pipeline on some Italian text.
The problem I'm grappling with is that I need data from BOTH the Token and Word objects.
While I can access either one separately, I can't wrap my head around how to get data from both in a single loop over the Document -> Sentence hierarchy.
Specifically, I need Word data (such as lemma, upos and head), but I also need the corresponding start and end character offsets, which as far as I understand are stored in token.start_char and token.end_char.
Here's my code to test what I've achieved:
import stanza
IN_TXT = '''Il paziente Rossi e' stato ricoverato presso il nostro reparto a seguito di accesso
al pronto soccorso con diagnosi sospetta di aneurisma aorta
addominale sottorenale. In data 12/11/2022 e' stato sottoposto ad asportazione dell'aneurisma
con anastomosi aorto aortica con protesi in dacron da 20mm. Paziente dimesso in data odierna in
condizioni stabili.'''
stanza.download('it', verbose=False)
it_nlp = stanza.Pipeline('it', processors='tokenize,lemma,pos,depparse,ner',
verbose=False, use_gpu=False)
it_doc = it_nlp(IN_TXT)
# iterate through the Token objects
for T, token in enumerate(it_doc.iter_tokens(), start=1):
    token_id = 'T' + str(T)
    token_start = token.start_char
    token_end = token.end_char
    token_text = token.text
    print(f"{token_id}\t{token_start} {token_end} {token_text}")
# iterate through the Word objects
for sent in it_doc.sentences:
    for word in sent.words:
        print(f'word: {word.text}\t\t\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}')
Here is the documentation of these objects: https://stanfordnlp.github.io/stanza/data_objects.html
I just discovered the zip function, which returns an iterator of tuples in Python 3.
One caveat: zipping sentence.tokens against sentence.words only lines up while every token maps to exactly one word. Italian has multi-word tokens (e.g. the contraction "al" expands to the two words "a" and "il"), and from the first such token onward zip pairs each token with the wrong word. A robust way to iterate through the Words of a sentence together with their parent Token is to loop over the tokens and then over each token's words:
for sentence in it_doc.sentences:
    for token in sentence.tokens:
        for word in token.words:
            print(f"Text->{word.text}\tLemma->{word.lemma}\tStart->{token.start_char}\tStop->{token.end_char}")
(Equivalently, each Word carries a parent attribute pointing back to its Token, so word.parent.start_char works from the word side.)
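To see why naively zipping tokens and words goes wrong, here is a toy illustration using plain strings as stand-ins for Stanza's Token and Word objects (the texts are hypothetical, not actual pipeline output):

```python
# The contraction "al" is ONE token but expands to TWO words ("a", "il").
tokens = ["ricoverato", "al", "pronto", "soccorso"]
words = ["ricoverato", "a", "il", "pronto", "soccorso"]

# zip stops at the shorter sequence and shifts every pairing
# after the multi-word token.
pairs = list(zip(tokens, words))
print(pairs)
# → [('ricoverato', 'ricoverato'), ('al', 'a'),
#    ('pronto', 'il'), ('soccorso', 'pronto')]
# "pronto" and "soccorso" are paired with the wrong words,
# and the final word "soccorso" is silently dropped.
```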