Tags: python, nlp, stanford-nlp

Obtaining data from both token and word objects in a Stanza Document / Sentence


I am using a Stanford Stanza pipeline on some (Italian) text.

The problem I'm grappling with is that I need data from BOTH the Token and Word objects.

While I'm able to access one or the other separately, I can't wrap my head around how to get data from both in a single loop over the Document -> Sentence hierarchy.

Specifically, I need some Word data (such as lemma, upos and head), but I also need the corresponding start and end positions, which as I understand it are found in token.start_char and token.end_char.

Here's my code to test what I've achieved:

import stanza
IN_TXT = '''Il paziente Rossi e' stato ricoverato presso il nostro reparto a seguito di accesso
  al pronto soccorso con diagnosi sospetta di aneurisma aorta
  addominale sottorenale. In data 12/11/2022 e' stato sottoposto ad asportazione dell'aneurisma
  con anastomosi aorto aortica con protesi in dacron da 20mm. Paziente dimesso in data odierna in 
  condizioni stabili.'''
stanza.download('it', verbose=False)
it_nlp = stanza.Pipeline('it', processors='tokenize,lemma,pos,depparse,ner',
                         verbose=False, use_gpu=False)
it_doc = it_nlp(IN_TXT)
# iterate through the Token objects
T = 0
for token in it_doc.iter_tokens():
    T += 1
    token_id = 'T' + str(T)
    token_start = token.start_char
    token_end = token.end_char
    token_text = token.text
    print(f"{token_id}\t{token_start} {token_end} {token_text}")
# iterate through the Word objects
for sent in it_doc.sentences:
    for word in sent.words:
        print(f'word: {word.text}\t\t\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}')

Here is the documentation of these objects: https://stanfordnlp.github.io/stanza/data_objects.html


Solution

  • I just discovered the built-in zip function, which returns an iterator of tuples in Python 3.

    Therefore, to iterate in parallel through the Words and Tokens of a sentence you can write:

    for sentence in it_doc.sentences:
        for t, w in zip(sentence.tokens, sentence.words):
            print(f"Text->{w.text}\tLemma->{w.lemma}\tStart->{t.start_char}\tStop->{t.end_char}")
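    One caveat worth keeping in mind: zip pairs items purely by position and stops at the shorter iterable. Italian has multi-word tokens (e.g. the contraction "del", which Stanza expands into the two syntactic words "di" and "il"), and in such sentences sentence.words is longer than sentence.tokens, so the pairs drift out of alignment. The effect can be seen with plain lists standing in for the two sequences (a minimal sketch, not real Stanza objects):

    ```python
    # Plain lists standing in for sentence.tokens and sentence.words:
    # the single token "del" expands to the two words "di" and "il".
    tokens = ["ricoverato", "presso", "del", "reparto"]
    words = ["ricoverato", "presso", "di", "il", "reparto"]

    # zip stops after the shorter list (4 tokens); from "del" onward
    # each token is paired with a word that does not belong to it.
    pairs = list(zip(tokens, words))
    print(pairs)
    # [('ricoverato', 'ricoverato'), ('presso', 'presso'),
    #  ('del', 'di'), ('reparto', 'il')]
    ```

    With Stanza itself, each Word object carries a reference to the Token it came from in word.parent, so word.parent.start_char and word.parent.end_char should stay correct even across multi-word tokens; check the data objects page linked above for the exact attributes in your Stanza version.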