Tags: python, dataframe, nlp, variable-assignment, spacy

Python: how to assign output from spacy to a list of tuples and then convert to a DataFrame?


I am trying to assign the printed output of a for-loop to a variable parsed_generics.

This is the loop and its printed output:

import spacy

nlp = spacy.load("en")
doc = nlp(generics)
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Aerobics Aerobics nsubj is
a form form attr is
physical exercise exercise pobj of
rhythmic aerobic exercise exercise dobj combines
stretching and strength training routines routines pobj with
the goal goal pobj with
all elements elements dobj improving
...

To assign that to a variable, this is what I have written:

nlp = spacy.load("en")
doc = nlp(generics)
for chunk in doc.noun_chunks:
    parsed_generics = (chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

But when I call parsed_generics, this is what I get:

('predators', 'predators', 'pobj', 'of')

I guess what I am expecting is a list of tuples:

[('Aerobics', 'Aerobics', 'nsubj', 'is'), ('a form', 'form', 'attr', 'is'), ('physical exercise', 'exercise', 'pobj', 'of'), ...]

I guess I have to set up an empty list above my for-loop, iterate over doc, and append to that list, but append takes only one argument and I have four values (chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text).

I eventually want to store this in a DataFrame.

Any advice or suggestions would be very much appreciated. Thank you in advance.


Solution

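  • You need to use append. You are overwriting parsed_generics on every iteration, so what you're seeing is only the tuple from the last iteration.

    Append each iteration's tuple to a list, then read the list after the loop. append takes a single argument, so wrap the four values in one tuple (note the double parentheses).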

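    import spacy

    # "en" is a spaCy 2.x shortcut; spaCy 3+ needs a full model name such as "en_core_web_sm".
    nlp = spacy.load("en")
    doc = nlp(generics)

    # One tuple per noun chunk
    result = []
    for chunk in doc.noun_chunks:
        result.append((chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text))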
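
For the DataFrame part of the question, a list of tuples like this can be passed straight to the pandas DataFrame constructor. The following is a minimal sketch, not a definitive implementation: it uses the first three rows printed in the question as a stand-in for result so that it runs without a spaCy model, and the column names are my own illustrative choice, not anything spaCy defines.

```python
import pandas as pd

# Stand-in for the list built by the loop: the first three rows
# printed in the question. In real use this would be `result`.
result = [
    ("Aerobics", "Aerobics", "nsubj", "is"),
    ("a form", "form", "attr", "is"),
    ("physical exercise", "exercise", "pobj", "of"),
]

# Assumed column names, one per tuple position (hypothetical, rename freely).
columns = ["text", "root_text", "root_dep", "root_head_text"]

# Each tuple becomes one row; each tuple position becomes one column.
df = pd.DataFrame(result, columns=columns)
print(df)
```

Keeping the rows as plain tuples until the loop has finished and building the DataFrame once is usually simpler than growing a DataFrame row by row inside the loop.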