Tags: python, pandas, dataframe, nlp, spacy

Using nlp.pipe() in Spacy to get doc objects for Dataframe column


I am using spaCy's nlp.pipe() to get Doc objects for the text data in a pandas DataFrame column, but the parsed text (returned as "text" in the code below) has a length of only 32, even though the shape of the DataFrame is (14640, 16). Here is the data link if someone wants to read the data.

import spacy

nlp = spacy.load("en_core_web_sm")
for text in nlp.pipe(iter(df['text']), batch_size=1000, n_threads=-1):
    print(text)

len(text)

Result:

32

Can someone help me understand what is going on? What am I doing wrong?


Solution

  • According to the spaCy documentation for the Doc object here, the __len__ operator returns "the number of tokens in the document".

    The last text in your data is:

    >>> df['text'].values[-1]
    @AmericanAir we have 8 ppl so we need 2 know how many seats are on the next flight. Plz put us on standby for 4 people on the next flight?
    

    After running the nlp.pipe() method, this sentence is tokenized into 32 tokens, which is what len(text) is reporting: once the loop finishes, the variable text still refers to the last Doc yielded by the pipeline, so its length is the number of tokens in that document, not the number of rows in the DataFrame. To verify that, run the following code after len(text) and you will get the same result:

    >>> last_tokens = [token for token in text]
    >>> last_tokens
    [@AmericanAir, we, have, 8, ppl, so, we, need, 2, know, how, many, seats, are, on, the, next, flight, ., Plz, put, us, on, standby, for, 4, people, on, the, next, flight, ?]
    
    >>> len(last_tokens)
    32
    

    EDIT

    You can iterate over the tokens of each doc returned from the pipeline like so:

    nlp = spacy.load("en_core_web_sm")
    # each `text` yielded by nlp.pipe() is a Doc; iterating over it yields its tokens
    for text in nlp.pipe(iter(df['text']), batch_size=1000, n_threads=-1):
        for token in text:
            print(token)
        print('\n')
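
    If you also want to keep the Doc objects for every row (as the question title suggests), one option is to collect the output of nlp.pipe() into a list and store it in a new DataFrame column. This is only a minimal sketch, assuming df is the DataFrame from the question: the column name 'doc' is chosen just for illustration, and the n_threads argument is omitted since it is deprecated in newer spaCy versions.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # nlp.pipe() yields one Doc per row of df['text'], so collecting the
    # results gives exactly as many Doc objects as the DataFrame has rows.
    docs = list(nlp.pipe(df['text'], batch_size=1000))

    print(len(docs))      # 14640 -> number of rows / documents
    print(len(docs[-1]))  # 32    -> number of tokens in the last Doc

    # keep the Doc objects next to the original text (hypothetical column name)
    df['doc'] = docs

    Note that this materializes all 14,640 Doc objects in memory at once; for larger datasets it may be better to keep iterating lazily as in the loop above.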