python pandas dataframe series spacy

Lemmatization issue using spaCy with a pandas Series and DataFrame


I am working with text data of shape (14640, 16), using pandas and spaCy for preprocessing, but I am having trouble getting the lemmatized form of the text. When I work with a pandas Series instead (i.e. a DataFrame with only the text column), I run into a different issue.

Code: (Dataframe)

import spacy

nlp = spacy.load("en_core_web_sm")
df['parsed_tweets'] = df['text'].apply(lambda x: nlp(x))
df[:3]

Result: [image]

After this, I iterate over the parsed_tweets column to get the lemmatized data, but I get an error.

Code:

for token in df['parsed_tweets']:
  print(token.lemma_)

Error: [image]

Code: (Pandas Series)

df1['tweets'] = df['text']

nlp = spacy.load("en_core_web_sm")
for text in nlp.pipe(iter(df1), batch_size = 1000, n_threads=-1):
  print(text)

Error: [image]

Can someone help me with these errors? I tried other Stack Overflow solutions, but I can't get spaCy Doc objects that I can iterate over to get the tokens and their lemmas. What am I doing wrong?


Solution

  • You can get the lemmatized tokens directly by using a list comprehension inside the lambda function:
    
    df['parsed_tweets'] = df['text'].apply(lambda x: [y.lemma_ for y in  nlp(x)])
    


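    If you keep the spaCy Doc objects in the column instead, as in the question's df['text'].apply(lambda x: nlp(x)), each cell holds a whole Doc rather than a token:

    print(type(df['parsed_tweets'][0]))
    # output
    <class 'spacy.tokens.doc.Doc'>

    Looping over the column therefore yields Doc objects, so a second loop over each Doc is needed to reach the tokens and their lemma_ attribute:

    for doc in df['parsed_tweets']:
        for token in doc:
            print(token.lemma_)
    # output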
    play
    football
    i
    be
    work
    hard
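
For the Series attempt in the question, one likely cause of trouble is that iter(df1) iterates over a DataFrame's column labels rather than its texts. In addition, to my knowledge n_threads is no longer accepted by nlp.pipe in spaCy v3. The following is a minimal sketch of a batched alternative, not a definitive implementation. The helper name lemmatize_texts, the demo function, the sample rows, and the lemmas column are hypothetical names chosen for illustration, and it assumes the en_core_web_sm model is installed.

```python
from typing import Iterable, List


def lemmatize_texts(texts: Iterable[str], nlp, batch_size: int = 1000) -> List[List[str]]:
    """Return one list of lemmas per input text.

    `nlp` is assumed to be a loaded spaCy pipeline, i.e. an object whose
    `.pipe()` yields one Doc per text, each Doc iterating over tokens
    that carry a `.lemma_` attribute.
    """
    # nlp.pipe processes the texts in batches, which is usually faster
    # than calling nlp(text) once per row.
    return [
        [token.lemma_ for token in doc]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]


def demo() -> None:
    # Imports are local so the helper above has no hard dependencies.
    import pandas as pd
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Hypothetical sample rows standing in for the tweet data.
    df = pd.DataFrame({"text": ["playing football", "I am working hard"]})

    # Pass the text column itself: iterating a Series yields its values,
    # whereas iterating a DataFrame yields its column labels.
    df["lemmas"] = lemmatize_texts(df["text"], nlp)
    print(df)
```

Calling demo() runs the sketch end to end. Passing df["text"] rather than the whole DataFrame is the key difference from the question's loop.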