I am working on text data with shape (14640, 16), using Pandas and spaCy for preprocessing, but I am having trouble getting the lemmatized form of the text. Moreover, if I work with a Pandas Series (i.e. a DataFrame with only the text column), I run into different issues as well.
Code: (DataFrame)
import spacy

nlp = spacy.load("en_core_web_sm")
df['parsed_tweets'] = df['text'].apply(lambda x: nlp(x))
df[:3]
After this I iterate over the parsed_tweets column to get the lemmatized data, but I get an error.
Code:
for token in df['parsed_tweets']:
    print(token.lemma_)
Code: (Pandas Series)
df1['tweets'] = df['text']
nlp = spacy.load("en_core_web_sm")
for text in nlp.pipe(iter(df1), batch_size=1000, n_threads=-1):
    print(text)
Can someone help me with these errors? I tried other Stack Overflow solutions, but I can't get a spaCy Doc object that I can iterate over to get the tokens and their lemmatized forms. What am I doing wrong?
# you can get the lemmatized tokens directly by running a list comprehension inside your lambda function
df['parsed_tweets'] = df['text'].apply(lambda x: [y.lemma_ for y in nlp(x)])
# with your original apply(lambda x: nlp(x)), each cell holds a whole Doc, not a Token
print(type(df['parsed_tweets'][0]))
# output
spacy.tokens.doc.Doc
# so you need a nested loop to reach the individual tokens and their lemmas
for i in range(df.shape[0]):
    for word in df['parsed_tweets'][i]:
        print(word.lemma_)
# output
play
football
i
be
work
hard
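For the nlp.pipe attempt in your question: if df1 is a DataFrame, iterating over it (or over iter(df1)) yields the column names, not the tweet texts, which is why you don't get the documents you expect. Below is a minimal sketch of one way to wire it up instead, assuming df['text'] holds plain strings; batch_size is kept from your snippet, while n_threads is left out because it is deprecated or ignored in recent spaCy versions.

import spacy

nlp = spacy.load("en_core_web_sm")

# pass the raw strings to nlp.pipe, not the DataFrame itself
texts = df['text'].astype(str).tolist()

# collect the lemmas for each tweet as a list of strings
df['lemmas'] = [
    [token.lemma_ for token in doc]
    for doc in nlp.pipe(texts, batch_size=1000)
]

print(df['lemmas'].head())

nlp.pipe streams the texts through the pipeline in batches, so it is generally faster than calling nlp(x) row by row inside apply.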